Using Machine Learning for Text Processing in Natural Languages


Today we will touch on an interesting topic: machine learning for natural languages. Very large investments are being made in this area, and many different tasks are being solved. The topic attracts the attention not only of industry but also of the scientific community.

Can the machine think?

Researchers link the analysis of natural languages to a fundamental question: can a machine think? The famous philosopher René Descartes gave an unambiguously negative answer, which is not surprising given the level of technology in the 17th century. Descartes believed that a machine cannot think and will never learn to. A machine will never be able to communicate with a person in natural speech. Even if we explain to it how to use and pronounce words, the result will still be memorized phrases and standard answers; the machine does not go beyond them.

Turing test

Many years have passed since then, technology has changed a great deal, and in the twentieth century the question became relevant again. In 1950 the well-known scientist Alan Turing questioned the claim that a machine cannot think and proposed his famous test to check it.

The idea of the test, according to legend, is based on a game that was played at student parties. Two people from the group, a man and a woman, went to separate rooms, and the remaining people communicated with them by notes. The players' task was to guess whom they were dealing with: the man or the woman. The man and the woman each pretended to be the other in order to mislead the players. Turing made a fairly simple modification: he replaced one of the hidden players with a computer and asked the participants to work out whom they were interacting with, a person or a machine.

The Turing test was invented more than half a century ago. Programmers have repeatedly claimed that their creations passed it, and each time the claim was met with objections and doubts about whether that was really so. Officially, there is no reliable account of anyone passing the main Turing test, although some of its variations have indeed been passed.

Georgetown Experiment

In 1954 the Georgetown experiment was held. It demonstrated a system that automatically translated about 60 sentences from Russian into English. The organizers were sure that within three to five years they would reach a global goal: machine translation would be completely solved. They failed miserably. Twelve years later the program was shut down, and no one had come close to solving the problem.

From today's standpoint, the main problem was the small number of sentences: with so few it is almost impossible to solve the task. Had the experimenters worked with 60 thousand or perhaps even 6 million sentences, they would have had a chance.

First chat bots

In the 1960s the first chat bots appeared. They were very primitive and mostly paraphrased what the other person said to them. Modern chat bots have not gone far from their ancestors. Even the famous chat bot Eugene Goostman, which is believed to have passed one version of the Turing test, did so not because of clever algorithms. What helped was acting skill: the authors had thought out its personality well.

Formal ontologies, Chomsky grammar theory

Then came the era of formal methods. It was a global trend: scientists tried to formalize everything, building formal models, ontologies, concepts, relations, general rules of syntactic analysis and a universal grammar. This is when the theory of Chomsky grammars arose. It all looked very elegant but never reached adequate practical application, because it required a great deal of painstaking manual work. Therefore, in the 1980s attention shifted to another class of systems: machine learning algorithms and so-called corpus linguistics.

Machine learning and corpus linguistics

What is the main idea of corpus linguistics? We assemble a corpus, a sufficiently large collection of documents, and then use machine learning and statistical analysis to build a system that solves our task.
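As a rough illustration of the corpus idea, the sketch below collects simple statistics (which word follows which) from a tiny made-up corpus and uses them to predict the next word. The names `train_bigrams`, `most_likely_next` and `toy_corpus` are invented for this example, and a real system would need a far larger corpus and a far better model.

```python
from collections import Counter, defaultdict


def train_bigrams(corpus):
    """Count, for every word, which words follow it in the corpus."""
    followers = defaultdict(Counter)
    for document in corpus:
        words = document.lower().split()
        # Pair each word with the word that comes right after it.
        for current, nxt in zip(words, words[1:]):
            followers[current][nxt] += 1
    return followers


def most_likely_next(followers, word):
    """Return the most frequent follower of `word`, or None if unseen."""
    counts = followers.get(word.lower())
    if not counts:
        return None
    return counts.most_common(1)[0][0]


# Hypothetical three-document corpus, far too small for real use.
toy_corpus = [
    "the machine translates the text",
    "the machine reads the text",
    "the machine translates the sentence",
]

model = train_bigrams(toy_corpus)
print(most_likely_next(model, "machine"))  # translates
```

No rule about word order was written by hand here; everything the model "knows" comes from counting in the corpus, which is why the size of the corpus matters so much.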

In the 1990s this field received a powerful push from the growth of the World Wide Web, with its huge amount of poorly structured text that had to be searched and cataloged. In the 2000s natural language analysis began to be used not only for web search but for a variety of other tasks. Large text datasets and many tools appeared, and companies began to invest heavily in research.

Modern trends in machine learning

What is happening now in machine learning for natural languages? The main trend is the active use of unsupervised learning models, which reveal the structure of a text or a corpus without predefined rules. Many large corpora of varying quality, both labeled and unlabeled, are openly available. Crowdsourcing-based models have appeared: we not only try to understand something with the machine but also bring in people who, for a small fee, determine, for example, which language a text is written in. The idea of formal ontologies has begun to revive, but now ontologies revolve around crowdsourced knowledge bases, in particular those based on Linked Open Data. This is a whole set of knowledge bases; at its center is DBpedia, a machine-readable version of Wikipedia, which is also filled in by crowdsourcing: people all over the world can add to it.

About six years ago NLP (natural language processing) mostly borrowed techniques and methods from other fields, but over time it began to export them: methods developed for natural language analysis are now successfully applied elsewhere. And of course, where would we be without deep learning? Deep neural networks are now being applied to natural language analysis as well, so far with varying success.

What is NLP? It cannot be called one specific task; NLP is a huge range of tasks at different levels. By level of detail, for example, they can be broken down as follows.

At the signal level, we need to convert the input signal. It may be speech, handwriting or scanned printed text, and it has to be converted into a sequence of characters that the machine can work with.

Next comes the word level. Our task is to recognize that there is a word at all, to analyze its morphology and to correct errors, if any. Slightly higher is the level of word combinations (phrases). Here we need to determine parts of speech, and the problem of named entity recognition arises. In some languages even extracting words is not trivial. For example, German compounds are written without spaces between their parts, and we need to be able to pick the individual words out of one long string.
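The word-extraction problem can be sketched with a dictionary-based splitter. This is only a minimal illustration, assuming we already have a vocabulary of known words; `split_compound` and `toy_vocabulary` are names made up for the example, and real German compound splitting also has to handle linking elements and ambiguous splits.

```python
from functools import lru_cache


def split_compound(word, vocabulary):
    """Split `word` into known vocabulary items, or return None if impossible."""
    word = word.lower()

    @lru_cache(maxsize=None)
    def split_from(start):
        # Reaching the end of the word means everything before it was covered.
        if start == len(word):
            return ()
        # Try longer candidates first so whole words win over fragments.
        for end in range(len(word), start, -1):
            piece = word[start:end]
            if piece in vocabulary:
                rest = split_from(end)
                if rest is not None:
                    return (piece,) + rest
        return None

    result = split_from(0)
    return list(result) if result is not None else None


# Hypothetical miniature vocabulary; a real one would hold many thousands of words.
toy_vocabulary = frozenset({"donau", "dampf", "schiff", "fahrt"})

print(split_compound("Donaudampfschifffahrt", toy_vocabulary))
# ['donau', 'dampf', 'schiff', 'fahrt']
```

The function backtracks when a chosen piece leaves a remainder that cannot be split, and returns None when no full split exists.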

Phrases combine into sentences. We need to find sentence boundaries, sometimes parse the sentence, try to formulate an answer if the sentence is a question, and resolve word ambiguity where required.

Note that these tasks go in two directions: analysis and generation. In particular, once we have found the answer to a question, we need to generate a sentence that answers it and reads naturally to the person.

Sentences are grouped into paragraphs, and here the question arises of resolving references and establishing relations between objects mentioned in different sentences.

At the paragraph level we can solve new problems: analyze the emotional tone of the text or determine which language it is written in.
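Language identification, for instance, can be approximated very crudely by counting frequent function words. The sketch below is an assumption-laden toy: the `STOPWORDS` lists are tiny hand-picked samples and `guess_language` is a name invented for this example. Practical systems usually rely on character n-gram statistics learned from a corpus.

```python
# Tiny hand-picked samples of frequent function words (an assumption of this sketch).
STOPWORDS = {
    "english": {"the", "and", "is", "of", "to", "in"},
    "german": {"der", "die", "und", "ist", "das", "nicht"},
    "french": {"le", "la", "et", "est", "les", "des"},
}


def guess_language(text):
    """Return the language whose stop words occur most often, or None."""
    # Naive whitespace tokenization; punctuation attached to words is not stripped.
    words = text.lower().split()
    scores = {
        language: sum(word in stopwords for word in words)
        for language, stopwords in STOPWORDS.items()
    }
    best = max(scores, key=scores.get)
    # With no stop-word hits at all we prefer to give no answer.
    return best if scores[best] > 0 else None


print(guess_language("der Hund ist nicht das Problem"))  # german
```

Short texts and texts without any listed function word are where this approach breaks down first.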

Paragraphs form a document. The most interesting tasks arise at this level, in particular semantic analysis (what is the document about?), automatic annotation and summarization, translation, and document generation. Probably everyone has heard of SCIgen, the well-known generator of scientific papers that produced the article "Rooter: A Methodology for the Typical Unification of Access Points and Redundancy". SCIgen is regularly used to test the editorial boards of scientific journals.

Finally, there are tasks that concern the corpus as a whole, in particular deduplicating a huge collection of documents, searching it for information, and so on.
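One common way to approach deduplication is to compare documents by their overlapping word sequences (shingles). The sketch below is a minimal version of that idea with invented names (`shingles`, `jaccard`, `deduplicate`) and an arbitrarily chosen threshold of 0.5. It compares every document with every kept one, so for a truly huge corpus it would have to be replaced by something like MinHash with locality-sensitive hashing.

```python
def shingles(text, size=3):
    """Return the set of overlapping word n-grams (shingles) of a text."""
    words = text.lower().split()
    if len(words) < size:
        return {tuple(words)}
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: size of intersection over size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def deduplicate(documents, threshold=0.5):
    """Keep a document only if it is not too similar to any document kept so far."""
    kept, kept_shingles = [], []
    for document in documents:
        current = shingles(document)
        if all(jaccard(current, other) < threshold for other in kept_shingles):
            kept.append(document)
            kept_shingles.append(current)
    return kept


# Hypothetical miniature corpus: the second document is a near-duplicate of the first.
docs = [
    "the quick brown fox jumps over the lazy dog",
    "the quick brown fox jumps over the lazy cat",
    "machine translation is a hard problem",
]

print(deduplicate(docs))
```

Here the first two documents share six of their eight distinct shingles (similarity 0.75), so only the first of them and the unrelated third document are kept.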