Skip to main content

Distributional Compositional Semantics for Text Processing

Final Report Summary - DISCOTEX (Distributional Compositional Semantics for Text Processing)

The notion of meaning has been a central topic of study in many disciplines including Philosophy, Linguistics, Cognitive Science, Computer Science, and Artifical Intelligence (AI). From an AI perspective, developing a computational theory of meaning --- i.e. one which can be represented in, and exploited by, a computer --- is crucial to developing computer programs which can display intelligent behaviour. In the DisCoTex project (Distributional Compositional Semantics for Text Processing) we have developed a new theory of meaning which can be used to induce meaning representations from large bodies of online text, and exploited by computers when processing such text.

DisCoTex is focused on a sub-branch of AI called Natural Language Processing (NLP, also known as Computational Linguistics), which is concerned with the automatic processing, analysis, and generation of natural language. The need to develop effective NLP technology is becoming more pressing as humans produce more information in the form of electronic text. Automatic tools are required for all of these applications to process and extract information from the textual data. In order to develop intelligent tools, the computer needs to acquire knowledge of the meanings of words and, crucially, requires a mechanism for combining the word meanings into meanings of whole phrases and sentences.

The achievements of the project are as follows:

o we have developed methods for inducing distributed word meanings from text, using techniques from neural-based machine learning (sometimes known as "Deep Learning");

o we have developed a whole theory of how to combine such distributed word meanings to obtain meanings of phrases and sentences;

o we have applied the theory to the construction of a natural language parser, which is a tool for automatically inferring the grammatical relations between words in a sentence.

We expect that the combination of machine learning and linguistics, which is a hallmark of the DisCoTex project, will be used in the future to develop the next generation of intelligent text processing software.