CORDIS - EU research results

Fast Natural Language Parsing for Large-Scale NLP

Periodic Reporting for period 4 - FASTPARSE (Fast Natural Language Parsing for Large-Scale NLP)

Reporting period: 2021-08-01 to 2022-07-31

The popularization of information technology and the Internet has resulted in an unprecedented growth in the scale at which individuals and institutions generate, communicate and access information. In this context, the effective leveraging of the vast amounts of available data to discover and address people's needs is a fundamental problem of modern societies.

Since most of this circulating information is in the form of written or spoken human language, natural language processing (NLP) technologies are a key asset for this crucial goal. NLP can be used to break language barriers (machine translation), find required information (search engines, question answering), monitor public opinion (opinion mining), or digest large amounts of unstructured text into more convenient forms (information extraction, summarization), among other applications.

These and other NLP technologies rely on accurate syntactic parsing to extract or analyze the meaning of sentences. Unfortunately, state-of-the-art parsing algorithms existing before this project had high computational costs, processing less than a hundred sentences per second on standard hardware. While this was acceptable for working on small sets of documents, it was clearly prohibitive for large-scale processing, and thus constituted a major roadblock for the widespread application of NLP.

The goal of this project was to eliminate this bottleneck by developing fast parsers that are suitable for web-scale processing. To do so, FASTPARSE aimed to improve the speed of parsers on several fronts: by avoiding redundant calculations; by applying cognitively-inspired models; and by exploiting regularities in human language to reduce the search space and operate faster.

The goal was achieved: FASTPARSE developed parsers for multiple syntactic formalisms and languages that operate in the range of a thousand sentences per second on consumer hardware while maintaining high accuracy, even for hard-to-parse languages that require expressive syntactic representations, and that can thus be used to power all kinds of web-scale NLP applications.

We have developed and published a wide range of research results on fast syntactic parsing models, as well as research studying underlying properties of human language that are crucial for building efficient, cognitively-inspired syntactic parsers. These results include:

- Quantitative studies of various statistical and syntactic factors that affect the difficulty and speed of computer parsing of human languages.

- Novel techniques for greedy transition-based parsing (the fastest family of parsing models previously available), including new dynamic oracles to increase accuracy without affecting speed, non-monotonic parsing techniques, semantic parsing models, and techniques to reduce the number of actions needed to parse a sentence (the first code sketch after this list illustrates the general parsing style).

- New architectures making exact inference for non-projective transition-based parsing (a highly accurate and flexible method for parsing) practically feasible for the first time. Previously, the computational requirements of known algorithms made these techniques too slow to be usable in practice.

- A new method to implement constituent parsing as sequence labeling, a standard machine learning task which is fast and easy to implement, taking advantage of the bounded depth of trees in real natural language usage (see the second sketch after this list). This produced the fastest parsing speeds reported for English and other languages by a wide margin. We have also generalized the approach to other parsing tasks, namely dependency parsing and discontinuous constituent parsing, again breaking reported speed records in both cases; created multitask models that combine several parsing tasks; and developed a unifying theory of transition-based and sequence-labeling parsers.

- A technique for speeding up parsers by reusing memoized results from previous computations (see the third sketch after this list).

- A cognitively-inspired model based on splitting sentences into chunks to obtain faster parsers.

- A variant of the previously most accurate dependency parser (a parser based on a recent neural network architecture called pointer networks) with a new algorithm to guide processing, which increases its accuracy even further while making it twice as fast. We also adapted the same algorithm to discontinuous constituents (needed to accurately model the syntax of languages such as German), achieving the best accuracy reported to date by a wide margin while remaining very fast. Both parsers have since been combined into a single model, the first able to produce discontinuous constituents and dependencies at the same time, which advanced state-of-the-art accuracy in various benchmarks.

- A generic neural-network-based method to reduce discontinuous constituent parsing to continuous constituent parsing, so that existing constituent parsing algorithms designed for "easy" languages (such as English or Chinese) can now be employed for much harder cases requiring discontinuous constituents (e.g. German). The resulting models have accuracy on par with the state of the art on discontinuous parsing, but run considerably faster than existing approaches.
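
To make the transition-based approach above concrete, here is a minimal sketch of a greedy arc-standard dependency parser in Python. It illustrates the general parsing style only, not the project's actual implementation: the `score` function is a hypothetical stand-in for the trained models, dynamic oracles and action-reduction techniques described above.

```python
SHIFT, LEFT_ARC, RIGHT_ARC = "shift", "left-arc", "right-arc"

def legal(action, stack, buffer):
    # SHIFT needs input left; both arc actions need two items on the stack.
    return len(buffer) > 0 if action == SHIFT else len(stack) >= 2

def score(action, stack, buffer):
    # Hypothetical stand-in for a trained classifier over stack/buffer
    # features; this trivial heuristic just keeps the sketch runnable
    # by preferring arc actions whenever they are legal.
    return 0.0 if action == SHIFT else 1.0

def parse(words):
    """Greedily parse `words` into a list of (head, dependent) arcs.
    A sentence of n words is parsed in exactly 2n - 1 transitions,
    which is what makes this family of parsers so fast."""
    stack, buffer, arcs = [], list(range(len(words))), []
    while buffer or len(stack) > 1:
        candidates = [a for a in (SHIFT, LEFT_ARC, RIGHT_ARC)
                      if legal(a, stack, buffer)]
        action = max(candidates, key=lambda a: score(a, stack, buffer))
        if action == SHIFT:
            stack.append(buffer.pop(0))
        elif action == LEFT_ARC:          # top becomes head of second-top
            dependent = stack.pop(-2)
            arcs.append((stack[-1], dependent))
        else:                             # RIGHT_ARC: second-top heads top
            dependent = stack.pop()
            arcs.append((stack[-1], dependent))
    return arcs

print(parse(["economic", "news", "had", "little", "effect"]))
```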
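
The sequence-labeling view of constituent parsing can likewise be illustrated with a small encoder. The sketch below uses a simplified absolute depth encoding (the published method uses a relative encoding and additional machinery, e.g. for unary chains); tree nodes are `(label, children)` tuples and leaves are plain strings, a representation chosen here purely for illustration.

```python
def leaf_paths(node, prefix=(), out=None):
    """Collect, for each word in left-to-right order, its path of
    (node identity, nonterminal) pairs from the root down."""
    if out is None:
        out = []
    if isinstance(node, str):                  # leaf: a word
        out.append((node, prefix))
    else:
        label, children = node
        step = prefix + ((id(node), label),)   # identities keep equal
        for child in children:                 # labels from colliding
            leaf_paths(child, step, out)
    return out

def encode(tree):
    """Turn a constituent tree into one label per word (except the
    last): (word, d, c), where d is the number of ancestors the word
    shares with the next word and c is the nonterminal at their
    lowest common ancestor. A fast sequence-labeling model can then
    be trained to predict these labels directly."""
    leaves = leaf_paths(tree)
    labels = []
    for (word, p), (_, q) in zip(leaves, leaves[1:]):
        d = 0
        while d < min(len(p), len(q)) and p[d] == q[d]:
            d += 1
        labels.append((word, d, p[d - 1][1]))
    return labels

tree = ("S", [("NP", ["she"]),
              ("VP", ["eats", ("NP", ["apples"])])])
print(encode(tree))   # [('she', 1, 'S'), ('eats', 2, 'VP')]
```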
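
Finally, the idea of reusing previous results can be sketched in a few lines: large corpora contain many verbatim repeated sentences, so a cache keyed by the token sequence avoids re-parsing them. This shows only the general idea; `expensive_parse` is a hypothetical stand-in for any of the parsers above, and the project's actual technique is more elaborate.

```python
from functools import lru_cache

def expensive_parse(tokens):
    # Hypothetical stand-in for a real (slow) parser.
    return tuple(enumerate(tokens))

@lru_cache(maxsize=100_000)
def parse_cached(tokens):
    # `tokens` must be hashable, hence a tuple rather than a list.
    return expensive_parse(tokens)

parse_cached(("the", "cat", "sat"))  # computed and cached
parse_cached(("the", "cat", "sat"))  # served from the cache, no re-parse
```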

Putting it all together, these results constitute a breakthrough in parsing speed. The results have been disseminated in the main conferences and journals in the field, among other venues, and all publications and source code are available in public repositories. We plan to exploit the results in downstream applications like sentiment analysis or grammar checking.

We have substantially advanced the state of the art on several fronts. Firstly, in terms of the speed of parsing algorithms, the main goal of the project: for example, we presented greedy parsers that need fewer actions than previous ones to analyze a sentence; we made dynamic programming for non-projective transition-based parsing feasible by presenting an actual implementation, where previous approaches were purely theoretical because they were too slow to be usable in practice; and we introduced a novel parsing paradigm based on reformulating the problem as sequence labeling, yielding significantly better speeds than all previously existing methods. Secondly, in terms of understanding the underlying difficulties of parsing and human language processing, by means of studies of quantitative properties of human syntax. Additionally, we advanced state-of-the-art accuracy both in dependency parsing and in phrase-structure parsing with discontinuous constituents. In dependency parsing, our left-to-right pointer-network parsers achieved the highest accuracy on the standard parsing benchmarks for English and various other languages, while being twice as fast as their predecessors. In discontinuous constituent parsing, our model outperformed the previous state of the art for German (the commonly used benchmark language for this task) by a wide margin of over 4 percentage points.

Overall, we have achieved major gains in parsing speed across languages (including those especially challenging for computational processing, such as languages with a high degree of non-projectivity or with complex morphology) and parser types (including constituent and dependency parsers), making parsing feasible at web scale without the need for massive computational resources, and hence making natural language processing technologies more useful and widely accessible.