Periodic Reporting for period 4 - FASTPARSE (Fast Natural Language Parsing for Large-Scale NLP)
Reporting period: 2021-08-01 to 2022-07-31
Most of the information circulating in today's society is in the form of written or spoken human language, which makes natural language processing (NLP) technologies a key asset for exploiting it. NLP can be used to break language barriers (machine translation), find required information (search engines, question answering), monitor public opinion (opinion mining), or digest large amounts of unstructured text into more convenient forms (information extraction, summarization), among other applications.
These and other NLP technologies rely on accurate syntactic parsing to extract or analyze the meaning of sentences. Unfortunately, state-of-the-art parsing algorithms existing before this project had high computational costs, processing less than a hundred sentences per second on standard hardware. While this was acceptable for working on small sets of documents, it was clearly prohibitive for large-scale processing, and thus constituted a major roadblock for the widespread application of NLP.
The goal of this project was to eliminate this bottleneck by developing fast parsers that are suitable for web-scale processing. To do so, FASTPARSE aimed to improve the speed of parsers on several fronts: by avoiding redundant calculations; by applying cognitively inspired models; and by exploiting regularities in human language to reduce the search space and operate faster.
The goal was achieved: FASTPARSE developed parsers for multiple syntactic formalisms and languages that operate in the range of a thousand sentences per second on consumer hardware while maintaining high accuracy, even for hard-to-parse languages that require expressive syntactic representations. These parsers can thus be used to power all kinds of web-scale NLP applications. The project's main results include:
- Quantitative studies of various statistical and syntactic factors that affect the difficulty and speed of computer parsing of human languages.
- Novel techniques for greedy transition-based parsing (the fastest family of parsing models previously available), including new dynamic oracles to increase accuracy without affecting speed, non-monotonic parsing techniques, semantic parsing models, and techniques to reduce the number of actions needed to parse a sentence.
- New architectures making exact inference for non-projective transition-based parsing (a highly accurate and flexible method for parsing) practically feasible for the first time. Previously, the computational requirements of known algorithms made these techniques too slow to be usable in practice.
- A new method that implements constituent parsing as sequence labeling, a standard machine learning task that is fast and easy to implement, taking advantage of the bounded depth of trees in real natural language usage. This produced the fastest parsing speeds reported for English and other languages by a wide margin. We have also generalized the approach to other parsing tasks (dependency parsing and discontinuous constituent parsing), again breaking reported speed records in both cases; created multitask models that combine several parsing tasks; and developed a unifying theory of transition-based and sequence-labeling parsers.
- A technique for speeding up parsers by memoizing and reusing previously computed parsing results.
- A cognitively-inspired model based on splitting sentences into chunks to obtain faster parsers.
- A variant of the previously most accurate dependency parser (based on a recent neural network architecture called pointer networks), with a new algorithm to guide processing that increases its accuracy further while making it twice as fast; as well as an adaptation of the same algorithm to discontinuous constituents (needed to accurately model the syntax of languages such as German), whose accuracy is the best reported to date by a wide margin while remaining very fast. Both parsers have since been combined into a single model, the first able to produce discontinuous constituents and dependencies at the same time, which advanced state-of-the-art accuracy on various benchmarks.
- A generic neural-network-based method to reduce discontinuous to continuous constituent parsing, so we can now employ existing constituent parsing algorithms designed for "easy" languages (such as English or Chinese) for much harder cases requiring discontinuous constituents (e.g. German). The resulting models have accuracy on par with the state of the art on discontinuous parsing, but run considerably faster than existing approaches.
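The sequence-labeling view of constituent parsing mentioned above can be illustrated with a toy sketch (a simplified assumption, not the project's actual encoding or implementation): each word receives a label recording how many tree levels it shares with the following word and the nonterminal at their lowest common ancestor, so that an off-the-shelf sequence tagger can predict trees.

```python
# Toy sketch: constituent parsing as sequence labeling. Word i is labeled
# with (number of ancestors shared with word i+1, nonterminal at their
# lowest common ancestor); a tagger predicting these labels implicitly
# predicts the tree. Simplified illustration only.

def ancestor_paths(tree, path=(), out=None):
    """Return (word, list-of-nonterminals-above-it) for each leaf.
    Trees are (label, children) pairs; a leaf is (POS, word-string)."""
    if out is None:
        out = []
    label, children = tree
    if isinstance(children, str):              # leaf: (POS, word)
        out.append((children, list(path)))     # POS tag excluded from path
    else:
        for child in children:
            ancestor_paths(child, path + (label,), out)
    return out

def encode(tree):
    """Compute one label per word from consecutive ancestor paths."""
    paths = ancestor_paths(tree)
    labels = []
    for (w1, p1), (_, p2) in zip(paths, paths[1:]):
        n = 0
        while n < len(p1) and n < len(p2) and p1[n] == p2[n]:
            n += 1                             # count shared ancestors
        labels.append((w1, n, p1[n - 1]))      # lowest common nonterminal
    labels.append((paths[-1][0], 0, "NONE"))   # sentence-final dummy label
    return labels

toy = ("S", [("NP", [("DT", "the"), ("NN", "dog")]),
             ("VP", [("VBZ", "barks")])])
print(encode(toy))   # [('the', 2, 'NP'), ('dog', 1, 'S'), ('barks', 0, 'NONE')]
```

Because labels are predicted in a single left-to-right pass, parsing runs at the speed of part-of-speech tagging, which is what makes this family of parsers so fast.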
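The memoization technique listed above can be sketched minimally as follows (`slow_parse` is a hypothetical stand-in for any real parser): web-scale corpora contain many verbatim repeats (boilerplate, headlines), so caching results means each distinct sentence is parsed only once.

```python
# Minimal sketch of speeding up a parser by reusing memorized results:
# duplicate sentences, which are frequent in web-scale text, are parsed
# only once. Illustrative only; not the project's implementation.

class MemoizedParser:
    def __init__(self, parse_fn):
        self.parse_fn = parse_fn   # the underlying (slow) parser
        self.cache = {}            # sentence -> previously computed parse
        self.calls = 0             # how many times the real parser ran

    def parse(self, sentence):
        if sentence not in self.cache:
            self.calls += 1
            self.cache[sentence] = self.parse_fn(sentence)
        return self.cache[sentence]

# Hypothetical stand-in for a real parser: here it just tokenizes.
def slow_parse(sentence):
    return sentence.split()

p = MemoizedParser(slow_parse)
corpus = ["the dog barks", "breaking news", "the dog barks"]
results = [p.parse(s) for s in corpus]
print(p.calls)   # 2: the duplicate sentence is parsed only once
```

A production version would bound the cache size (e.g. with an LRU policy) and could also memoize frequent sub-spans rather than whole sentences.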
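The speedup behind the chunk-based model can be seen with a back-of-the-envelope sketch (the comma-based boundary heuristic below is an illustrative assumption, not the project's cognitively motivated chunker): parsing k-token chunks with an O(n^3) parser costs roughly (n/k)·k^3 = n·k^2 instead of n^3.

```python
# Sketch of the chunking idea: splitting a sentence and parsing each
# chunk separately shrinks the worst-case work of a cubic-time parser.
# Splitting at commas is a toy stand-in for real chunk boundaries.

def chunk(tokens, boundary=","):
    chunks, current = [], []
    for tok in tokens:
        current.append(tok)
        if tok == boundary:
            chunks.append(current)
            current = []
    if current:
        chunks.append(current)
    return chunks

def cubic_cost(tokens):
    return len(tokens) ** 3        # worst-case work of an O(n^3) parser

sent = "yesterday , the dog barked , everyone heard it".split()
whole = cubic_cost(sent)                           # 9^3 = 729
chunked = sum(cubic_cost(c) for c in chunk(sent))  # 2^3 + 4^3 + 3^3 = 99
print(whole, chunked)
```

The remaining work is then to attach the chunk-level analyses to each other, which involves far fewer units than the original token count.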
Putting it all together, these results constitute a breakthrough in parsing speed. The results have been disseminated in the main conferences and journals in the field, among other venues, and all publications and source code are available in public repositories. We plan to exploit the results in downstream applications like sentiment analysis or grammar checking.
Overall, we have achieved major gains in parsing speed across languages (including those especially challenging for computational processing, such as languages with a high degree of non-projectivity or with complex morphology) and parser types (including constituent and dependency parsers). These gains make parsing feasible at the web scale without the need for massive computational resources, hence making natural language processing technologies more useful and widely accessible.