Periodic Reporting for period 2 - FASTPARSE (Fast Natural Language Parsing for Large-Scale NLP)
Reporting period: 2018-08-01 to 2020-01-31
Most of the information circulating in today's digital society is in the form of written or spoken human language, which makes natural language processing (NLP) technologies a key asset for putting it to use. NLP can be used to break language barriers (machine translation), find required information (search engines, question answering), monitor public opinion (opinion mining), or digest large amounts of unstructured text into more convenient forms (information extraction, summarization), among other applications.
These and other NLP technologies rely on accurate syntactic parsing to extract or analyze the meaning of sentences. Unfortunately, current state-of-the-art parsing algorithms have high computational costs, processing less than a hundred sentences per second on standard hardware. While this is acceptable for working on small sets of documents, it is clearly prohibitive for large-scale processing, and thus constitutes a major roadblock for the widespread application of NLP.
The goal of this project is to eliminate this bottleneck by developing fast parsers that are suitable for web-scale processing. To do so, FASTPARSE will improve the speed of parsers on several fronts: by avoiding redundant calculations through the reuse of intermediate results from previous sentences; by applying a cognitively-inspired model to compress and recode linguistic information; and by exploiting regularities in human language to find patterns that the parsers can take for granted, avoiding their explicit calculation. The joint application of these techniques will result in much faster parsers that can power all kinds of web-scale NLP applications.
- Quantitative studies of crossing dependencies (a syntactic phenomenon that makes language harder for computers to parse) in different languages and of their origin, as well as studies on the length and direction of syntactic dependencies (properties that also affect the difficulty and speed of parsing) and their influence on parsing with different models.
- Novel techniques for greedy transition-based parsing (the fastest family of parsing models currently available), including new dynamic oracles to increase the accuracy of greedy algorithms without affecting their speed, non-monotonic parsing techniques, semantic parsing models, and algorithms that reduce the length of transition sequences needed to parse a sentence.
- New architectures that make exact inference for non-projective transition-based parsing (a highly accurate and flexible method for parsing) practically feasible for the first time. Previously, the computational complexity requirements of known algorithms for this purpose made these techniques too slow to be applicable in practice.
- A new method that casts constituent parsing as sequence labeling, a standard machine learning task usually considered simpler and faster to implement, by taking advantage of the bounded depth of constituent trees in real natural language usage. This method has produced the fastest parsing speeds reported for English and other languages by a wide margin. We have also applied a generalization of the same method to dependency parsing, again breaking reported speed records by a large margin, and used multitask learning so that constituent and dependency parsing can benefit from each other for the first time.
- A technique for speeding up parsers using memorized previous results.
- A cognitively-inspired model based on splitting sentences into chunks to obtain faster parsers.
- A variant of the previously most accurate dependency parser (a parser based on a recent neural network architecture called pointer networks) which uses a different algorithm to guide processing, increasing its accuracy even more while making it twice as fast.
- A parser, also based on pointer networks, for languages with discontinuous constituents (such as German) whose accuracy is the best reported to date by a wide margin, while being very fast.
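To make the notion of crossing dependencies concrete, the sketch below counts them in a dependency tree given as a head array. This is illustrative code, not the project's; it just applies the standard definition that two arcs cross when their endpoints interleave.

```python
def count_crossings(heads):
    """Count pairs of crossing arcs in a dependency tree.

    heads[i] is the head of word i+1 (words are 1-based;
    a head of 0 marks the root attachment).
    """
    arcs = [(min(i, h), max(i, h))
            for i, h in enumerate(heads, start=1) if h != 0]
    crossings = 0
    for a in range(len(arcs)):
        for b in range(a + 1, len(arcs)):
            (l1, r1), (l2, r2) = arcs[a], arcs[b]
            # Two arcs cross when their endpoints interleave.
            if l1 < l2 < r1 < r2 or l2 < l1 < r2 < r1:
                crossings += 1
    return crossings

print(count_crossings([2, 0, 2]))     # projective tree -> 0
print(count_crossings([3, 4, 0, 3]))  # arcs (1,3) and (2,4) cross -> 1
```

A tree with zero crossings is projective; counting crossings is the usual first step in quantitative studies of non-projectivity.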
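The transition-based family mentioned above builds a tree through a short sequence of shift/reduce actions. Below is a minimal arc-standard sketch driven by a static oracle over a gold projective tree; a real parser replaces the oracle with a learned classifier, and all names here are illustrative rather than the project's code.

```python
def arc_standard_parse(heads_gold):
    """Oracle-driven arc-standard parsing of a projective gold tree.

    heads_gold[i] is the gold head of word i+1 (1-based words, 0 = root).
    Returns the arcs (dependent, head) and the transition sequence.
    """
    n = len(heads_gold)
    # How many dependents each node still has to collect.
    pending = [0] * (n + 1)
    for h in heads_gold:
        pending[h] += 1
    stack, buffer = [0], list(range(1, n + 1))  # 0 is the artificial root
    arcs, transitions = [], []
    while buffer or len(stack) > 1:
        if len(stack) >= 2:
            s1, s0 = stack[-2], stack[-1]
            # LEFT-ARC: s1 depends on s0 and has collected all its children.
            if s1 != 0 and heads_gold[s1 - 1] == s0 and pending[s1] == 0:
                stack.pop(-2)
                arcs.append((s1, s0))
                pending[s0] -= 1
                transitions.append("LA")
                continue
            # RIGHT-ARC: s0 depends on s1 and is complete.
            if heads_gold[s0 - 1] == s1 and pending[s0] == 0:
                stack.pop()
                arcs.append((s0, s1))
                pending[s1] -= 1
                transitions.append("RA")
                continue
        if not buffer:
            raise ValueError("input tree must be projective")
        stack.append(buffer.pop(0))  # SHIFT
        transitions.append("SH")
    return arcs, transitions

arcs, trans = arc_standard_parse([2, 0, 2])
print(arcs)   # [(1, 2), (3, 2), (2, 0)]
print(trans)  # ['SH', 'SH', 'LA', 'SH', 'RA', 'RA']
```

The speed of this family comes from the linear number of actions: a sentence of n words is parsed in roughly 2n transitions, each decided greedily.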
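Parsing as sequence labeling hinges on encoding a tree as one discrete label per word, so that an off-the-shelf tagger can predict the tree. A minimal sketch for dependency trees using relative head offsets is shown below; this is a deliberate simplification (published encodings use, for example, offsets relative to words with the same part of speech), and the function names are illustrative.

```python
def encode(heads):
    """Encode a dependency tree as one label per word: the relative
    offset of its head, with 0 reserved for the root attachment."""
    return [0 if h == 0 else h - i for i, h in enumerate(heads, start=1)]

def decode(labels):
    """Recover the head array from the per-word labels."""
    return [0 if l == 0 else i + l for i, l in enumerate(labels, start=1)]

heads = [2, 0, 2, 3]
labels = encode(heads)
print(labels)                    # [1, 0, -1, -1]
assert decode(labels) == heads   # the encoding is lossless
```

Because each label is predicted independently by a tagger, the tree is obtained in a single pass with no search, which is where the large speed gains come from.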
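The reuse idea can be illustrated, in a much simplified form, by memoizing whole-sentence parses: the project's techniques reuse intermediate results at a finer granularity, and `expensive_parse` below is only a toy stand-in.

```python
from functools import lru_cache

calls = 0

def expensive_parse(sentence):
    """Toy stand-in for a costly parsing routine: attaches each word
    to its predecessor, and the first word to the root (head 0)."""
    global calls
    calls += 1
    words = sentence.split()
    return tuple(range(len(words)))

@lru_cache(maxsize=100_000)
def parse_cached(sentence):
    """Return a cached parse when the same sentence was seen before."""
    return expensive_parse(sentence)

parse_cached("the cat sleeps")
parse_cached("the cat sleeps")  # served from the cache
print(calls)                    # 1: the parser ran only once
```

In large web corpora, repeated sentences and fragments are frequent enough that even this coarse form of reuse avoids a measurable amount of work.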
During the rest of the project, we expect to continue to improve the speed of syntactic parsers by combining knowledge from linguistics, cognitive science and computer science. We will finish developing the techniques to speed up parsers based on reuse of previous results and on chunking, which will provide further speed gains when added to the models developed until now. We will also refine and improve the technique of parsing as sequence labeling to further optimize its speed, extend it to discontinuous constituents and semantic parsing, and continue exploiting regularities in constituent trees to further reduce the search space of parsers and improve their efficiency.
Overall, we intend to continue achieving major gains in parsing speed across different languages (including those especially challenging for computational processing, such as languages with a high degree of non-projectivity or with complex morphology) and parser types (including constituent and dependency parsers). These gains will make parsing feasible at the web scale without the need for massive computational resources, hence making natural language processing technologies more useful and widely accessible.