We have developed and published a wide range of research results on fast syntactic parsing models, as well as research on the underlying properties of human language that are crucial for building efficient, cognitively inspired syntactic parsers. These results include:
- Quantitative studies of various statistical and syntactic factors that affect the difficulty and speed of computer parsing of human languages.
- Novel techniques for greedy transition-based parsing (the fastest family of parsing models previously available), including new dynamic oracles to increase accuracy without affecting speed, non-monotonic parsing techniques, semantic parsing models, and techniques to reduce the number of actions needed to parse a sentence.
- New architectures making exact inference for non-projective transition-based parsing (a highly accurate and flexible method for parsing) practically feasible for the first time. Previously, the computational requirements of known algorithms made these techniques too slow to be usable in practice.
- A new method to implement constituent parsing as sequence labeling, a standard machine learning task that is fast and easy to implement, taking advantage of the bounded depth of trees in real natural language usage. This produced the fastest parsing speeds reported for English and other languages by a wide margin. We have also generalized the approach to other parsing tasks, namely dependency parsing and discontinuous constituent parsing, again breaking reported speed records in both cases; created multitask models that combine several parsing tasks; and developed a unifying theory of transition-based and sequence-labeling parsers.
- A technique for speeding up parsers using memorized previous results.
- A cognitively inspired model based on splitting sentences into chunks to obtain faster parsers.
- A variant of the previously most accurate dependency parser (based on a recent neural network architecture called pointer networks) with a new algorithm to guide processing, which further increases its accuracy while making it twice as fast. We also adapted the same algorithm to discontinuous constituents (needed to accurately model the syntax of languages such as German), obtaining the best accuracy reported to date by a wide margin while remaining very fast. Both parsers were then combined into a single model, the first able to produce discontinuous constituents and dependencies at the same time, which advanced state-of-the-art accuracy on various benchmarks.
- A generic neural-network-based method to reduce discontinuous constituent parsing to continuous constituent parsing, so that existing constituent parsing algorithms designed for "easy" languages (such as English or Chinese) can now be applied to much harder cases requiring discontinuous constituents (e.g., German). The resulting models have accuracy on par with the state of the art in discontinuous parsing, but run considerably faster than existing approaches.
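
To illustrate the idea of constituent parsing as sequence labeling mentioned in the list above, the following is a minimal sketch, not the project's actual implementation. It assumes a simplified encoding in which each word receives a label made of the number of ancestors it shares with the next word and the category of their lowest common ancestor. The tree representation (nested tuples), the function names, and the `"NONE"` placeholder for the last word are all illustrative assumptions, and unary branches are ignored for brevity. Because tree depth is bounded in practice, the set of such labels stays small, so a standard sequence tagger could predict them.

```python
def leaf_paths(tree, address=(), ancestors=()):
    """Yield (word, ancestors) for each leaf, left to right.

    A tree is a tuple (label, child, child, ...); a leaf is a plain string.
    Each ancestor is stored as (address, label), where address is the tuple
    of child indices leading to that node from the root, so that two nodes
    with the same label are still told apart.
    """
    label, *children = tree
    ancestors = ancestors + ((address, label),)
    for k, child in enumerate(children):
        if isinstance(child, str):
            yield child, ancestors
        else:
            yield from leaf_paths(child, address + (k,), ancestors)


def encode(tree):
    """Turn a constituent tree into one (word, depth, category) label per word.

    depth    = number of ancestors shared with the next word
    category = label of the lowest ancestor shared with the next word
    The last word has no successor and gets the placeholder (0, "NONE").
    """
    leaves = list(leaf_paths(tree))
    labels = []
    for (word, path), (_, next_path) in zip(leaves, leaves[1:]):
        common = 0
        for a, b in zip(path, next_path):
            if a != b:
                break
            common += 1
        labels.append((word, common, path[common - 1][1]))
    labels.append((leaves[-1][0], 0, "NONE"))
    return labels


# Hypothetical toy tree for "She reads books".
tree = ("S", ("NP", "She"), ("VP", "reads", ("NP", "books")))
for item in encode(tree):
    print(item)
```

With labels of this kind, parsing a sentence amounts to assigning one label per word, with no explicit tree-building search at prediction time, which is where the speed advantage of the sequence-labeling formulation comes from.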
Taken together, these results constitute a breakthrough in parsing speed. They have been disseminated in the main conferences and journals in the field, among other venues, and all publications and source code are available in public repositories. We plan to exploit the results in downstream applications such as sentiment analysis or grammar checking.