Grammar-Based Robust Natural Language Processing

Final Report Summary - GRAMPLUS (Grammar-Based Robust Natural Language Processing)

The GramPlus project aimed to extend linguistic theory and its existing computational applications in several theoretical, computational, and applied directions using Combinatory Categorial Grammar (CCG). CCG is a “radically lexicalized” theory of grammar: all syntactic and semantic information specific to a particular language such as English or Hindi (for example, whether its verb is initial, medial or final in the sentence) resides in its lexicon of words such as nouns and verbs. A universal mechanism of combinatory rules projects both syntactic and semantic aspects of the lexicon onto the sentences of the language, including long-range semantic dependencies in constructions like relative clauses and complex conjunctions, which are vital in applications such as question-answering.
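To make the idea of a radically lexicalized grammar concrete, the following is a minimal sketch, not the project's parser. It assumes a toy lexicon and implements only the two simplest combinatory rules (forward and backward application), with no semantics and no statistical model. All names (`Atom`, `Functor`, `parse`, the toy words and lexicons) are illustrative assumptions. The point it shows is that word order is changed by editing only the verb's lexical category, while the rules stay fixed.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class Atom:
    """An atomic category such as S or NP."""
    name: str


@dataclass(frozen=True)
class Functor:
    """A category result/arg (argument to the right) or result\\arg (to the left)."""
    result: "Category"
    slash: str  # "/" or "\\"
    arg: "Category"


Category = Union[Atom, Functor]


def forward_apply(left: Category, right: Category) -> Optional[Category]:
    """Forward application: X/Y  Y  =>  X."""
    if isinstance(left, Functor) and left.slash == "/" and left.arg == right:
        return left.result
    return None


def backward_apply(left: Category, right: Category) -> Optional[Category]:
    """Backward application: Y  X\\Y  =>  X."""
    if isinstance(right, Functor) and right.slash == "\\" and right.arg == left:
        return right.result
    return None


def parse(words, lexicon):
    """CKY-style chart parse; returns the set of categories spanning all words."""
    n = len(words)
    chart = {(i, i + 1): {lexicon[w]} for i, w in enumerate(words)}
    for width in range(2, n + 1):
        for start in range(0, n - width + 1):
            end = start + width
            cell = set()
            for mid in range(start + 1, end):
                for left in chart[(start, mid)]:
                    for right in chart[(mid, end)]:
                        for rule in (forward_apply, backward_apply):
                            combined = rule(left, right)
                            if combined is not None:
                                cell.add(combined)
            chart[(start, end)] = cell
    return chart[(0, n)]


S, NP = Atom("S"), Atom("NP")

# Hypothetical toy lexicons: only the verb's category differs.
verb_medial = {"Mary": NP, "John": NP,
               "sees": Functor(Functor(S, "\\", NP), "/", NP)}   # (S\NP)/NP
verb_final = {"Mary": NP, "John": NP,
              "sees": Functor(Functor(S, "\\", NP), "\\", NP)}   # (S\NP)\NP

medial_ok = parse(["Mary", "sees", "John"], verb_medial)
final_ok = parse(["Mary", "John", "sees"], verb_final)
final_bad = parse(["Mary", "sees", "John"], verb_final)
```

Here `medial_ok` and `final_ok` each contain the sentence category `S`, while `final_bad` is empty: the verb-final toy lexicon rejects verb-medial order even though the rules are unchanged. A fuller CCG would add rules such as composition and type-raising, and would pair each category with a semantic interpretation.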

CCG links syntax and semantics very closely at every stage of derivation. It has been widely adopted for computational applications, including robust wide-coverage parsing with statistical models derived by machine learning from datasets such as the Penn Treebank. It is used particularly in tasks where semantic interpretation is required, such as question-answering or semantics-based parser induction.

Like all contemporary parsers, those based on CCG are limited by the “labeled data bottleneck”: current resources like the Penn Treebank are too small to provide really reliable parsers. The GramPlus project proposed a number of extensions to CCG itself and to the related computational applications. These included an extended robust semantics covering both logical operators like negation and distributional relations of paraphrase and entailment between content words and expressions; semi-supervised methods that use large amounts of unlabeled text to generalize parsers trained by supervised machine learning on the Treebank; and methods for inducing grammars and parsers for many languages from paired sentences and meaning representations, among others.

The results of the project include: successful parser generalization using a number of semi-supervised methods training on unlabeled text; new parsing techniques including semi-supervised supertaggers and incremental algorithms with state-of-the-art speed and accuracy; improved parsers for under-resourced languages including Hindi; combined logical and distributional semantics with state-of-the-art performance in application to question answering; new techniques for automatic semantic parser induction from sentences paired with database queries, which have been successfully applied in a psychologically and linguistically plausible model of child language learning on the basis of exposure to meaning-revealing context; a semantics for the discourse information implicit in English intonation; and a demonstration that musical harmony can be analysed using the same kind of CCG grammar, with the same parsing algorithm and statistical model.
These methods and results are for the most part independent of the specific grammatical approach used in the project, and are of general interest to a range of linguists, computational linguists, psychologists, and other cognitive scientists, as well as those interested in robust practical applications of Natural Language Processing.