Grammar Assistance Using Syntactic Structures: Fast parsing with formal grammars deployed for Spanish grammar coaching.

Project Information

GAUSS

Grant agreement ID: 101063104

DOI

10.3030/101063104

Project closed

EC signature date 28 June 2022

Start date 1 August 2022

End date 31 July 2024

Funded under

Marie Skłodowska-Curie Actions (MSCA)

Total cost

No data

EU contribution

€ 165 312,96

Coordinated by

UNIVERSIDADE DA CORUNA
Spain

Periodic Reporting for period 1 - GAUSS (Grammar Assistance Using Syntactic Structures: Fast parsing with formal grammars deployed for Spanish grammar coaching.)

Reporting period: 2022-08-01 to 2024-07-31

The Grammar Assistance Using Syntactic Structures (GAUSS) project is concerned with a new, faster parsing technology for grammar coaching. For this purpose, the project developed a model of Spanish grammar that targets specific learner constructions. The grammar can therefore parse learner constructions and assign them an informative structure. Such structures can then be used in grammar coaching. Automatic grammar coaching helps people write more like a native speaker of a language would, thus helping them navigate around biases associated with language. This is important for (i) finding a job and counterbalancing latent discrimination in any given society, in the case of major languages like Spanish; and (ii) reinforcing the understanding that each language has a systematic grammar in its own right, in the case of minority languages.

The project focuses on pushing forward the state of the art of grammar parsing (using Spanish as the domain) in terms of accuracy and performance. In other words, the project aims to present a model of Spanish grammar which yields more correct structures and does so faster. Spanish is one of the most widely spoken languages in the world and one of the languages of the European Union, which makes it one of the most impactful domains to choose.

Research Objectives (RO):

RO1: Fast HPSG parsing for realistic long sentences in Spanish.
RO2: Spanish error productions formalized in a version of Spanish Resource Grammar.
RO3: Empirical integration of RO1-RO2.

Work Package 1 – Make HPSG parsing faster on long sentences
• Progress: Good/satisfactory progress
• Summary of activities: I have pursued three methodologies: (1) supertagging, which is learning to filter out the less probable interpretations of a word, leading to smaller parse charts and faster parsing; (2) grammar design improvement, which reduces structural ambiguity and also leads to smaller parse charts and faster parsing; (3) top-down parsing, a family of alternative parsing algorithms so far not implemented in DELPH-IN HPSG. Parsing with the Spanish Resource Grammar is now faster, and I have also presented a version of the parser that is faster for the English Resource Grammar, using a methodology that can also be used for Spanish.
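
As an illustration of methodology (1), the sketch below shows one generic way a supertagger's output could be used to prune lexical interpretations before parsing. It is not the project's implementation: the relative-threshold rule, the parameter beta, the function names and the lexical type names are all assumptions made for this example.

```python
from typing import Dict, List


def filter_supertags(tag_probs: Dict[str, float], beta: float = 0.1) -> List[str]:
    """Keep the lexical types whose probability is at least beta * best probability.

    tag_probs maps a (hypothetical) lexical type name to the probability
    that a supertagger assigned to it for one token.
    """
    if not tag_probs:
        return []
    best = max(tag_probs.values())
    kept = [tag for tag, p in tag_probs.items() if p >= beta * best]
    # Most probable types first.
    return sorted(kept, key=lambda tag: -tag_probs[tag])


def lexical_entries_in_chart(sentence: List[Dict[str, float]], beta: float = 0.1) -> int:
    """Count the lexical entries that would enter the parse chart after filtering."""
    return sum(len(filter_supertags(tag_probs, beta)) for tag_probs in sentence)


# Invented probabilities for a two-token sentence (illustration only).
toy_sentence = [
    {"n_-_c_le": 0.9, "v_np_le": 0.08, "adj_le": 0.02},
    {"v_-_le": 0.6, "n_-_c_le": 0.4},
]
unfiltered = lexical_entries_in_chart(toy_sentence, beta=0.0)
filtered = lexical_entries_in_chart(toy_sentence, beta=0.1)
```

With fewer lexical entries per token, the parser has fewer edges to combine, which is the source of the smaller charts mentioned above. A lower beta keeps more interpretations and trades speed for a lower risk of discarding the correct one.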

Work Package 2 – Develop error production rules for Spanish and add them to Spanish Resource Grammar
• Progress: Good/satisfactory progress
• Summary of activities: I have taken an approach driven by research questions in SLA (second language acquisition), establishing a collaboration with Dr. Ogneva, who specializes in SLA (University of Santiago de Compostela). I have designed and integrated error production rules into the SRG and tested them on the learner corpus. In addition, I have deployed the original grammar (without the error production rules) on the learner corpus, to assess its overgeneration (how many sentences with errors it parses even though it has no error production rules designed specifically to cover them). This helped improve the original grammar both in terms of its performance (see WP1) and in terms of its theoretical quality.

Work Package 3 – Deploy the system [for testing] and collect feedback
• Progress: Good/satisfactory progress
• Summary of activities: All the code is released publicly via GitHub. Comprehensive feedback on the system quality was obtained by deploying it on several portions of two different learner corpora.

Work Package 4 – Analyze feedback and results; release and disseminate the final product
• Progress: Good/satisfactory progress (releasing the final package: limited progress)
• Summary of activities: I deployed the SRG equipped with error production rules on learner corpora and counted error rates with the learner’s first language as the independent variable; I started analyzing the types of mistakes that different learners make. All code and data are publicly released, a publication on the current version of the SRG is out, and two journal publications on the data analysis are in preparation. I am in contact with teachers who are prospective users of the system. The grammar version capable of error detection is available as a web demo but not yet in a form suitable for efficient public use (e.g. in a classroom).

Work Package 1:
o I have trained a supertagger for English that helps achieve a speed-up factor of 3 when parsing English with the English Resource Grammar (Zamaraeva and Gómez-Rodríguez 2024; https://github.com/olzama/delphin-parsing). Training the supertagger involved working with BERT models and Huggingface interfaces and APIs.
o I have produced Huggingface datasets (https://github.com/olzama/delphin-parsing) for the English HPSG treebanks (Redwoods). This outcome is important because the Redwoods treebanks, which come in the form of DELPH-IN databases, can be non-trivial for researchers outside DELPH-IN to use. This is a step towards making DELPH-IN technology more accessible in the field of natural language processing.
o I have trained a supertagger for Spanish, and it turned out that more labelled data is needed for Spanish to achieve high enough accuracy. Currently, the accuracy is around 77%, whereas for English it is 94%. Therefore, I invested more in WP2, which concerns developing not only the Spanish Resource Grammar but also the size and quality of the associated labelled data (https://github.com/delph-in/srg).
o Furthermore, as part of the work done in WP2, I have found ways to make the Spanish grammar faster to parse with, with up to a 200% performance gain, as reflected in release 0.3.5 of the grammar (https://github.com/delph-in/srg/releases/tag/v.0.3.5). The improvements focus on identifying missing constraints in the grammar types and adding them. This reduces the number of possibilities that the parser has to consider, given the grammar, resulting in faster parsing. Identifying missing constraints is also a step towards a theoretical result: a better theory of HPSG agreement, to be presented in a journal paper currently in preparation (see also WP2).

Work Package 2:
o Ongoing collaboration with Dr. Ogneva (USC) on a journal publication related to testing SLA hypotheses on a learner corpus using a new method with the SRG. The method relies on using the SRG for establishing not only learner error rates depending on variables such as the learner’s first language but also deep syntactic contexts of the errors (what syntactic structure surrounds the error). This method is new for the field of SLA, as far as we know. Previously, shallower methods have been used when studying learner corpora.
o I integrated a system of error production rules (70 lexical rules) into a dedicated branch of the Spanish Resource Grammar (https://github.com/olzama/gauss/tree/main/grammars/srg-mal). The domain was determined by the research questions in second language acquisition (gender agreement in the noun phrase). The rule design was driven by a learner corpus (COWSL2H, developed at UC Davis).
o In the process, I identified missing constraints in the main branch of the grammar. I added those missing constraints, and as a result, the grammar has become faster to parse with, contributing to WP1 and allowing more long sentences to be parsed (see WP1). The coverage of the modified grammar, measured on a portion of the COWSL2H learner corpus, is 83% on sentences up to length 9.
o I presented and published a paper at the LREC-COLING 2024 conference in Turin, Italy, showcasing the new version of the Spanish Resource Grammar (https://aclanthology.org/2024.lrec-main.1312/).
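
The coverage figures reported here are given for sentences up to a certain length. As a minimal sketch of how such a figure can be computed, assuming that each parsed item is summarized as a pair of sentence length and a flag saying whether at least one parse was found (this record format and the function name are assumptions for the example, not the project's tooling):

```python
from typing import Iterable, Tuple


def coverage_up_to(results: Iterable[Tuple[int, bool]], max_len: int) -> float:
    """Share of sentences of length <= max_len that received at least one parse.

    results holds one (sentence_length, was_parsed) pair per corpus item.
    """
    selected = [was_parsed for length, was_parsed in results if length <= max_len]
    if not selected:
        raise ValueError("no sentences within the length limit")
    # True counts as 1 and False as 0 when summed.
    return sum(selected) / len(selected)


# Invented toy records (illustration only, not corpus results).
toy_results = [(3, True), (5, True), (9, False), (12, True), (8, True)]
toy_coverage = coverage_up_to(toy_results, max_len=9)  # 3 of 4 items, i.e. 0.75
```

Restricting the computation to a length limit matters because longer sentences are more likely to time out or exhaust memory, so coverage over all lengths is usually lower than coverage over short sentences.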

Work Package 3:
o Web demo: The grammar is available via the web demo hosted at VU Amsterdam (DELPH-IN collaboration; https://compling.cltl.labs.vu.nl/itell/delphin_analyser). The demo allows the user to enter a morphophonologically analyzed version of a Spanish sentence and to obtain an HPSG parse tree. A version of the demo which accepts a plain text representation of sentences and gives feedback based on the error production rule used in the parse is in progress.
o I have parsed portions of several learner corpora (COWSL2H, with L1 English; and CEDEL2, with various L1s: French, Italian, Portuguese, Chinese, Japanese, Russian, Arabic) with the SRG equipped with error production rules. The coverage of the SRG over the portion of COWSL2H is 86%.
o I ran the original SRG (no error production rules) on the COWSL2H learner corpus, in order to see if it covers some of the sentences with non-target gender agreement (mismatched agreement, or gender agreement errors). Examining such sentences allowed me to identify missing gender constraints in the grammar. Adding these constraints led to better SRG performance (see WP1) and accuracy, which went up from the 80% average over the sentences of length 1-12 to 86%.
o A paper presenting the new analysis of agreement for the original SRG is in preparation, to be submitted to the journal Language Resources and Evaluation or to Corpus Linguistics and Linguistic Theory.
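
The planned feedback step, which relies on knowing which error production rule was used in a parse, can be pictured roughly as follows. This is only a sketch under stated assumptions: the rule names, the feedback messages and the idea of representing a parse as a flat list of rule names are all hypothetical and are not taken from the SRG or the demo.

```python
from typing import Dict, List

# Hypothetical error production rule names mapped to learner-facing messages.
FEEDBACK_BY_RULE: Dict[str, str] = {
    "mal_det_noun_gender": "Check gender agreement between the determiner and the noun.",
    "mal_noun_adj_gender": "Check gender agreement between the noun and the adjective.",
}


def feedback_for(rules_in_parse: List[str],
                 messages: Dict[str, str] = FEEDBACK_BY_RULE) -> List[str]:
    """Return one message per distinct error production rule found in a parse.

    rules_in_parse is assumed to list the names of all rules used in the
    derivation; ordinary rules have no entry in messages and are skipped.
    """
    feedback: List[str] = []
    for rule in rules_in_parse:
        message = messages.get(rule)
        if message is not None and message not in feedback:
            feedback.append(message)
    return feedback


# A made-up derivation that uses one error production rule twice.
toy_feedback = feedback_for(
    ["subj_head", "mal_det_noun_gender", "head_comp", "mal_det_noun_gender"]
)
```

A sentence parsed without any error production rule yields an empty list, which a user interface could present as "no agreement problems detected".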

Work Package 4:
o I have started analyzing the results from deploying the grammar on two learner corpora, with a focus on research questions from second language acquisition (SLA/L2). Specifically, I am in the process of testing the following hypotheses: (1) learners whose L1 has no gender system will make more gender agreement mistakes than learners whose L1 has a gender system; (2) learners whose L1 has a gender system similar to Spanish (e.g. two genders) will make fewer mistakes than those whose L1 has a different gender system (e.g. three genders). These are well-known hypotheses which have not yet been tested on corpora (only on smaller-scale data collected in experimental lab settings, with data mainly being grammaticality judgments supplied by participants for sentences constructed by researchers). My approach with learner corpora allows for faster data collection (of course, collecting the corpora originally was time consuming, but now that we have them, we should analyze them at scale).
o The current open-source version is available (https://github.com/olzama/gauss), as is a web demo (https://compling.cltl.labs.vu.nl/itell/delphin_analyser). However, because developing the user interface requires additional non-academic expertise, I do not yet have a packaged product suitable for a classroom. The demo requires a special input format.
o I have established a relationship with a Spanish-as-a-second-language teacher, Laura Rodríguez, from Academia Equipo (A Coruña, Spain). She is willing to try my system once the user interface issues are solved. I also had a consultation with Iago Fernández, a Spanish-as-a-second-language teacher from Escuela de idiomas (a large language school in A Coruña, Spain).
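
Hypothesis (1) above compares gender agreement error rates between two groups of learners. The sketch below shows one simple way such a comparison could be tabulated. The record format, the function name and the grouping table are assumptions for this example (the grouping is a simplification that treats English, Chinese and Japanese as lacking noun gender agreement), and the counts are invented placeholders, not project results.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Simplified grouping (an assumption of this sketch): does the L1 mark
# grammatical gender on nouns?
L1_HAS_GENDER: Dict[str, bool] = {
    "French": True, "Italian": True, "Portuguese": True,
    "Russian": True, "Arabic": True,
    "English": False, "Chinese": False, "Japanese": False,
}


def error_rate_by_group(records: Iterable[Tuple[str, int, int]]) -> Dict[bool, float]:
    """Aggregate gender agreement error rates per L1 group.

    Each record is (l1, agreement_contexts, agreement_errors). The result is
    keyed by True (L1 has gender) and False (L1 has no gender).
    """
    contexts: Dict[bool, int] = defaultdict(int)
    errors: Dict[bool, int] = defaultdict(int)
    for l1, n_contexts, n_errors in records:
        group = L1_HAS_GENDER[l1]
        contexts[group] += n_contexts
        errors[group] += n_errors
    return {group: errors[group] / contexts[group]
            for group in contexts if contexts[group] > 0}


# Placeholder counts for illustration only.
toy_records = [("English", 100, 10), ("Japanese", 100, 20), ("French", 200, 10)]
toy_rates = error_rate_by_group(toy_records)
```

A real analysis would also need a significance test and controls for proficiency level and text length. Hypothesis (2) could be handled the same way by replacing the boolean grouping with the number of genders in the L1.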
