As the market for NLP products and services grows, clear patterns of system types are starting to emerge. Prospective buyers and end-users of NLP products will increasingly be confronted with the problem of choosing the product that best meets their specific requirements. Suppliers of NLP products and services may want to know how their systems and tools compare to those of their competitors. Developers and researchers are likely to be interested in establishing whether their systems perform according to specification. At present, companies, institutions and corporate users interested in any of these types of evaluation spend a considerable amount of time and effort building data and tools for their own test purposes. This project aims to alleviate the situation with respect to data and tools by defining guidelines and a methodology for the construction of diagnostic data ("test suites") and by designing and implementing related tools.
The guidelines will be validated by constructing application-specific diagnostic data for French, German and English and testing them on a number of applications ranging from parsers through grammar checkers to controlled language checkers. In addition to devising guidelines, the project will investigate techniques, and design and implement tools, that facilitate the construction, use and manipulation of test suites, such as a database for storing and manipulating test data, and tools for the (semi-)automatic generation of test suites.
Approach and Methodology
The first task of the project will be to survey existing test suites and draw up specifications for describing them. This survey will help identify the types of NLP applications for which each test suite has been used, and the type of evaluation in which it has been applied.
The main thrust of the project will be to set up guidelines for the construction of test suites. Some of the issues which the project will investigate may be application-independent (e.g. size), others may be application-dependent (e.g. the necessity of avoiding examples which involve translational problems when constructing a test suite for monolingual applications). The project will also investigate ways of assigning weightings to test sentences and an efficient annotation scheme. The annotation scheme adopted will be developed with the aim of storing the test suite fragments in a database.
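To make the idea of weighted, annotated test sentences concrete, the following is a minimal sketch in Python. Everything in it is an assumption made for illustration: the field names (`language`, `phenomenon`, `grammatical`, `weight`), the class name `SuiteItem` and the scoring function `weighted_score` are hypothetical and are not the annotation scheme the project will define.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SuiteItem:
    """One annotated test sentence (hypothetical fields, for illustration only)."""
    text: str
    language: str        # e.g. "en", "fr", "de"
    phenomenon: str      # e.g. "subject-verb agreement"
    grammatical: bool    # expected judgement for this sentence
    weight: float = 1.0  # relative importance of the item in an evaluation


def weighted_score(items, verdicts):
    """Return the weighted share of items on which a system's verdict
    matches the annotated judgement."""
    if len(items) != len(verdicts):
        raise ValueError("one verdict is required per test item")
    total = sum(item.weight for item in items)
    if total == 0:
        raise ValueError("total weight must be positive")
    agreed = sum(
        item.weight
        for item, verdict in zip(items, verdicts)
        if verdict == item.grammatical
    )
    return agreed / total


items = [
    SuiteItem("The engineer sleeps.", "en", "agreement", True, weight=2.0),
    SuiteItem("The engineer sleep.", "en", "agreement", False, weight=1.0),
    SuiteItem("The engineers sleep.", "en", "agreement", True, weight=1.0),
]
# A system that accepts the first sentence and rejects the other two.
score = weighted_score(items, [True, False, False])
```

Here the system's single error falls on an item of weight 1 out of a total weight of 4, so the score is 0.75 rather than the unweighted 2/3.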
The soundness of the proposed guidelines and methods will be demonstrated by constructing test data for French, German and English, and validating the resulting test suites against a number of applications and/or components.
In addition to devising guidelines and methods for the construction of diagnostic test data, the project also intends to investigate techniques, and to design and implement tools, that will facilitate the construction/generation, use and manipulation of test suites.
Firstly, the project will investigate techniques for automatically generating test suites, e.g. by means of special, simple test suite grammars. Test suites are normally hand-constructed. However, this process is difficult (requiring considerable linguistic sophistication and skill), laborious, tedious, and above all error-prone. All this suggests that the process is a good candidate for automation, or more precisely, for an interactive process that involves a substantial amount of automation. Another advantage of automation is that it allows "dynamic" test suite construction, where test data can be replaced by new data which test the same phenomena. In this way, it may be possible to overcome one of the problems that sometimes crop up in system evaluation, namely that developers can tune their application to a static set of test data. Finally, automatic, dynamic test suite generation should open the possibility of using very large lexicons, perhaps with some "randomisation", thus making it possible to hold and transmit extremely large "virtual" suites, in the order of many millions of sentences, providing standard benchmarks for system testing.
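As an illustration of generation from a simple test suite grammar, the sketch below uses a toy, non-recursive context-free grammar in Python. The grammar, the lexicon and the function names (`expand`, `generate_all`, `random_sentence`) are assumptions invented for this example; they are not the project's grammar formalism. `generate_all` enumerates the whole suite, while `random_sentence` draws a single derivation, which is one way a large "virtual" suite could be sampled without ever being stored in full.

```python
import itertools
import random

# Toy grammar: each non-terminal maps to a list of productions.
# Symbols that are not keys of the dictionary are treated as words.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["Det", "N"]],
    "VP": [["Vi"], ["Vt", "NP"]],
    "Det": [["the"]],
    "N": [["manager"], ["engineer"]],
    "Vi": [["sleeps"]],
    "Vt": [["sees"]],
}


def expand(symbol, grammar):
    """Yield every word sequence derivable from `symbol`.
    The grammar must be non-recursive, otherwise this does not terminate."""
    if symbol not in grammar:
        yield [symbol]
        return
    for production in grammar[symbol]:
        parts = [list(expand(child, grammar)) for child in production]
        for combination in itertools.product(*parts):
            yield [word for part in combination for word in part]


def generate_all(grammar, start="S"):
    """Return the full list of sentences licensed by the grammar."""
    return [" ".join(words) for words in expand(start, grammar)]


def random_sentence(grammar, rng, start="S"):
    """Draw one sentence by choosing a production at random at each step."""
    def derive(symbol):
        if symbol not in grammar:
            return [symbol]
        production = rng.choice(grammar[symbol])
        return [word for child in production for word in derive(child)]
    return " ".join(derive(start))


suite = generate_all(GRAMMAR)
fresh_item = random_sentence(GRAMMAR, random.Random(0))
```

With this toy grammar the full suite contains six sentences. Enlarging the `N`, `Vi` and `Vt` entries multiplies the size of the suite without any change to the rules, and drawing sentences with a different seed yields different data that exercise the same constructions.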
Secondly, the project will investigate whether and to what extent it is possible to derive test suites (semi-) automatically from corpora.
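One very crude form of semi-automatic derivation is to filter corpus sentences by surface patterns and pass the resulting candidates to a human for checking. The Python sketch below assumes this approach purely for illustration: the pattern table `PATTERNS`, the phenomenon labels, the length limit and the function `candidates` are all hypothetical, and surface patterns of this kind will both miss relevant sentences and admit irrelevant ones.

```python
import re

# Hypothetical surface patterns, one per phenomenon (English only).
PATTERNS = {
    "negation": re.compile(r"\b(not|never)\b|n't\b", re.IGNORECASE),
    "wh_question": re.compile(
        r"^(who|what|which|where|when|why|how)\b.*\?$", re.IGNORECASE
    ),
}


def candidates(corpus, phenomenon, max_tokens=8):
    """Return short, distinct corpus sentences that match the pattern for
    `phenomenon`; these are candidates for manual review, not final test items."""
    pattern = PATTERNS[phenomenon]
    seen = set()
    picked = []
    for sentence in corpus:
        stripped = sentence.strip()
        # Skip duplicates and sentences too long to isolate one phenomenon.
        if stripped in seen or len(stripped.split()) > max_tokens:
            continue
        if pattern.search(stripped):
            seen.add(stripped)
            picked.append(stripped)
    return picked


corpus = [
    "The engineer did not arrive.",
    "Who wrote the report?",
    "The engineer did not arrive.",
    "The manager sleeps.",
    "The manager who never replied to any of the letters did not arrive on time.",
]
negation_items = candidates(corpus, "negation")
question_items = candidates(corpus, "wh_question")
```

The length limit reflects the usual aim that a test sentence should isolate a single phenomenon; long corpus sentences tend to combine several.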
Finally, the project intends to design and develop a relational database for storing and manipulating test suite data. The annotated test suite fragments built during the project will be stored in that database.
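A relational layout for annotated test items might look like the following minimal sketch, written in Python with the standard `sqlite3` module. The table names, column names and helper functions (`build_database`, `add_item`, `items_for`) are assumptions made for this example and do not describe the schema the project will adopt.

```python
import sqlite3

# Hypothetical two-table schema: phenomena, and test items that refer to them.
SCHEMA = """
CREATE TABLE phenomenon (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE test_item (
    id            INTEGER PRIMARY KEY,
    language      TEXT NOT NULL CHECK (language IN ('en', 'fr', 'de')),
    text          TEXT NOT NULL,
    grammatical   INTEGER NOT NULL CHECK (grammatical IN (0, 1)),
    phenomenon_id INTEGER NOT NULL REFERENCES phenomenon(id)
);
"""


def build_database(path=":memory:"):
    """Create the tables and return an open connection."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def add_item(conn, language, text, grammatical, phenomenon):
    """Store one test sentence, creating the phenomenon row if needed."""
    conn.execute(
        "INSERT OR IGNORE INTO phenomenon (name) VALUES (?)", (phenomenon,)
    )
    (phenomenon_id,) = conn.execute(
        "SELECT id FROM phenomenon WHERE name = ?", (phenomenon,)
    ).fetchone()
    conn.execute(
        "INSERT INTO test_item (language, text, grammatical, phenomenon_id) "
        "VALUES (?, ?, ?, ?)",
        (language, text, int(grammatical), phenomenon_id),
    )


def items_for(conn, language, phenomenon):
    """Return (text, grammatical) pairs for one language and phenomenon."""
    rows = conn.execute(
        "SELECT t.text, t.grammatical FROM test_item AS t "
        "JOIN phenomenon AS p ON p.id = t.phenomenon_id "
        "WHERE t.language = ? AND p.name = ? ORDER BY t.id",
        (language, phenomenon),
    )
    return [(text, bool(flag)) for text, flag in rows]


conn = build_database()
add_item(conn, "en", "The engineer sleeps.", True, "agreement")
add_item(conn, "en", "The engineer sleep.", False, "agreement")
add_item(conn, "de", "Der Ingenieur schläft.", True, "agreement")
english_agreement = items_for(conn, "en", "agreement")
```

Keeping phenomena in their own table lets the same query select comparable items across French, German and English, which is the kind of manipulation a test suite database is meant to support.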
Exploitation and Future Prospects
The main result of this project will be the guidelines and the methodology for test suite construction, which can be used in different NLP application fields and systems. It is expected that a set of guidelines will facilitate the interpretation of test suites and enhance their portability. This will be of direct benefit to all those companies and institutions that currently spend a considerable amount of time and effort building test suites for their own purposes.
The results are also likely to be useful for several areas of linguistic research, since they provide a catalogue of linguistic data of potential value to theoretical and empirical work in linguistics.
All project results will become publicly available. The tools will have a high degree of portability, allowing for easy integration into a common framework (e.g. ALEP). The availability of project results will be widely publicised at conferences, at evaluation workshops and in magazines in the field of NLP, in order to create optimal conditions for exploitation by a large number of users.