CORDIS - EU research results
Content archived on 2024-06-18

Analysis of Natural Language for Real World Applications

Periodic Report Summary 1 - ANALYSIS (Analysis of Natural Language for Real World Applications)

Natural language understanding, information extraction, and machine translation are critical for human-machine interaction and are key technologies in many everyday applications (e.g. search engines, mobile devices, robots). Natural language understanding systems transform spoken language or written text into syntactic and semantic structures. Critically, these systems need to work flexibly across many different text genres (i.e. language domains), including emails, web texts, newspapers, conversations, spoken language, and product reviews.

What are the problems and limitations of current syntactic analyzers?

A critical challenge for current syntactic analyzers in real-world applications is adapting flexibly to different language domains. The problem arises because these analyzers are trained primarily on syntactically annotated newspaper text (e.g. the Wall Street Journal for English). In particular, the major resource for training English syntactic analyzers is an annotated text collection called the Penn Treebank, which contains texts from only one genre: economic news. The analyzers, however, are applied to a wide range of text genres, such as emails, newsgroups, blogs, consumer reviews, newspapers with mostly non-economic text, and spoken language. When applied to these texts, the error rate doubles, as shown in the table comparing performance on in-domain and out-of-domain data (e.g. Wall Street Journal vs. answers, newsgroups, etc.).
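As a rough illustration of what an in-domain versus out-of-domain comparison measures, the Python sketch below computes an unlabeled attachment error rate, i.e. the fraction of tokens whose predicted syntactic head differs from the gold annotation. The function name, the choice of metric, and the toy head indices are assumptions made for illustration only; they are not the project's evaluation setup or its reported figures.

```python
def attachment_error_rate(gold_heads, predicted_heads):
    """Return the fraction of tokens whose predicted head differs from gold."""
    if len(gold_heads) != len(predicted_heads):
        raise ValueError("gold and predicted sequences must have equal length")
    if not gold_heads:
        raise ValueError("cannot score an empty sequence")
    wrong = sum(g != p for g, p in zip(gold_heads, predicted_heads))
    return wrong / len(gold_heads)


# Hypothetical toy data: one head index per token (0 marks the root).
# These numbers are invented for the example, not measured results.
in_domain_gold = [2, 0, 2, 5, 3, 2, 8, 6, 8, 2]
in_domain_pred = [2, 0, 2, 5, 3, 2, 8, 6, 6, 2]       # 1 of 10 heads wrong

out_of_domain_gold = [0, 1, 4, 2, 4, 7, 5, 7, 1, 1]
out_of_domain_pred = [0, 1, 2, 2, 4, 7, 5, 5, 1, 1]   # 2 of 10 heads wrong

in_domain_error = attachment_error_rate(in_domain_gold, in_domain_pred)
out_of_domain_error = attachment_error_rate(out_of_domain_gold, out_of_domain_pred)

print(f"in-domain error:     {in_domain_error:.2f}")
print(f"out-of-domain error: {out_of_domain_error:.2f}")
print(f"ratio:               {out_of_domain_error / in_domain_error:.1f}")
```

In this toy setup the out-of-domain error is twice the in-domain error, which is the kind of gap the comparison above describes.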

The overarching aim of this research proposal is to develop novel syntactic analysis approaches that work flexibly on a variety of text genres that differ from the training domain.