Periodic Report Summary 1 - ANALYSIS (Analysis of Natural Language for Real World Applications)
What are the problems and limitations of current syntactic analyzers?
A critical challenge for current syntactic analyzers in real-world applications is adapting flexibly to different language domains. The problem arises because current syntactic analyzers are trained primarily on syntactically annotated newspaper text (e.g., the Wall Street Journal for English). In particular, the major resource for training syntactic analyzers for English is an annotated text collection called the Penn Treebank, which contains texts from only one genre: economic news. The analyzers, however, are applied to a wide range of text genres, such as emails, newsgroups, blogs, consumer reviews, newspaper text that is mostly non-economic, and spoken language. When applied to these texts, the error rate doubles, as shown in the table comparing performance on in-domain and out-of-domain data (e.g., Wall Street Journal vs. answers, newsgroups, etc.).
The overarching aim of this research proposal is to develop novel syntactic analysis approaches that adapt flexibly to a variety of text genres different from the training domain.