CORDIS - EU research results

Heterogeneous Learning for Natural Language Processing

Final Report Summary - HELENLP (Heterogeneous Learning for Natural Language Processing)

The problem of integrating information from sources of varying quality and type into knowledge is fundamental in artificial intelligence, machine learning, language analysis and databases. Most modern machine learning approaches rely on examples annotated by experts, a process that is expensive, slow and error-prone. Furthermore, many annotated datasets are built from sources with specific characteristics and for a narrow task, and are often not used to build systems for wider goals. For example, the famous Penn Treebank [MSM93] is based on articles from the Wall Street Journal, yet current practice in machine learning requires a fresh annotated dataset if we wish to process blogs.
The digital era has made huge heterogeneous collections of data accessible (images, audio, video and, above all, text), primarily via the World Wide Web. This heterogeneous data creates excellent opportunities for building statistical automated systems for various natural language processing tasks, with applications ranging from automatic document classification, through a full range of information extraction tasks, to speech analysis and recognition.
My long-term objective is to design approaches and build systems that interact better with humans, and in this project we focused on the problem of improving performance using additional sources. These sources can be already annotated and used passively, or partially annotated and used actively by automatically querying an annotator.
Our research focused on developing new computational and statistical methods for integrating data from various sources, and on applying these methods to natural language tasks. We developed ways to annotate partially annotated data, either passively, with no annotator, or actively, with an annotator. We also developed algorithms for learning a few tasks simultaneously by combining annotated data from a few sources. We applied our methods to various tasks, including text categorization and phoneme segmentation and recognition.
The objective of the reintegration period was to establish a strong research group in machine learning that would become a center for its main research themes. This goal was achieved very successfully in terms of researchers, funding and publications.
* People: Prof Crammer founded and manages a machine learning group. The group currently includes one post-doctoral researcher (jointly with another faculty member), 3 PhD students, 6 MSc students, and an engineer. Ten (10) MSc students have graduated under the supervision of Prof Crammer during the course of this project.
* Funding and Equipment: The group has secured funding in addition to the Marie Curie integration grant, including two ISF grants: one for equipment purchase and one for funding student fellowships. The group has strong ties with industry, which is also reflected in funding from companies such as Google, Intel, and Yahoo. The research in this project also led to funding from the ministry of economics for transferring knowledge from academia to industry. Together with another researcher, our group has acquired a large and powerful computing cluster.
The work of our machine learning group covers a large spectrum of machine learning and is especially focused on three areas. First, online learning, where we develop algorithms, analyze them and apply them in practice. These algorithms are efficient, work on-the-fly and are able to adapt to changing environments. About four of the MSc theses are directly or indirectly related to online learning. Second, learning with a few heterogeneous sources, which is directly linked to this project. We developed theory and algorithms for learning a few tasks simultaneously, annotating partially annotated data (transductive learning and semi-supervised learning), and using data from a few domains. Three of the MSc theses cover these topics. Third, applications of our methods in general, and in natural language in particular, such as speech and text. Most of the empirical work performed in the group is related to these applications, and one additional thesis is focused solely on applications. Two theses in the group are focused on core theory and algorithms in machine learning.
To conclude, the reintegration grant has been a major funding source for the group and played a critical role in its success. With the support of this grant we were able to secure additional funding for the group and to build a major machine learning group at the Technion and in Israel.