Final Report Summary - HELENLP (Heterogeneous Learning for Natural Language Processing)
The digital era has made huge, heterogeneous collections of data accessible, primarily via the World Wide Web: images, audio, video and, above all, text. This heterogeneous data creates wonderful opportunities for building statistical automated systems for various natural language processing tasks, with applications ranging from automatic document classification, through a full range of information extraction tasks, to speech analysis and recognition.
My long-term objective is to design approaches and build systems that interact better with humans, and in this project we focused on the problem of improving performance using additional sources of data. These sources can be already annotated and used passively, or partially annotated and used actively by automatically querying an annotator.
Our research focused on developing new computational and statistical methods for integrating data from various sources, and on applying these methods to natural language tasks. We developed ways to annotate partially annotated data, either passively, with no annotator, or actively, with an annotator. We also developed algorithms for learning a few tasks simultaneously by combining annotated data from a few sources. We applied our methods to various tasks, including text categorization and phoneme segmentation and recognition.
The objective of the reintegration period was to establish a strong research group in machine learning that would become a center for its main research themes. This goal was achieved very successfully in terms of researchers, funding and publications.
* People: Prof Crammer founded and manages a machine learning group. The group currently includes one post-doctoral researcher (shared with another faculty member), three PhD students, six MSc students, and an engineer. Ten MSc students graduated under the supervision of Prof Crammer during the course of this project.
* Funding and Equipment: The group has secured funding in addition to the Marie Curie integration grant, including two ISF grants: one for equipment purchase and one for student fellowships. The group has strong ties with industry, which are also reflected in funding from companies such as Google, Intel, and Yahoo. The research in this project also led to funding from the ministry of economics for transferring knowledge from academia to industry. Together with another researcher, our group has acquired a large and powerful computing cluster.
The work of our machine learning group covers a large spectrum of machine learning, with a particular focus on three areas. First, online learning, where we develop algorithms, analyze them and apply them in practice. These algorithms are efficient, work on-the-fly and are able to adapt to changing environments. About four of the MSc theses are directly or indirectly related to online learning. Second, learning with a few heterogeneous sources, which is directly linked to this project. We developed theory and algorithms for learning a few tasks simultaneously, for annotating partially annotated data (transductive learning and semi-supervised learning), and for using data from a few domains. Three of the MSc theses cover these topics. Third, applications of our methods in general, and in natural language in particular, such as speech and text. Most of the empirical work performed in the group is related to these applications, and one additional thesis is focused solely on applications. Two theses in the group focus on core theory and algorithms in machine learning.
To conclude, the reintegration grant has been a major funding source for the group and has played a critical role in its success. With the support of this grant we were able to secure additional funding and to build a major machine learning group at the Technion and in Israel.