Machine Learning Methods for Complex Outputs and Their Application to Natural Language Processing and Computational Biology

Final Report Summary - JOINTSTRUCTUREDPRED (Machine learning methods for complex outputs and their application to natural language processing and computational biology)

The overall goal of this machine learning project is to perform inference on complex tasks that consist of multiple structured prediction tasks. The predominant approach, i.e. defining simpler subtasks and solving them independently or in a cascaded manner, is prone to error propagation. It can also limit the information flow across tasks. This project aims to overcome these limitations by training the subtasks jointly using multi-task learning techniques and approximate inference methods where applicable. The application fields have been identified as natural language processing (NLP) and computational biology (CompBio).

The joint training of structured prediction problems (for inference on complex tasks) involves coupling subtasks of different interaction types and learning across these tasks using machine learning techniques. In this project, various learning techniques have been investigated, in particular by defining model parameters that capture the pre-defined interactions and by coupling the model parameters through different regularisation schemes. These coupling relations are applied to different structural interactions, namely hierarchically related sequence prediction tasks in CompBio and multiple sequence prediction problems modelled as Markov and semi-Markov chains in NLP. Different optimisation techniques have been investigated for feasibility and scalability, which is a particularly important aspect for sequence problems in CompBio, where training can be very expensive.
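
As an illustration of what coupling model parameters through a regulariser can look like for hierarchically related tasks, the following minimal sketch penalises the squared distance between each task's weight vector and that of its parent in a task tree. The report does not state which regularisers were used, so the quadratic form, the function names (coupling_penalty, coupling_gradient) and the parameter lam are assumptions made for this example only.

```python
import numpy as np

def coupling_penalty(weights, parent, lam=1.0):
    """Hypothetical hierarchical regulariser (an assumption, not the project's code).

    weights: dict mapping node name -> 1-D float array of model parameters
    parent:  dict mapping node name -> parent node name (None for the root)
    Returns lam * sum over nodes of ||w_node - w_parent||^2, where the root
    is compared against the zero vector.
    """
    total = 0.0
    for node, w in weights.items():
        p = parent[node]
        ref = weights[p] if p is not None else np.zeros_like(w, dtype=float)
        total += lam * float(np.sum((w - ref) ** 2))
    return total

def coupling_gradient(weights, parent, lam=1.0):
    """Gradient of coupling_penalty with respect to every node's weights."""
    grads = {n: np.zeros_like(w, dtype=float) for n, w in weights.items()}
    for node, w in weights.items():
        p = parent[node]
        if p is None:
            grads[node] += 2.0 * lam * w
        else:
            diff = w - weights[p]
            grads[node] += 2.0 * lam * diff   # pulls the child towards its parent
            grads[p] -= 2.0 * lam * diff      # and the parent towards the child
    return grads
```

In a joint training loop, this penalty and its gradient would be added to the sum of the per-task losses, so that tasks close together in the hierarchy are encouraged to share parameters while still being free to differ.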

To evaluate the proposed method against the baseline empirically, a software package has been developed that implements both the cascaded and the joint approach to structured prediction. In particular, conditional random fields and the structured perceptron have been implemented with different regularisation and coupling schemes. The NLP tasks have been identified as part-of-speech (POS) tagging, shallow parsing and parsing, which involve a Markov chain, a semi-Markov chain and a tree structure respectively. Benchmark datasets and features have been extracted. For CompBio, the problem of splice site recognition on multiple organisms has been identified as an application for hierarchically related multiple sequence tasks. A taxonomy has been generated to relate 15 different organisms. Real and synthetic datasets were acquired in order to enable a more thorough experimental analysis.
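
For readers unfamiliar with the structured perceptron on a Markov chain, the following is a minimal single-task sketch, not the project's software package. It assumes one integer feature id per position (for example a word id) and first-order transitions; the names viterbi and train_perceptron, and the toy data in the usage note, are hypothetical.

```python
import numpy as np

def viterbi(emit, trans):
    """Highest-scoring label sequence for a first-order chain.

    emit[t, y] is the score of label y at position t;
    trans[y_prev, y] is the score of moving from y_prev to y.
    """
    T, K = emit.shape
    score = emit[0].copy()
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        cand = score[:, None] + trans        # cand[y_prev, y]
        back[t] = cand.argmax(axis=0)        # best predecessor for each y
        score = cand.max(axis=0) + emit[t]
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):            # follow back-pointers
        path.append(int(back[t, path[-1]]))
    return path[::-1]

def train_perceptron(data, n_feats, n_labels, epochs=10):
    """data: list of (x, y) with x a list of feature ids, y a list of label ids."""
    W = np.zeros((n_feats, n_labels))        # emission weights
    A = np.zeros((n_labels, n_labels))       # transition weights
    for _ in range(epochs):
        for x, y in data:
            pred = viterbi(W[x], A)
            if pred == list(y):
                continue
            # Perceptron update: reward gold features, penalise predicted ones.
            for t, f in enumerate(x):
                W[f, y[t]] += 1.0
                W[f, pred[t]] -= 1.0
                if t > 0:
                    A[y[t - 1], y[t]] += 1.0
                    A[pred[t - 1], pred[t]] -= 1.0
    return W, A
```

For example, training on the toy sequences ([0, 1, 2], [0, 1, 0]) and ([1, 0], [1, 0]) with three feature ids and two labels yields weights under which viterbi recovers both gold label sequences. A joint variant would train several such models at once and tie their weights through a coupling term.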

Empirical evaluation in CompBio problems shows that the proposed method outperforms the baseline on synthetic and real data by significantly improving prediction accuracy. The findings are similar (but not as pronounced) in NLP applications. It is conjectured that the performance improvement is due to combining the information available for all tasks and relating the training phases of all inference problems.

Complex prediction problems are ubiquitous in NLP, CompBio, computer vision and information retrieval. Hence, the problem is inherently inter-disciplinary. The empirical results of this project indicate that the proposed approach can be used to improve performance on many prediction problems in a wide range of disciplines, especially in cases where hand-labelled data is limited. As such, it can contribute to the general ERA objective of promoting inter-disciplinary research, in particular between machine learning and fields such as biology, medicine, linguistics, signal processing and vision.