
A Sequence Classification Framework for Human Language Technology

Periodic Reporting for period 4 - SEQCLAS (A Sequence Classification Framework for Human Language Technology)

Reporting period: 2021-02-01 to 2021-07-31 (full project duration: 2016-08-01 to 2021-07-31)

The goal of this project was to develop a unifying framework of new methods for sequence classification and thus to push the state of the art in automatic speech recognition and statistical machine translation. The unifying principle in these human language technology (HLT) tasks is that the system has to process a sequence of input data and to generate an associated sequence of output symbols (i.e. words or letters) in a natural language like English or Arabic. For speech recognition, the input is the acoustic waveform, or the sequence of feature vectors after feature extraction, and the task is to generate a correct transcription of the spoken word sequence. For machine translation, the input is the sequence of words (or letters) in a source language, and the output to be generated is a well-formed sequence of words (or letters) in the target language. There is a third HLT task that is very similar to speech recognition and that we therefore occasionally consider: the recognition of text images, i.e. of printed or handwritten text.

Speech recognition, machine translation and text image recognition are key technologies that are needed in a large number of everyday situations:

* information access and management:
Due to the progress in information technology, huge amounts of unstructured speech and text data are now available on the World Wide Web and in the archives of companies, organizations and individuals. These data exist in three forms:
1. speech data in audio and video documents;
2. digital text as in books, newspapers, patents, word-processor documents and e-mails;
3. image text (printed or handwritten) in scanned books and documents.

* human-machine and human-human communication using speech:
applications that can support humans in communication (customer help lines, call centers, e-commerce, human-human communication using speech translation, question-answering systems, ...).

Despite the huge progress made in the field, the specific aspects of sequence classification had not been addressed adequately in past research. Instead of developing specific solutions for each particular task independently, our approach is to identify fundamental problems in sequence classification that show up across the three HLT tasks and to develop a novel unifying framework for them. In agreement with the proposal, the work carried out was organized into five tasks:

* Task 1: a theoretical framework for sequence classification:
In principle, the starting point for virtually all successful approaches to sequence classification is the Bayes decision rule for optimum performance. In practice, however, many simplifications and approximations are used, and it is not clear how they affect the final performance. We have taken first steps towards developing a theoretical framework around the performance criterion that answers questions like: how does the system performance depend on the language model?
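As an illustration (our notation, not a formula quoted from the project's publications), the Bayes decision rule for sequence classification reads, with input sequence x_1^T = x_1 ... x_T and output word sequence w_1^N = w_1 ... w_N:

\hat{w}_1^N = \arg\max_{w_1^N} \, p(w_1^N \mid x_1^T) = \arg\max_{w_1^N} \, \{ p(w_1^N) \cdot p(x_1^T \mid w_1^N) \}

The first factor is the language model and the second the acoustic (or translation) model. Strictly speaking, this rule minimizes the expected sentence error; matching the decision rule to the word error rate, the performance criterion actually used in evaluations, requires a loss-dependent variant, and analyzing this mismatch is part of the theoretical framework developed here.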

* Task 2: consistent modelling:
Sequence classification requires probabilistic models whose parameters are learned from the training data. In this task, we put emphasis on the requirement that these models should be exactly the same in training and testing. The important result of this task is a framework that allows a direct combination of deep learning with the alignment concepts introduced for generative hidden Markov models. This framework of neural hidden Markov models has been applied both in speech recognition and in machine translation, achieving results competitive with state-of-the-art methods. In addition, we have worked on neural-network-based feature extraction directly from the speech waveform, which in the future could improve on traditional spectral analysis.
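To illustrate the idea (a sketch in our own notation), a neural hidden Markov model for translation keeps the classical decomposition into alignment and lexicon models, but realizes both factors as neural networks. With source sentence f_1^J, target sentence e_1^I and hidden alignment sequence a_1^I (a_i being the source position aligned to target position i):

p(e_1^I \mid f_1^J) = \sum_{a_1^I} \prod_{i=1}^{I} p(a_i \mid a_{i-1}, e_1^{i-1}, f_1^J) \cdot p(e_i \mid a_i, e_1^{i-1}, f_1^J)

The sum over alignments can be computed with the standard forward recursion, so exactly the same model is used in training and decoding, in line with the consistency requirement of this task.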

* Task 3: performance-aware training and modelling:
What ultimately matters for a sequence classification system is its performance. Most activities in this task have been concerned with language models and acoustic models. In language modelling, we have achieved significant improvements by using refinements of recurrent neural networks. For acoustic modelling, we have improved the concept of sequence discriminative training, which resulted in significant improvements in recognition accuracy. In addition, we studied the various ways in which the language model can be integrated into the training of an acoustic model or a translation model, in speech recognition and machine translation respectively. For speech-to-text translation, we studied an improved ASR-MT interface that better passes the ambiguities of the ASR output on to the MT engine in the cascaded approach.
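One standard form of such an integration at search time, given here only as an illustrative sketch, is a log-linear combination with a language model scale \lambda:

\hat{w}_1^N = \arg\max_{w_1^N} \, \{ \log p(x_1^T \mid w_1^N) + \lambda \cdot \log p(w_1^N) \}

The work in this task goes beyond this decoding-time combination by also integrating the language model into the training criterion of the acoustic or translation model.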

* Task 4: unsupervised training:
In conventional MT, the parallel sentence pairs required for training can be expensive to obtain. This raises the question: what can be done when no explicit sentence pairs are available, only monolingual data in each of the two languages? This situation is referred to as fully unsupervised training. There is an intermediate situation, called semi-supervised training, where some amount of parallel sentence pairs is available in addition to a large set of monolingual sentences in both the source and target languages. In this task, we worked on both types of unsupervised training.
For fully unsupervised training, we achieved results for bidirectional English-German translation without using any parallel data for training. The training of the system was based on monolingual data only, in combination with cross-lingual word embeddings and iterative back-translation. This system ranked first among the unsupervised translation systems in the WMT 2018 evaluation. Later work was concerned with semi-supervised training, where methods such as back-translation, sequential transfer and cross-lingual word embeddings were studied. With these methods of semi-supervised training, it is possible to obtain remarkable translation performance using only a small number of parallel sentence pairs for training.
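The following Python sketch illustrates the iterative back-translation loop described above; train_nmt and the models' translate method are hypothetical placeholders, not functions of any released toolkit:

    # Iterative back-translation: a minimal sketch under stated assumptions.
    # Hypothetical helpers: train_nmt(pairs) trains a translation model on
    # (source, target) sentence pairs; model.translate(sents) translates a
    # list of sentences. Neither belongs to a specific toolkit.

    def iterative_back_translation(mono_src, mono_tgt, init_pairs, rounds=3):
        """Alternately train tgt->src and src->tgt models, each round
        augmenting the real pairs with synthetic pairs obtained by
        translating monolingual data with the current reverse model."""
        pairs_fwd = list(init_pairs)                 # (src, tgt) pairs
        pairs_bwd = [(t, s) for s, t in init_pairs]  # (tgt, src) pairs
        for _ in range(rounds):
            model_bwd = train_nmt(pairs_bwd)         # tgt -> src
            synth_src = model_bwd.translate(mono_tgt)
            # Real target sentences paired with synthetic sources:
            pairs_fwd = list(init_pairs) + list(zip(synth_src, mono_tgt))

            model_fwd = train_nmt(pairs_fwd)         # src -> tgt
            synth_tgt = model_fwd.translate(mono_src)
            # Real source sentences paired with synthetic targets:
            pairs_bwd = [(t, s) for s, t in init_pairs] + list(zip(synth_tgt, mono_src))
        return model_fwd, model_bwd

In the fully unsupervised setting, init_pairs is empty and the very first model is bootstrapped from cross-lingual word embeddings instead of being trained on parallel data.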

* Task 5: research prototype systems:
The methods and algorithms developed have been integrated into the team's high-performance systems and have been evaluated on public databases and international benchmarks:
* general software toolkit (RETURNN) for sequence-to-sequence modelling for HLT tasks;
* for speech recognition: CHiME, Switchboard, Librispeech, TED-LIUM, WSJ;
* for machine translation: WMT tasks and IWSLT tasks 2017-2020.
The project has produced many results that go beyond the state of the art and were presented at scientific conferences and workshops.
There are five directions that we consider to be the most important ones:
* a theoretical framework and bounds on classification error;
* integration of language modelling into acoustic modelling for ASR and into translation modelling for MT;
* a unifying view of finite-state based acoustic modelling which allows us to handle various types of acoustic models (like hybrid HMM, CTC, RNN-T) in the same mathematical framework (a formula sketch follows this list);
* direct neural hidden Markov models for machine translation (rather than attention-based concepts);
* ANN-based feature extraction from the speech signal.
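As a sketch of the unifying finite-state view mentioned above (our notation, stated here only as an illustration): hybrid HMM, CTC and RNN-T can all be written as a sum over label alignment paths a that a collapsing function B maps to the output label sequence:

p(w_1^N \mid x_1^T) = \sum_{a:\, B(a) = w_1^N} \; \prod_{s} p(a_s \mid a_1^{s-1}, x_1^T)

The models then differ mainly in the path topology encoded in B (state repetitions for hybrid HMM, blank labels for CTC and RNN-T) and in which dependencies the per-position factor keeps: CTC drops the dependence on a_1^{s-1} entirely, while RNN-T retains the dependence on previously emitted non-blank labels.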