Periodic Reporting for period 4 - SEQCLAS (A Sequence Classification Framework for Human Language Technology)
Reporting period: 2021-02-01 to 2021-07-31
Speech recognition, machine translation and text image recognition are key technologies needed in a large number of everyday situations:
* information access and management:
Due to the progress in information technology, huge amounts of unstructured speech and text data are now available on the World Wide Web and in the archives of companies, organizations and individuals. These data exist in three forms:
1. speech data in audio and video documents;
2. digital text as in books, newspapers, patents, word-processor documents and e-mails;
3. image text (printed or handwritten) in scanned books and documents.
* human-machine and human-human communication using speech:
applications that support humans in communication, such as customer help lines, call centers, e-commerce, speech translation between humans, and question-answering systems.
Despite the huge progress made in the field, the specific aspects of sequence classification have not been addressed adequately in past research. Instead of developing a specific solution for each particular task independently of the others, our approach is to identify fundamental problems in sequence classification that show up across the three HLT tasks and to develop a novel unifying framework. In agreement with the proposal, the work carried out was organized into five tasks:
* Task 1: a theoretical framework for sequence classification:
In principle, the starting point for virtually all successful approaches to sequence classification is the Bayes decision rule for optimum performance. In practice, however, many simplifications and approximations are used, and it is not clear how they affect the final performance. We have taken first steps towards developing a theoretical framework around the performance criterion that answers questions like: how does the system performance depend on the language model? (The underlying decision rule is sketched below.)
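For reference, a standard form of this decision rule, written here in common notation (an observation sequence $x_1^T$ such as acoustic frames, source words or scanned text lines, and an output word sequence $w_1^N$; the exact decompositions studied in the project may differ):

```latex
% Bayes decision rule for sequence classification:
% pick the word sequence with maximum posterior probability,
% decomposed into a language model and an observation model.
\hat{w}_1^{\hat{N}}
  = \operatorname*{argmax}_{N,\,w_1^N} \; p(w_1^N \mid x_1^T)
  = \operatorname*{argmax}_{N,\,w_1^N} \;
    \underbrace{p(w_1^N)}_{\text{language model}} \cdot
    \underbrace{p(x_1^T \mid w_1^N)}_{\text{acoustic/translation model}}
```

In this decomposition, the influence of the language model $p(w_1^N)$ on the decisions, and hence on the error rate, can be analyzed separately from that of the observation model.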
* Task 2: consistent modeling:
Sequence classification requires probabilistic models whose parameters are learned from the training data. In this task, we put emphasis on the requirement that these models should be exactly the same in training and testing. The important result of this task is a framework that allows a direct combination of deep learning with the alignment concepts introduced for generative hidden Markov models. This framework of neural hidden Markov models has been applied both in speech recognition and in machine translation, with results competitive with state-of-the-art methods. In addition, we have worked on neural-network-based feature extraction directly from the speech waveform, which in the future could improve on traditional spectral analysis. (A toy sketch of the neural HMM idea follows.)
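To make the combination concrete, here is a deliberately tiny sketch, assuming nothing beyond NumPy: a small neural network provides the per-state emission scores, while the classical HMM forward recursion sums over all alignments. All shapes, weights and the left-to-right topology below are illustrative placeholders, not the project's actual models.

```python
# Toy "neural HMM": a neural network supplies emission log-probabilities
# p(x_t | s), and the HMM forward recursion sums over all alignments.
import numpy as np

rng = np.random.default_rng(0)

T, S, D, H = 6, 4, 8, 16   # frames, HMM states, feature dim, hidden dim

# toy input features, e.g. acoustic frames x_1^T
x = rng.normal(size=(T, D))

# a tiny MLP with random weights producing per-state emission scores
W1 = rng.normal(scale=0.1, size=(D, H))
W2 = rng.normal(scale=0.1, size=(H, S))

def emission_log_probs(x_t):
    h = np.tanh(x_t @ W1)
    logits = h @ W2
    return logits - np.log(np.sum(np.exp(logits)))  # log-softmax over states

# strictly left-to-right topology: stay in state s or advance to s+1
log_trans = np.full((S, S), -np.inf)
for s in range(S):
    log_trans[s, s] = np.log(0.5)
    if s + 1 < S:
        log_trans[s, s + 1] = np.log(0.5)

# forward recursion: alpha[t, s] = log p(x_1..x_t, state_t = s)
alpha = np.full((T, S), -np.inf)
alpha[0, 0] = emission_log_probs(x[0])[0]   # every path starts in state 0
for t in range(1, T):
    em = emission_log_probs(x[t])
    for s in range(S):
        alpha[t, s] = np.logaddexp.reduce(alpha[t - 1] + log_trans[:, s]) + em[s]

# sequence log-likelihood: sum over all alignments ending in the final state
print("log p(x_1^T) =", alpha[-1, S - 1])
```

The point of the construction is that the same model, the emission network together with the transition structure, is used in training and in testing, in line with the consistency requirement above.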
* Task 3: performance-aware training and modelling:
What ultimately matters for a sequence classification system is its performance. Most activities in this task have been concerned with language models and acoustic models. In language modelling, we have achieved significant improvements by refining recurrent neural networks. For acoustic modelling, we have improved the concept of sequence discriminative training, which resulted in significant improvements in recognition accuracy. In addition, we studied the various ways in which the language model can be integrated into the training of an acoustic model or a translation model, in speech recognition and machine translation respectively (the standard decoding-time combination that serves as the baseline is sketched below). For speech-to-text translation, we studied an improved ASR-MT interface that better passes on the ambiguities of the ASR results to the MT engine in the cascaded approach.
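As background to these integration studies: the simplest and most widespread form of language model integration, usually called shallow fusion, combines model scores log-linearly at decoding time rather than during training. A schematic sketch, where the score values and the lm_scale factor are hypothetical:

```python
# Schematic shallow fusion: combine an acoustic/translation model score
# with an external language model score log-linearly when ranking
# hypotheses. Scores and scale factor below are made-up toy values.
import math

def shallow_fusion_score(am_log_prob: float,
                         lm_log_prob: float,
                         lm_scale: float = 0.3) -> float:
    """Combined hypothesis score: log p_AM + lm_scale * log p_LM."""
    return am_log_prob + lm_scale * lm_log_prob

# toy usage: rescoring a small n-best list of (hypothesis, AM score, LM score)
nbest = [
    ("the cat sat", math.log(0.20), math.log(0.10)),
    ("the cat sad", math.log(0.22), math.log(0.01)),
]
best = max(nbest, key=lambda h: shallow_fusion_score(h[1], h[2]))
print("selected:", best[0])   # the LM overrules the slightly better AM score
```

Integrating the language model already during training, as studied in this task, goes beyond such a fixed log-linear combination.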
* Task 4: unsupervised training:
In conventional MT, the parallel sentence pairs required for training can be expensive to obtain. Therefore the question arises: what can be done when no explicit sentence pairs are available, only monolingual data in each of the two languages? This situation is referred to as fully unsupervised training. There is an intermediate situation, called semi-supervised training, where some amount of parallel sentence pairs is available in addition to a huge set of monolingual sentences in both the source and target languages. In this task, we worked on both types of unsupervised training.
For fully unsupervised training, we have achieved results for bidirectional English-German translation without using any parallel data. The training of the system was based on monolingual data only, in combination with cross-lingual word embeddings and iterative back-translation (sketched below). This system ranked first among the unsupervised translation systems in the WMT 2018 evaluation. Later work was concerned with semi-supervised training, where methods like back-translation, sequential transfer and cross-lingual word embeddings were studied. With these methods of semi-supervised training, it is possible to obtain remarkable translation performance using only a small amount of parallel sentence pairs.
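The structure of the iterative back-translation loop can be sketched as follows; the TranslationModel interface here is a hypothetical placeholder, not the project's actual system:

```python
# Schematic iterative back-translation for unsupervised NMT.
# TranslationModel is a hypothetical stand-in for a real NMT system;
# only the structure of the training loop is meant to be illustrative.
from typing import List, Tuple

class TranslationModel:
    """Hypothetical NMT model interface (placeholder implementation)."""
    def train(self, pairs: List[Tuple[str, str]]) -> None:
        pass  # a real model would update its parameters on (source, target) pairs
    def translate(self, sentences: List[str]) -> List[str]:
        return sentences  # identity placeholder; a real model would decode

def iterative_back_translation(mono_src: List[str],
                               mono_tgt: List[str],
                               rounds: int = 3):
    # in the fully unsupervised setting, both directions are initialized
    # e.g. from cross-lingual word embeddings and word-by-word translation
    src2tgt, tgt2src = TranslationModel(), TranslationModel()
    for _ in range(rounds):
        # translate monolingual target data back into the source language
        # to obtain synthetic (source, target) pairs for src2tgt ...
        synthetic_src = tgt2src.translate(mono_tgt)
        src2tgt.train(list(zip(synthetic_src, mono_tgt)))
        # ... and vice versa for the reverse direction
        synthetic_tgt = src2tgt.translate(mono_src)
        tgt2src.train(list(zip(synthetic_tgt, mono_src)))
    return src2tgt, tgt2src

models = iterative_back_translation(["ein Beispiel"], ["an example"])
```

Each round improves one direction using synthetic data produced by the other, which is why the two models can bootstrap each other from monolingual data alone.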
* Task 5: research prototype systems:
The methods and algorithms developed have been integrated into the team's high-performance systems and have been evaluated on public databases and international benchmarks:
* a general software toolkit (RETURNN) for sequence-to-sequence modelling of HLT tasks;
* for speech recognition: CHiME, Switchboard, LibriSpeech, TED-LIUM, WSJ;
* for machine translation: the WMT and IWSLT tasks 2017-2020.
Among the results, we consider the following five directions to be the most important:
* a theoretical framework and bounds on classification error;
* integration of language modelling into acoustic modelling for ASR and into translation modelling for MT;
* a unifying view of finite-state-based acoustic modelling, which allows us to handle various types of acoustic models (such as hybrid HMM, CTC and RNN-T) in the same mathematical framework (a minimal CTC example is sketched after this list);
* direct neural hidden Markov models for machine translation (rather than attention-based concepts);
* ANN-based feature extraction from the speech signal.
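To illustrate one member of this family of finite-state criteria: the CTC loss is available off the shelf in PyTorch, and a minimal toy example (random activations and arbitrary shapes, not tied to the project's systems) looks as follows:

```python
# Minimal CTC example using torch.nn.CTCLoss, one instance of the
# finite-state sequence criteria (hybrid HMM, CTC, RNN-T) covered by
# the unified framework. Toy shapes and random activations only.
import torch
import torch.nn as nn

T, N, C = 50, 2, 20          # input frames, batch size, labels (0 = blank)
S = 10                       # target sequence length

log_probs = torch.randn(T, N, C).log_softmax(dim=2).requires_grad_()
targets = torch.randint(1, C, (N, S), dtype=torch.long)   # no blanks in targets
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), S, dtype=torch.long)

ctc = nn.CTCLoss(blank=0)    # sums over all monotonic alignments
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()              # gradients flow back to the frame posteriors
print("CTC loss:", loss.item())
```

Like the HMM forward recursion sketched under Task 2, the CTC criterion marginalizes over all alignments between input frames and output labels, which is exactly the property the unifying finite-state view exploits.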