CORDIS - EU research results
Content archived on 2024-06-18

Index based Statistical Analysis of Large Text Corpora

Final Report Summary - ISALTEC (Index based Statistical Analysis of Large Text Corpora)

Summary of the Project Objectives
The scientific goal of ISALTeC was to initiate a fundamental study of an index structure for natural language processing that can compactly represent left and right context of arbitrary length. To this end, ISALTeC set four immediate objectives. The main objective was the development and implementation of index-based methods for the extraction of meaningful phrases from raw-text data. Three side objectives were the application of the developed techniques to machine translation of phrases, detection of text-reuse in the E-Humanities, and detection of important genome subsequences.

Extraction of meaningful phrases. We successfully developed and implemented an unsupervised algorithm that exploits the index structure to achieve this objective; it relies on no linguistic knowledge and works at the character level. It is based on three natural assumptions: (i) phrases occur in different contexts; (ii) sentences/paragraphs are built of possibly overlapping phrases; (iii) phrases overlap on so-called function words, i.e. words such as prepositions, articles and pronouns that bear little meaning alone but enable a smooth grammatical transition from one phrase to the next.
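Assumption (i) can be illustrated with a toy sketch that scores word n-grams by the number of distinct left and right contexts in which they occur. This is only a word-level approximation of the project's character-level, index-based method; all names and thresholds are illustrative.

```python
from collections import defaultdict

def phrase_candidates(sentences, max_len=3, min_freq=2):
    """Toy scoring of phrase candidates: a candidate is an n-gram that
    occurs at least min_freq times, scored by how many distinct left
    and right contexts it is seen in (assumption (i) of the model)."""
    left, right, freq = defaultdict(set), defaultdict(set), defaultdict(int)
    for sent in sentences:
        toks = ["<s>"] + sent.split() + ["</s>"]
        for n in range(1, max_len + 1):
            for i in range(1, len(toks) - n):
                gram = tuple(toks[i:i + n])
                freq[gram] += 1
                left[gram].add(toks[i - 1])    # word before the n-gram
                right[gram].add(toks[i + n])   # word after the n-gram
    # score: a candidate is only as contextually varied as its weaker side
    return {g: min(len(left[g]), len(right[g]))
            for g, f in freq.items() if f >= min_freq}
```

On such counts, frequent words with many distinct neighbours (the function-word candidates) naturally receive the highest scores.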

These general considerations about natural language were transformed into a mathematical model of the "phrases" of a corpus based on statistical evidence. Using the index structure and the phrase model, we managed to efficiently analyze datasets of different sizes (scaling from a couple of MB to several GB), different languages (e.g. English, German, French, Dutch, Spanish, Russian, Bulgarian, Finnish, Arabic, Chinese) and different domains (e.g. politics, philosophy, Wikipedia). In all settings the algorithm successfully determines a set of function words and corpus-specific continuous phrases.

We further enhanced our approach to decompose longer continuous phrases into shorter ones, as long as a resulting subphrase can itself be decomposed. For most languages (Chinese and languages with a similar writing system) we obtain a hierarchical representation of the entire corpus. The base layer of the hierarchy usually consists of corpus-specific frequent words. The further layers reveal how simpler phrases or words combine into more complex phrases. In this way we obtain a network that enables a systematic exploration of the characteristic and important "language pieces" (words, phrases, subphrases) of the corpus. These studies and algorithmic results are valuable for communities dealing with Big Text Data and Data Mining, which can use our methods to structure corpora and obtain a general overview of the information represented by raw-text data.
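The layered decomposition can be pictured with a small sketch: given an inventory of already-detected phrases, a longer phrase is recursively split into known subphrases, one hierarchy layer per split. This is a hypothetical reconstruction of the idea, not the project's implementation.

```python
def decompose(phrase, inventory):
    """Recursively split a phrase into two shorter phrases drawn from
    the inventory; each successful split adds one layer to the
    hierarchy. Unsplittable phrases are returned as leaves (tuples)."""
    words = tuple(phrase.split()) if isinstance(phrase, str) else tuple(phrase)
    for cut in range(1, len(words)):
        lhs, rhs = words[:cut], words[cut:]
        if lhs in inventory and rhs in inventory:
            # both halves are known phrases: recurse into each half
            return [decompose(lhs, inventory), decompose(rhs, inventory)]
    return words
```

For instance, with the inventory {("sat",), ("on", "the", "mat"), ("on", "the"), ("mat",)}, the phrase "sat on the mat" decomposes into [("sat",), [("on", "the"), ("mat",)]], a two-layer fragment of the hierarchy.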

Machine Translation of Phrases. The objective of this task was to investigate to what extent contextual information can improve the translation of phrases. Building on the results of the first task, we developed several techniques for the extraction of translation equivalents, including discontinuous phrases. We also developed an efficient formal algorithm for the extraction of discontinuous phrases based on the traditional Statistical Machine Translation notion of phrases. However, we were unable to apply the extracted translation equivalents to an end-to-end translation of phrases.

In this task we started from a bilingual corpus in which each sentence in the first language was paired with its translation in the second language. To extract continuous translation pairs we developed two approaches. A general and straightforward approach uses the techniques of the first task: it extracts the phrases represented in each of the two monolingual parts of the corpus and then uses the given correspondence of sentences to gather statistical evidence on which phrases are likely to co-occur. In the second approach we did not differentiate between the languages. Instead, we reversed each original sentence and attached its translation to it. In this way the beginnings of the two sentences come close to each other and can form joint "bilingual phrases". Hence, applying our original technique for phrase detection, we can directly spot good translations of sentence-initial phrases. A similar construction can be applied to match final phrases; in practice we represent both initial and final phrases in both languages. Both techniques yielded excellent results for translation pairs of continuous phrases. Besides capturing single words, our results extend to multi-word phrases that need not be literal translations or preserve the word order.
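The second approach can be sketched as follows. The helper names are illustrative, and the counting is deliberately crude (first words only), whereas the project ran its full phrase-detection machinery over the joint sequences.

```python
from collections import Counter

def joint_sequences(pairs):
    """Reverse each source sentence and append its translation, so the
    beginnings of both sentences become adjacent and ordinary
    (monolingual) phrase detection can find joint bilingual phrases."""
    return [list(reversed(s.split())) + t.split() for s, t in pairs]

def initial_pair_counts(pairs):
    """Count which (source first word, target first word) pairs become
    neighbours in the joint sequences: crude statistical evidence for
    translation candidates of sentence-initial words."""
    counts = Counter()
    for seq, (s, _) in zip(joint_sequences(pairs), pairs):
        k = len(s.split())                # junction: last source token
        counts[(seq[k - 1], seq[k])] += 1
    return counts
```

Across many sentence pairs, recurring neighbours at the junction are exactly the "initial phrases" whose translations the method spots directly.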

However, these approaches are biased towards continuous phrases, so they fail if a common phrase is split by, say, an emphasising expression. To address this issue, we developed an approach that builds a dual hierarchical structure. On one side it computes parallel phrases, which are not necessarily continuous; on the other side it provides the set of sentences in which these phrases co-occur. The process starts with a phrase or subphrase in one of the two languages and the set of sentences in which it occurs. Each step is triggered by statistical evidence for a phrase as well as combinatorial evidence that the phrase occurs in different contexts in the selected instances.

An additional problem that we looked at was the ambiguous translation of words or phrases: one and the same phrase can be translated by different synonyms or expressed in a paraphrastic way. This suggests that we have to deal with classes of (almost) equivalent phrases and their translations, not merely with individual phrases or words. Accordingly, we developed a technique that spots such classes. It naively assumes that different translations of the same phrase are similar in meaning and propagates this assumption along matched pairs of phrases. The technique yields semantically coherent classes in both languages.
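One simple way to realise this propagation, sketched under the assumption that matched phrase pairs are already available, is to take connected components of the bipartite translation graph via union-find; the function name and the "src"/"tgt" tags are illustrative.

```python
def translation_classes(pairs):
    """Propagate 'translations of the same phrase are similar in
    meaning' along matched (source, target) pairs: the connected
    components of the bipartite translation graph are the classes."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:               # path-halving union-find
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for s, t in pairs:
        # tag nodes by language so identical spellings stay distinct
        parent[find(("src", s))] = find(("tgt", t))

    classes = {}
    for node in list(parent):
        classes.setdefault(find(node), set()).add(node)
    return list(classes.values())
```

Phrases linked through a shared translation (e.g. two synonyms matched to the same target word) end up in the same class, in both languages at once.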

Finally, we solved a formal problem suggested by Alexander Fraser, the team leader of the Machine Translation Group at CIS: we developed an algorithm of optimal complexity for the extraction of phrase candidates with a fixed number of gaps. In the general case the algorithm is almost optimal.
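The report does not describe the optimal algorithm itself, but the notion of a phrase candidate with a fixed number of gaps can be illustrated by naive enumeration: gaps+1 non-empty contiguous blocks, each pair of consecutive blocks separated by at least one skipped token. This brute force is purely illustrative and is nowhere near optimal complexity.

```python
def gapped_candidates(tokens, gaps=1):
    """Naively enumerate phrase candidates with exactly `gaps` gaps:
    each candidate is a tuple of gaps+1 non-empty contiguous blocks,
    with at least one skipped token between consecutive blocks."""
    n = len(tokens)

    def blocks(start, remaining):
        # choose the next block tokens[i:j], then place the rest
        for i in range(start, n):
            for j in range(i + 1, n + 1):
                head = tuple(tokens[i:j])
                if remaining == 0:
                    yield (head,)
                else:
                    # j + 1 enforces a non-empty gap before the next block
                    for tail in blocks(j + 1, remaining - 1):
                        yield (head,) + tail

    yield from blocks(0, gaps)
```

For ["a", "b", "c"] with one gap the only candidate is (("a",), ("c",)); the quadratic blow-up per gap is what makes an optimal algorithm non-trivial.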

Text-reuse in E-Humanities. The objective of this task was to apply the automatically extracted phrases to topics and problems of interest in the E-Humanities. It turned out that our general technique for phrase extraction can easily be enhanced to spot large passages of almost repetitive text: in such cases our algorithm spots very long strings as phrases, an effect that is beneficial for the detection of text-reuse. We characterised each sentence/paragraph by the phrases into which the algorithm decomposed it, ordered by length and frequency. In this way we rearrange the sentence as a sequence of phrases in which the suspiciously long and infrequent phrases come first. Using simple matching techniques and thresholding we can then detect passages with high orthographic similarity. Ideas along these lines gave rise to a PhD project that studies advanced alignment methods for similar documents using the index structure.
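A minimal sketch of this fingerprinting, assuming the per-sentence phrase decomposition and corpus frequencies are already computed (both function names and the threshold are illustrative):

```python
def reuse_signature(phrases, freq, top=3):
    """Order a sentence's phrases so that long, infrequent ones come
    first; the leading phrases act as a fingerprint of the passage."""
    return sorted(phrases, key=lambda p: (-len(p), freq.get(p, 0)))[:top]

def looks_reused(sig_a, sig_b, threshold=2):
    """Simple matching with thresholding: flag two passages whose
    signatures share at least `threshold` phrases."""
    return len(set(sig_a) & set(sig_b)) >= threshold
```

Suspiciously long and rare phrases dominate the signature, so two passages that reuse the same source text tend to share signature phrases even when shorter, common material differs.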

The results can be of interest for institutions and organisations that wish to detect suspiciously similar documents, e.g. different versions of laws.

Automatic detection of important genome subsequences. The objective of this task was to verify to what extent the "phrase structure" of written texts resembles the arrangement of amino acids in protein sequences. We collaborated with our colleagues at RostLab at TU Munich and set up two experiments on two different sets of protein sequences. On each set we applied our algorithm for the detection of phrases. Technically, it behaved very differently than on language data, and human inspection led to the conclusion that it did not manage to detect known regularities in the biological data.