Unlocking topicality in text - foreground and background information in written language

Objective

This language technology project aims to bridge the gap from clausal syntax to text, and show how the syntactic mechanisms of the language indicate topical themes in text. The project will investigate a large number of texts using both human assessments of foreground and background statements and state-of-the-art syntactic analysis tools to chart known and newly found systematic differences between how foreground and background themes are presented.

OBJECTIVES
A bottleneck for improving today's information management systems is that we know little about texts as texts. Systems view texts as simple sets of words or terms, discarding information such as clause style and argument structure as noise. This project aims to bridge the gap from syntax to text, and show how syntactic mechanisms of language, which primarily concern clause-internal structure, carry text-level information as well. Once we are able to chart some features of the topical progression in a text, we will provide a road map for algorithms for further processing: indexing and search, summarisation, report generation, and optical text recognition are all application areas that would benefit from better knowledge of how texts are put together.

DESCRIPTION OF WORK
We will take a large number of texts in several languages and partition their clauses into a number of graded categories according to foregroundedness. These clause categories can then be used in different ways for indexing, multi-document summarisation, and text item similarity calculation. This first assessment project takes the form of an experiment on text. If the experiment is successful, it opens up an entire research field, which we will continue examining in a future project.
1. Assemble corpus. If possible we will use the multilingual TREC corpus.
2. Define prototypical clause types based on our theory of foregroundedness.
3. Use human test subjects to partition clauses according to prototypical type.
4. Find and explain formal differences between types of clause as shown by test subjects, based on theory of transitivity.
5. Build tools to automatically identify clause types.
6. Index large number of texts using tools, and run test sets of information retrieval queries.
7. Result dissemination.
8. Plan for continued and refined experimentation.
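
As an illustration of how graded clause categories might feed into indexing and similarity calculation (step 6 above), the following is a minimal sketch, not the project's method. The three grade names, the weights in GRADE_WEIGHTS, and the function names are all hypothetical, chosen only for illustration. Clauses are assumed to arrive already labelled with a grade; terms from foreground clauses then count more towards the document's term vector than terms from background clauses.

```python
import math
from collections import defaultdict

# Hypothetical weights per foregroundedness grade (an assumption for
# illustration; the project description does not specify any values).
GRADE_WEIGHTS = {"foreground": 1.0, "intermediate": 0.5, "background": 0.25}


def weighted_term_vector(clauses):
    """Build a term vector from (clause_text, grade) pairs.

    Each whitespace-separated token adds the weight of the grade of the
    clause it occurs in, instead of a plain count of 1.
    """
    vector = defaultdict(float)
    for text, grade in clauses:
        weight = GRADE_WEIGHTS[grade]
        for token in text.lower().split():
            vector[token] += weight
    return dict(vector)


def cosine(a, b):
    """Cosine similarity between two sparse term vectors (dicts)."""
    dot = sum(value * b.get(term, 0.0) for term, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


doc_a = weighted_term_vector([
    ("the minister resigned", "foreground"),
    ("the weather was cold", "background"),
])
doc_b = weighted_term_vector([
    ("the minister spoke", "foreground"),
])
similarity = cosine(doc_a, doc_b)
```

With this weighting, two documents that share foreground terms score as more similar than two that share only background terms, which is one way the clause categories could differ from a plain set-of-words index.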

Funding Scheme

CSC - Cost-sharing contracts

Coordinator

SWEDISH INSTITUTE OF COMPUTER SCIENCE
Address
Isafjordsgatan 22
164 29 Kista
Sweden

Participants (1)

CONEXOR OY
Finland
Address
Porrassalmenkatu 19 A 15
50100 Mikkeli