
EXTRACTION OF CONTENT: RESEARCH AT NEAR-MARKET

Objective

The aim of ECRAN is to develop a new generation of information extraction applications for telematic services with substantial textual content, such as interactive TV. Intended for professional users and the general public, the applications will rely on the development of on-line textual content in financial information from specialised newswire services, and market data on the Internet. Extraction of information for a given customer will be based on a personal user profile. The project will also develop lexicon-tuning technology for information extraction, enabling the linguistic knowledge of the system to be extended and adapted to new text domains automatically.
Market Studies
In the first half of the year ECRAN completed a market study and a user requirements analysis for two applications of Information Extraction (IE).
For news filtering, information extraction is a tool for which there is already a clear demand in the market but there are only a handful of products available for English, generally of limited quality. Feedback from the users also emphasised IE as a crucial tool for providing on-line television services with access to usable information. User requirements have placed a special emphasis on the need for timeliness and precision of information.
The initial market potential of ECRAN lies with the top 500 companies and in specialist markets such as information providers and the financial industry, because the technology is currently costly to develop. A market already exists for IE on a very small scale in the US, but the products are expensive and limited to the English language. ECRAN specifically aims to overcome these two limitations. Availability of good IE products for Internet access has also been identified as an absolute necessity.
The Market for News Filtering
The market for ECRAN's news filtering is based on the interests of experts from a range of financial institutions. Consultants and fund managers in companies such as Union Invest of Frankfurt are interested in automated signalling procedures that are triggered when specific events occur (e.g. an interest rate rises by more than 0.3%). Insurance companies, such as HDI, Hanover, are interested in a procedure that searches databases for specific legal facts (e.g. fire cases with injuries). Database providers and news agencies, such as Creditreform and VWD, Eschborn, are interested in extracting company information (financial figures, business development, management successions) in order to fill a database and to route specific information to appropriate client groups (chief economists, financial analysts, board chairs, competitive intelligence professionals).
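As an illustration only, the signalling idea mentioned for fund managers could be sketched as a simple rule applied to already-extracted events. The class name, field names and sample values below are hypothetical, and the sketch assumes that "0.3%" means a rise of 0.3 percentage points; the source does not say how ECRAN represents such events.

```python
from dataclasses import dataclass


@dataclass
class RateEvent:
    """Hypothetical record for an extracted interest-rate change."""
    instrument: str
    old_rate: float  # in percent
    new_rate: float  # in percent


def should_signal(event, threshold=0.3):
    """Return True when the rate rose by more than `threshold` percentage points."""
    return (event.new_rate - event.old_rate) > threshold


# Invented sample events, for illustration only.
events = [
    RateEvent("hypothetical 3-month rate", 3.00, 3.50),
    RateEvent("hypothetical 10-year rate", 5.75, 5.85),
]

# Only the first event rises by more than 0.3 points.
alerts = [e.instrument for e in events if should_signal(e)]
print(alerts)
```

In such a design the extraction step and the user's rule stay separate, so a profile could change the threshold without touching the extraction system.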
The Market for Interactive Television
A number of factors are transforming the television business, in such a way that Thomson is considering IE as a key technology for interactive television. These factors include the explosion in the number of channels, the recent release of standards for electronic TV guide management, and the rapid growth forecast in the market for TV sets handling this standard. In the new market of 'interactivity', where personal computers and network computers are competing, several TV set manufacturers including Thomson want to enhance traditional TV sets with new functionality and 'intelligence'. Automatic handling of an electronic TV guide with the help of IE technology is a particular target.
User requirements
Investigation of user requirements for the news filtering domain showed that users have a good understanding of the significance and benefits of IE. The filtering concept is well understood and its development is anticipated, especially in scenarios where a user describes a complex factual information need once and then receives results each time such a fact occurs in the news. In particular, better use and exploration of electronically stored text documents inside the company was thought to be very important, requiring integration in Intranet systems.
With the anticipated costs for such a new technology in mind, users argued that precision must be very high and the range of allowable information sources should be very broad (including databases, news wires, intranets, the Internet and workgroup network computing). Information delivery should be ergonomic and customisable to the exact needs of the user; the US news filtering services Pointcast and NewsPage are seen as good examples in this area.
Information Extraction Technology Development
Basic prototypes have already been developed for extraction of company news information in English from the World Wide Web, for lexical tuning in Italian, and for demonstrating extraction of information about movies in French. These will be refined and integrated to yield fully working prototypes for IE and lexical tuning in English, French and later Italian. User profile processing modules will first be integrated into the English prototype.
The English language IE system is being ported to handle information about films/videos in French and company news in Italian. The mock-up demonstrates integration of IE in the Newscan/Internet system. Newscan on Internet is a news filtering service developed and run by Smart Information Services GmbH. The Newscan search engine collects business news from around 300, mostly German, Web sites and allocates it to 150 or so industrial categories. Users of Newscan can build their own filters and receive the filtered news by daily email.
The information extracted from management succession news will be displayed in a tabular summarised form inside a dedicated category of the service (Management Succession or 'Personalveränderungen'). It will be updated weekly. It may also present additional categories with extracted information on specific topics in the same way (joint ventures, economic indicators, financial figures of companies). These categories and the presentation of information provide precise and selected business information in a summarised format for quick access.
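To make the idea of a tabular summary concrete, the following minimal sketch shows one way extracted management succession facts could be held as records and rendered as a plain-text table. The record fields (company, post, incoming, outgoing) and the sample entries are assumptions made for illustration; the source does not specify the template ECRAN or Newscan actually uses.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SuccessionRecord:
    """Hypothetical template for one management succession event."""
    company: str
    post: str
    incoming: Optional[str] = None  # person taking up the post, if extracted
    outgoing: Optional[str] = None  # person leaving the post, if extracted


def to_table(records):
    """Render records as an aligned plain-text table; '-' marks unfilled slots."""
    header = ("Company", "Post", "Incoming", "Outgoing")
    rows = [header] + [
        (r.company, r.post, r.incoming or "-", r.outgoing or "-") for r in records
    ]
    widths = [max(len(row[i]) for row in rows) for i in range(len(header))]
    return "\n".join(
        "  ".join(cell.ljust(w) for cell, w in zip(row, widths)).rstrip()
        for row in rows
    )


# Invented sample data, for illustration only.
records = [
    SuccessionRecord("Example AG", "CEO", incoming="A. Muster", outgoing="B. Beispiel"),
    SuccessionRecord("Sample GmbH", "CFO", incoming="C. Test"),
]
table = to_table(records)
print(table)
```

Leaving unfilled slots explicit keeps partially extracted events visible in the summary rather than silently dropping them.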
Lexical Tuning Technology
A common cross-language lexical format has been designed to support the different approaches to lexical tuning and exploitation in IE systems. A set of lexical tuning methods has been established, together with the definition and content of the tuning process. The design of evaluation methods (both for IE and lexical tuning) is under way. A partial mock-up of this is being integrated for Italian.
User and Domain Modelling
The results of analysing corpora and user expectations and profiles have formed the requirements specification both for the modelling of knowledge about the domain and users, and for the interaction with the user. Development of the user modelling component has already begun and a basic demonstrator is ready. The prototype includes a domain knowledge base, user models, and a set of tools for managing knowledge and presenting it to the user in a simple form.
The Way Ahead
The next year will be devoted to the implementation of lexical tuning methods. An essential task will be evaluating lexical tuning by measuring the resulting improvement in the IE process. There will be substantial integration work before the prototypes can be tested by end-users. The ECRAN prototype will then be tested by Thomson to dynamically build personalised electronic TV program guides from stock textual sources, and feed new prototypes of interactive TV set interfaces.
The cost of current Information Filtering (IF) systems restricts their usage to professional users in specific areas. A customizable system would have the potential to support a wider range of applications, and more effectively give access to information from various market sectors (scientific/technical, travel services, TV-shopping/programs, …) to corporate and other users.

The project will develop a comprehensive and integrated IF system based on technologies developed by the experienced academic partners: Information Extraction systems, validated by a high score in the MUC benchmarks, Lexicon Tuning technology explored by some of the ECRAN partners …

The system developed during the project will better meet the need for automatically filtered information from online sources, thanks to the tuning technology developed. It will be available for different EU languages (English and French as main languages, and additionally Italian). It will first address the needs of professional users of financial or market information services. In the medium term, it will be possible to use the technologies developed in this project for applications accessible to a wider range of citizens.

The typical application scenarios will be developed during the validation phase, in which integrating the ECRAN prototype into commercial information filtering services on users' sites will allow its performance to be measured (through recall and precision).
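For readers unfamiliar with these two measures, the sketch below computes them in their standard form: precision is the share of extracted facts that are correct, and recall is the share of reference facts that were found. The fact tuples and names are invented for illustration, and the exact scoring procedure used in the ECRAN validation is not described in the source.

```python
def precision_recall(extracted, reference):
    """Compare extracted facts against a hand-annotated reference set."""
    extracted, reference = set(extracted), set(reference)
    correct = len(extracted & reference)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(reference) if reference else 0.0
    return precision, recall


# Hypothetical (person, post, company) facts, for illustration only.
reference = {
    ("A. Smith", "CEO", "Acme"),
    ("B. Jones", "CFO", "Acme"),
    ("C. Brown", "Chairman", "Beta"),
}
extracted = {
    ("A. Smith", "CEO", "Acme"),   # correct
    ("B. Jones", "CEO", "Acme"),   # wrong post, so it counts as an error
}

p, r = precision_recall(extracted, reference)
print(f"precision={p:.2f} recall={r:.2f}")
```

Here one of the two extracted facts is correct (precision 0.5) and one of the three reference facts was found (recall about 0.33), which shows how a system can be fairly precise while still missing much of the available information.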

Progress and Results

The results to be achieved include a complete IE system, including lexicon tuning technology. This will be applicable to other application areas, since shifting to a new domain will require only customization.

At the end of the second year, an intermediate prototype including the results of the first phase will be demonstrated.

Results of the project will be accessible through a repository site. Three technical reports will be made available to the community:
- "Definition of lexical formalisms", by June 96
- "Heuristics for automatic tuning", by March 97
- "Establishing base lexicon from corpus", by May 97

Exploitation

The prototype developed by the ECRAN project should be completed in August 1998 and will pass through a validation phase of at least four months of extensive use in real contexts. This validation phase will demonstrate the value of the project results to the information industry, and provide directions for future development and deployment of this technology in new applications and sectors.

A set of software tools and technologies for Information Extraction and for Lexical Knowledge Base modelling and manipulation will be produced by ECRAN. The baseline concept could be used in online IE/browsing servers on the Internet or for new-generation Teletext standards. Further deployment may involve IE software agents for teleservices, supported by Interactive Television applications.

Funding Scheme

CSC - Cost-sharing contracts

Coordinator

Thomson-CSF
Address
Domaine De Corbeville
91404 Orsay
France

Participants (4)

National Centre for Scientific Research Demokritos
Greece
Address

153 10 Aghia Paraskevi
SMART INFORMATION SERVICES GMBH
Germany
Address
105B, Heinrich Mann Allee
14473 Potsdam
University of Sheffield
United Kingdom
Address
Regent Court Portobello Street
S1 4DP Sheffield
Università degli Studi di Ancona
Italy
Address
Via Brecce Bianche 65
60131 Ancona