The Community Research and Development Information Service - CORDIS
Information & Communication Technologies

Language Technologies



Please note that the project factsheets will no longer be updated. All information relevant to the project can be found on the CORDIS factsheet. This is updated on a regular basis with public deliverables, etc.

GALATEAS - Generalised Analysis of Logs for Automatic Translation and Episodic Analysis of Searches

250430 - Pilot B


At a glance

PSP -2009.5.1 - Machine Translation for the Multilingual Web

Challenge

Every day, millions of search queries are issued to content providers, ranging from all-purpose web information sites and digital library sites to vendor sites. From a careful analysis of such queries, content providers could understand what information users are really looking for, which strategies they currently use to retrieve digital objects, and to what extent the content offered by the web site matches the needs of end users. Existing services offer tools that segment user queries into words and provide statistics about the occurrences of single words, but this falls far short of content providers' needs: they treat words merely as strings of characters; they do not match user searches against the informative backbone of the content aggregation; and they see each search as an isolated event, with no attempt to determine sequential patterns of search activity.
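As a rough illustration of the gap described above, the following Python sketch contrasts a per-word frequency count with a grouping of queries into per-user search sessions. The log records, their field layout, the 30-minute inactivity threshold and the function names are all assumptions made for this example; they are not taken from GALATEAS or from any existing log analysis service.

```python
from collections import Counter
from datetime import datetime, timedelta

# Hypothetical log records: (user id, timestamp, query). Invented for illustration.
LOG = [
    ("u1", "2012-03-01 10:00:00", "van gogh"),
    ("u1", "2012-03-01 10:01:30", "van gogh sunflowers"),
    ("u2", "2012-03-01 10:02:00", "gogol dead souls"),
    ("u1", "2012-03-01 15:00:00", "monet water lilies"),
]


def word_counts(log):
    """Baseline: split each query on whitespace and count single words."""
    counts = Counter()
    for _user, _timestamp, query in log:
        counts.update(query.lower().split())
    return counts


def sessions(log, gap=timedelta(minutes=30)):
    """Group each user's queries into sessions.

    A new session starts when more than `gap` has passed since the
    user's previous query (the threshold is an assumed value).
    """
    by_user = {}
    # ISO-style timestamps sort correctly as strings.
    for user, timestamp, query in sorted(log, key=lambda r: (r[0], r[1])):
        when = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S")
        user_sessions = by_user.setdefault(user, [])
        if user_sessions and when - user_sessions[-1][-1][0] <= gap:
            user_sessions[-1].append((when, query))
        else:
            user_sessions.append([(when, query)])
    # Drop the timestamps and keep only the ordered queries per session.
    return {
        user: [[query for _, query in session] for session in user_sessions]
        for user, user_sessions in by_user.items()
    }


if __name__ == "__main__":
    print(word_counts(LOG).most_common(3))
    print(sessions(LOG))
```

The word count treats "van" and "gogh" as unrelated strings, while the session view keeps "van gogh" and "van gogh sunflowers" together as one refinement sequence by the same user, which is the kind of sequential pattern the paragraph says current tools ignore.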

Goal

GALATEAS aims to provide the first query-oriented machine translation system. This technology will allow any user who speaks one of the GALATEAS languages to type a query in their mother tongue and retrieve documents and metadata in several languages. By supporting query log analysis for at least seven languages, GALATEAS will enable content providers to understand what their users are looking for and in which language. For content providers, this means the possibility of much more targeted acquisition of new content; for users, it means access to more pertinent content.

Innovation

Unlike mainstream services in this field, GALATEAS services will not consider the standard structured information in web logs (e.g. click rate, visited pages, users' paths inside the document tree) but the information contained in queries, from the point of view of language interpretation. On the machine translation side, GALATEAS will investigate technologies that enable statistical machine translation systems to provide meaningful results for short, syntax-poor, uncontextualised texts such as queries to search engines.

The result

GALATEAS will set up two web services:

  • LangLog: it will analyse transaction logs containing queries to search engines for a given content provider. By applying statistical technologies coupled with language-oriented services, it will produce reports on the informational needs of the users accessing that particular aggregation. In other words, just as standard log analysis systems generalise the paths users take inside a web site, LangLog will generalise the actions that information seekers perform in order to find content inside a searchable collection of digital objects. LangLog will provide services for at least seven languages, namely Italian, French, English, German, Dutch, Modern Arabic and Polish.
  • QueryTrans: it will translate queries coming from an external search engine into several target languages. The external search engine will use these translations to return results to the user in languages different from the one in which the query was formulated.

Impact

The proposed solution will significantly reduce the cost of ownership of integrating cross-language retrieval solutions. It will maximise retrieval quality compared with standard translation services, which are known to perform badly on syntax-poor texts such as queries. Customers of GALATEAS will be organisations running content delivery web sites powered by a search engine (digital libraries, content aggregators and merchant sites).

Where will the project be present?

Scientific conferences in the field of NLP and DL.

Presence at major trade fairs in the field of IT support for digital libraries.

CeBIT and SMAU in 2012 and 2013

 

Co-ordinator

Contact Person:

Name: Frédérique Segond
Tel: +33(0)476615076
Fax: +33(0)476615199

Organisation: XEROX




This page is maintained by: Susan Fraser (email removed)