Resource selection and data fusion for Multimedia INternational Digital libraries

Project information

MIND

Grant agreement ID: IST-2000-26061

Closed project

Start date 1 January 2001

End date 31 December 2003

Funded under

Programme for research, technological development and demonstration on a "User-friendly information society, 1998-2002"

Total cost

€ 1 457 824,00

EU contribution

€ 874 684,00

Coordinated by

UNIVERSITY OF STRATHCLYDE
United Kingdom

CORDIS provides links to publicly available publications and results of projects carried out under the HORIZON framework programmes.

Links to results and publications related to individual FP7 projects, as well as links to certain specific categories of results such as datasets and software, are retrieved dynamically from OpenAIRE.

Results

Resource selection is the task of deciding which digital libraries a query is forwarded to. The MIND project has extended the decision-theoretic framework for this task. Current approaches (e.g. GlOSS, CORI) rank the libraries by their similarity to the query. In contrast, MIND considers costs that combine retrieval quality, time and money. Costs from these different sources are weighted with user-specified parameters to allow for different selection policies (e.g. good results, cheap results). The task is then to find an optimum selection, that is, for each library, the number of documents to retrieve from that library. Several enhancements of this basic approach have been developed in MIND. Firstly, a logistic function (instead of a linear one) is used for mapping retrieval status values (RSVs) onto probabilities of relevance. Secondly, several new methods for estimating retrieval quality have been developed (e.g. performing retrieval on a sample, assuming a normal distribution for the indexing weights, incorporating CORI). The evaluation shows that this model is at least competitive with, and in most cases better than, CORI. The decision-theoretic framework has been implemented and successfully integrated into MIND.
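The cost-based selection described above can be illustrated with a minimal Python sketch. It is not the MIND implementation: the `Library` fields, the logistic parameters `b0` and `b1`, the per-document time and price figures, and the greedy allocation strategy are all assumptions made for illustration. The greedy step gives a cost-minimal allocation only under the assumption that each library's marginal cost does not decrease as more documents are taken from it, which holds here because RSVs are listed best first.

```python
import heapq
import math
from dataclasses import dataclass


@dataclass
class Library:
    name: str
    rsvs: list            # estimated RSVs of the top documents, best first (hypothetical input)
    time_per_doc: float   # hypothetical time cost per retrieved document
    price_per_doc: float  # hypothetical monetary cost per retrieved document


def prob_relevance(rsv, b0=-4.0, b1=8.0):
    """Logistic mapping from an RSV to a probability of relevance (b0, b1 are made up)."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * rsv)))


def marginal_cost(lib, rank, w_quality, w_time, w_money):
    """Weighted expected cost of retrieving the document at 0-based `rank` from `lib`."""
    non_relevant = 1.0 - prob_relevance(lib.rsvs[rank])
    return (w_quality * non_relevant
            + w_time * lib.time_per_doc
            + w_money * lib.price_per_doc)


def select(libraries, n, w_quality=1.0, w_time=0.0, w_money=0.0):
    """Return {library name: number of documents to retrieve} for n documents in total."""
    weights = (w_quality, w_time, w_money)
    heap = [(marginal_cost(lib, 0, *weights), i, 0)
            for i, lib in enumerate(libraries) if lib.rsvs]
    heapq.heapify(heap)
    counts = {lib.name: 0 for lib in libraries}
    for _ in range(n):
        if not heap:
            break  # fewer than n documents are available overall
        _, i, rank = heapq.heappop(heap)
        lib = libraries[i]
        counts[lib.name] += 1
        if rank + 1 < len(lib.rsvs):
            heapq.heappush(heap, (marginal_cost(lib, rank + 1, *weights), i, rank + 1))
    return counts


libs = [
    Library("A", [0.9, 0.8, 0.3], time_per_doc=1.0, price_per_doc=0.0),
    Library("B", [0.7, 0.6, 0.5], time_per_doc=0.1, price_per_doc=0.0),
]
print(select(libs, 3))              # quality-only policy
print(select(libs, 3, w_time=1.0))  # policy that also penalises slow libraries
```

Changing the weights switches the selection policy without changing the algorithm, which is the point of combining the cost sources with user-specified parameters.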

A wrapper is a piece of software which receives a query from another component (in a standard format), transforms it, sends the transformed query to a third-party digital library (e.g. via HTTP GET or POST), receives documents (in most cases, HTML documents) from the library, transforms them into a standard format (by extracting information from the HTML document), and returns the transformed documents. The MIND project has developed wrappers for several third-party digital libraries: CNN, Times Online, National Gallery of Art (Washington, DC), Metropolitan Museum of Art, Fine Arts Museum (San Francisco), Web Gallery of Art and Google (using their API). The project has also modified wrappers from the Daffodil project for ACM, DBLP, CiteSeer and Achilles.
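The wrapper pipeline described above (translate the query, send it, parse the HTML, return documents in a standard format) might look roughly like the following Python sketch. Everything specific in it is an assumption: the query dictionary layout, the `q` and `n` URL parameters, the `<a class="result">` markup and the `library.example` host are hypothetical and do not describe any of the real libraries listed. The HTTP call is injected as a `fetch` function so the sketch runs without network access.

```python
from dataclasses import dataclass
from html.parser import HTMLParser
from typing import Callable
from urllib.parse import urlencode


@dataclass
class Document:
    """Standard-format result returned by the wrapper (hypothetical schema)."""
    title: str
    url: str


class ResultExtractor(HTMLParser):
    """Collects one Document per <a class="result"> element (hypothetical markup)."""

    def __init__(self):
        super().__init__()
        self.documents = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("class") == "result":
            self._href = attrs.get("href", "")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.documents.append(Document("".join(self._text).strip(), self._href))
            self._href = None


class Wrapper:
    def __init__(self, base_url: str, fetch: Callable[[str], str]):
        self.base_url = base_url
        self.fetch = fetch  # performs the HTTP GET and returns the HTML body

    def translate(self, query: dict) -> str:
        """Turn the standard-format query into the library's own URL syntax."""
        params = {"q": " ".join(query["terms"]), "n": query["max_results"]}
        return f"{self.base_url}?{urlencode(params)}"

    def search(self, query: dict) -> list:
        html = self.fetch(self.translate(query))
        parser = ResultExtractor()
        parser.feed(html)
        parser.close()
        return parser.documents[: query["max_results"]]


def fake_fetch(url: str) -> str:
    # Stand-in for a real HTTP request, so the example is self-contained.
    return '<ul><li><a class="result" href="/d/1">Water Lilies</a></li></ul>'


wrapper = Wrapper("http://library.example/search", fake_fetch)
print(wrapper.search({"terms": ["monet", "water"], "max_results": 5}))
```

In a real deployment `fetch` could be backed by `urllib.request.urlopen`; keeping it injectable also makes each wrapper testable against saved HTML pages.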

When distributed, heterogeneous digital libraries have to be integrated, one of the crucial tasks is to map between different schemas. As schemas may have different granularities, and as schema attributes do not always match precisely, a general-purpose schema mapping approach requires support for uncertain mappings. MIND has developed a declarative approach for defining and using uncertain schema mappings. Queries and documents are converted from the MIND model into DAML+OIL, a Semantic Web ontology language. The uncertain rules are expressed in probabilistic Datalog, since DAML+OIL (like similar ontology languages) lacks rules. The DAML+OIL model is serialised in XML, and the rules are transformed into XSLT style sheets that perform the actual transformation of queries and documents. This declarative approach is fully implemented in the MIND project.
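To make the idea of an uncertain mapping concrete, here is a small Python sketch in the spirit of probabilistic rules such as `0.8 author(D,V) :- creator(D,V)`. It does not use DAML+OIL, probabilistic Datalog or XSLT; the attribute names, the rule probabilities and the way several derivations of the same fact are combined (treating them as independent events) are assumptions chosen for illustration.

```python
from collections import defaultdict

# Each rule is (probability, target_attribute, source_attribute); values are made up.
RULES = [
    (1.0, "title", "dc:title"),
    (0.8, "author", "dc:creator"),
    (0.3, "author", "dc:contributor"),
]


def map_document(facts, rules):
    """Map weighted source facts onto the target schema.

    facts: list of (source_attribute, value, probability).
    Returns {(target_attribute, value): probability}. A fact derived by several
    rules gets 1 - prod(1 - p_i), which assumes the derivations are independent.
    """
    miss = defaultdict(lambda: 1.0)  # probability that no rule derives the fact
    for p_rule, target, source in rules:
        for attr, value, p_fact in facts:
            if attr == source:
                miss[(target, value)] *= 1.0 - p_rule * p_fact
    return {key: 1.0 - m for key, m in miss.items()}


source_doc = [
    ("dc:title", "Digital Libraries", 1.0),
    ("dc:creator", "A. Author", 1.0),
    ("dc:contributor", "A. Author", 1.0),
    ("dc:contributor", "B. Editor", 0.5),
]
for (attribute, value), probability in map_document(source_doc, RULES).items():
    print(f"{probability:.2f} {attribute}({value!r})")
```

The same rule table could be applied to query conditions as well as to documents, which is why a declarative rule set is attractive for schema mapping.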

Data fusion refers to the process by which matching scores of documents returned by different resources are normalised with respect to each other, so that their relative relevance can be compared. Within this context, a model learning solution has been defined which is based on two separate steps: normalisation coefficient learning and normalisation coefficient approximation. Normalisation coefficient learning (NCL) is carried out once, separately for each resource. It requires an exchange of information between the resource and the data fusion server, which takes the form of a set of queries that the data fusion server sends to the resource. For each query, the resource returns a list of retrieved documents with associated matching scores. These are analysed by the data fusion server, which estimates the values of the normalisation coefficients that transform un-normalised resource matching scores into normalised matching scores. Normalisation coefficient approximation (NCA), on the other hand, is carried out only at retrieval time. Given a query and the results provided by a resource for that specific query, it normalises the matching scores associated with the retrieved results. This is achieved by using information extracted from that resource during NCL. Data fusion is thus achieved through a model learning approach: during NCL, the parameters of the model are learned; during NCA, the learned model is used to accomplish data fusion of new results.
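The two-step scheme can be sketched in Python as follows. The text does not say what form the normalisation model takes, so this sketch assumes a linear one (normalised = a * raw + b) fitted by least squares, and it assumes the data fusion server can compute its own reference score for each document returned during NCL. Function names and sample numbers are hypothetical.

```python
def learn_coefficients(pairs):
    """NCL step: least-squares fit of reference = a * raw + b.

    pairs: (raw_resource_score, reference_score) for documents returned to the
    learning queries; the reference scores are assumed to come from the server.
    """
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    if var_x == 0:
        raise ValueError("raw scores must not all be identical")
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    a = cov / var_x
    return a, mean_y - a * mean_x


def normalise(results, coefficients):
    """NCA step: apply the learned coefficients to new (doc_id, raw_score) results."""
    a, b = coefficients
    return [(doc_id, a * score + b) for doc_id, score in results]


def fuse(results_by_resource, coefficients_by_resource):
    """Merge normalised result lists into one ranking, best first."""
    merged = []
    for resource, results in results_by_resource.items():
        merged.extend(normalise(results, coefficients_by_resource[resource]))
    return sorted(merged, key=lambda item: item[1], reverse=True)


# Learned once per resource (NCL); the training pairs are made up.
coefficients = {
    "X": learn_coefficients([(0, 0.0), (50, 0.5), (100, 1.0)]),  # scores on a 0..100 scale
    "Y": learn_coefficients([(0.0, 0.1), (1.0, 0.6)]),           # scores on a 0..1 scale
}
# Applied at retrieval time (NCA).
print(fuse({"X": [("x1", 80)], "Y": [("y1", 0.9), ("y2", 1.0)]}, coefficients))
```

Because the coefficients are learned offline, the retrieval-time step needs only one multiplication and one addition per result.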

Digital archives (resources in MIND) are collections of documents. Documents are associated with some type of descriptor that represents their content. Several types of descriptors can be used for each media type. In general, each digital archive stores these descriptors in a proprietary format, which conforms to the specification and purpose of the specific library. In order to overcome this heterogeneity of representation, Resource Descriptors (RDs) are computed for each digital library. These RDs abstract the library content by extracting salient features, and organise and store them in a common format at the proxy level. In this way the system is provided with a summary description of the library it represents, and this information can be used for the tasks of querying, resource selection and data fusion. In particular, RDs should store the information about resource content that is necessary to fulfil dispatcher requests.

Representation of database content for image archives relies on the use of feature vectors for image content description. Following this approach, it is assumed that the content of a generic image is described through a feature vector whose entries represent relevant visual properties of the image. Representation of database content is carried out by sampling the database and organising the sampled items in a metric access structure. At the lowest resolution level (level-0), feature vectors of sampled items are clustered. Each cluster is represented through its central vector, called a routing vector. Each routing vector is associated with the following information:

- Maximum distance (cluster-Radius) between the routing vector and a generic vector belonging to its cluster;
- Number of vectors belonging to its cluster (cluster-Population).

At the next resolution level, all routing vectors are regarded as generic feature vectors: they are subject to a clustering process, the central vector of each cluster representing a level-1 routing object. The clustering process continues until a single cluster is selected.
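The multi-level structure of routing vectors can be sketched in Python as below. The text does not name the clustering algorithm or the distance function, so this sketch assumes a plain k-means with Euclidean distance, seeded with the first k vectors, and a hypothetical `branching` parameter that controls how many clusters each level gets. Only the routing vector, cluster-Radius and cluster-Population fields come from the description above.

```python
import math
from dataclasses import dataclass


@dataclass
class RoutingEntry:
    vector: tuple     # central (routing) vector of the cluster
    radius: float     # cluster-Radius: largest distance from the routing vector to a member
    population: int   # cluster-Population: number of vectors in the cluster


def centroid(vectors):
    return tuple(sum(column) / len(vectors) for column in zip(*vectors))


def kmeans(vectors, k, iterations=10):
    """Very small k-means; seeds with the first k vectors and drops empty clusters."""
    centres = list(vectors[:k])
    groups = []
    for _ in range(iterations):
        groups = [[] for _ in centres]
        for v in vectors:
            nearest = min(range(len(centres)), key=lambda i: math.dist(v, centres[i]))
            groups[nearest].append(v)
        groups = [g for g in groups if g]
        centres = [centroid(g) for g in groups]
    return groups


def build_descriptor(sample, branching=2):
    """Return a list of levels; levels[0] is level-0, the last level holds one cluster."""
    if not sample:
        raise ValueError("sample must not be empty")
    if branching < 2:
        raise ValueError("branching must be at least 2")
    levels = []
    current = [tuple(v) for v in sample]
    while True:
        k = max(1, math.ceil(len(current) / branching))
        entries = []
        for group in kmeans(current, k):
            centre = centroid(group)
            radius = max(math.dist(centre, v) for v in group)
            entries.append(RoutingEntry(centre, radius, len(group)))
        levels.append(entries)
        if len(entries) == 1:
            return levels
        # Routing vectors become the feature vectors of the next level.
        current = [entry.vector for entry in entries]


for depth, level in enumerate(build_descriptor([(0, 0), (10, 0), (0, 1), (10, 1)])):
    print(depth, level)
```

A query vector can then be compared against routing vectors and radii level by level, so that whole clusters are ruled out without inspecting their members.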

Information retrieval is becoming increasingly concerned with resource selection and data fusion for distributed archives. In distributed information retrieval, a user submits a query to a broker, which determines how to retrieve a given number of documents from the available resources. The MIND project has developed a multi-objective model for resource selection, in which four aspects are considered simultaneously: a document's relevance to the given query, time, monetary cost, and the chance of getting duplicate documents from different resources. Some variants of this multi-objective model, aimed at achieving better implementation efficiency, are also proposed.
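One way to picture a selection over four objectives at once is to enumerate candidate allocations and keep those that no other allocation beats on every objective (the Pareto front). The Python sketch below does exactly that by brute force. It is an illustration of multi-objective selection in general, not the model developed in MIND: the per-document estimates, the additive objective functions and the resource names are assumptions, and brute-force enumeration is only practical for very small inputs.

```python
from itertools import product

# Hypothetical per-document estimates for three resources.
RESOURCES = [
    {"name": "A", "relevance": 0.9, "time": 2, "price": 1, "duplicates": 0.0},
    {"name": "B", "relevance": 0.5, "time": 1, "price": 0, "duplicates": 0.0},
    {"name": "C", "relevance": 0.4, "time": 3, "price": 2, "duplicates": 0.5},
]


def objectives(allocation, resources):
    """Return (relevance, time, price, duplicates) for one allocation.

    allocation[i] is the number of documents requested from resources[i].
    Relevance is to be maximised; the other three are to be minimised.
    """
    return tuple(
        sum(n * r[key] for n, r in zip(allocation, resources))
        for key in ("relevance", "time", "price", "duplicates")
    )


def dominates(a, b):
    """True if objective tuple a is at least as good as b everywhere and differs from b."""
    better_or_equal = a[0] >= b[0] and all(x <= y for x, y in zip(a[1:], b[1:]))
    return better_or_equal and a != b


def pareto_front(resources, total):
    """All allocations of `total` documents that no other allocation dominates."""
    candidates = [
        alloc for alloc in product(range(total + 1), repeat=len(resources))
        if sum(alloc) == total
    ]
    scored = [(alloc, objectives(alloc, resources)) for alloc in candidates]
    return [
        (alloc, score) for alloc, score in scored
        if not any(dominates(other, score) for _, other in scored)
    ]


for allocation, score in pareto_front(RESOURCES, total=2):
    print(allocation, score)
```

A broker could then pick one allocation from the front according to user preferences, or collapse the four objectives into a single weighted cost when a unique answer is needed.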
