CORDIS - EU research results

Efficient and Effective Evaluation of Information Retrieval Systems

Final Report Summary - EFIREVAL (Efficient and Effective Evaluation of Information Retrieval Systems)

Information retrieval (IR) is the study of methods for organizing and searching large sets of heterogeneous, unstructured or semi-structured data. The growth of the Web and the subsequent need to organize and search vast amounts of information has demonstrated the importance of IR, while the success of web search engines has proven that IR can provide valuable tools to manage large amounts of data. Evaluation has played a critical role in the success of IR. The development of models, tools and methods has significantly benefited from the availability of test collections formed through a standardized and thoroughly tested methodology, known as the Cranfield model or paradigm. However, the methodology is nearly fifty years old and is beginning to show its age.

The profound success of web search at addressing everyday information needs suggests to the casual observer that search is a solved problem. However, this is far from the truth. The need for search arises from an increasingly broad range of challenging information environments, each requiring an IR system to support distinct types of search task, such as spam filtering, desktop search, scientific literature search, law and patent search, business intelligence, enterprise search, and expert search. In the past ten years, the IR research community has realised that different IR systems are needed for different search tasks. Even within specific areas, such as enterprise search, the content and needs of one enterprise are commonly so different from those of another that search engines must be tuned to each enterprise. Assessing the differences, creating new systems and tuning them only take place if testing materials (e.g. test collections) are to hand. Given the huge range of potential search environments and individual organisational differences, it is not unreasonable to estimate that thousands of test collections are currently needed by the IR research and commercial search communities. Currently, fewer than fifty are commonly used. This is because constructing large numbers of Cranfield-style collections is infeasible: the cost of building such collections is prohibitive. Even more problematic, the Cranfield methodology cannot capture complex retrieval environments or assess searchers' interactions with retrieval systems.

The end goal of this project is to overhaul the Cranfield methodology and provide a new evaluation paradigm that will overcome the aforementioned issues and lead to efficient, effective and reliable evaluation.

Thus, the primary objectives are to:

1. establish an efficient and reliable test collection construction methodology;
2. develop a framework to adapt existing test collections to new environments;
3. extend the current evaluation paradigm to incorporate richer information about the different retrieval scenarios and searchers' activities.

A test collection can be thought of as a sample from some hypothetical universe of possible documents and queries. Retrieval systems under test produce a ranking of the sample of documents for each one of the queries. Relevance labels are then acquired for a subset of the query-document pairs by human raters. In recent work (including our own) that has been broadly adopted by the research community, these query-document pairs constitute a stratified sample drawn from the different system rankings. The human raters can also be thought of as a sample from a user population, especially if they come from a large pool of workers, such as those found in a crowdsourcing environment. Evaluation measures summarize the relevance of a ranking by simulating a user's interactions with the ranking and accumulating the utility the user obtains through those interactions. Most modern evaluation measures are constructed with a parameterized probabilistic user model in mind: a user works their way down a ranked list of results and examines each document in the ranking with some probability. Different values for the parameters of the model can represent different subsets of the user population. As is common in statistical analysis of any experimental results, we assume that the sample we have is in some way representative of the universe, which we take to be large or infinite. The primary object of statistical analysis is to draw inferences about this universe.
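As an illustration (not part of the project's own tooling), rank-biased precision (RBP) is a widely used measure built on exactly such a user model: a persistence parameter p gives the probability that the user, having examined one result, continues to the next. A minimal sketch:

```python
def rbp(relevances, p=0.8):
    """Rank-biased precision: expected utility under a user model in which
    the user examines result i+1 with probability p after examining result i.
    `relevances` is the list of (binary or graded) relevance values in rank order."""
    return (1 - p) * sum(rel * p ** i for i, rel in enumerate(relevances))
```

With a small p the user rarely scans deep, so relevant documents at early ranks dominate the score; p close to 1 models a patient user who examines much of the list.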

Our first objective was to develop a framework that would allow us to decompose the variability in the results of an experiment into its different sources (e.g. how different the results would be if a different set of queries, or a different set of user-interaction parameters, had been used), and to make statistical inferences and comparisons between retrieval systems or algorithms. To study the different sources of variability we developed a number of methods to simulate sampling from a population of documents, queries and users. Bayesian methods applied to large user-system interaction logs summarized the variability of user interactions, while modeling score distributions allowed the simulation of document-query sampling. Mixed-effects model theory was then used to decompose the variance into its different components and allow accurate inferences about the universe of all documents, queries and users.
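To give a flavour of such a decomposition (a generic sketch, not the project's actual models), a fully crossed system-by-query matrix of effectiveness scores can be split into system, query, and residual variance components with a simple method-of-moments estimator, as in generalizability theory:

```python
import numpy as np

def variance_components(scores):
    """Method-of-moments decomposition of a system x query score matrix
    (fully crossed design) into system, query, and residual components."""
    s, q = scores.shape
    grand = scores.mean()
    sys_means = scores.mean(axis=1)
    qry_means = scores.mean(axis=0)
    # Mean squares for each factor.
    ms_sys = q * ((sys_means - grand) ** 2).sum() / (s - 1)
    ms_qry = s * ((qry_means - grand) ** 2).sum() / (q - 1)
    # Residual: what remains after removing additive system and query effects.
    resid = scores - sys_means[:, None] - qry_means[None, :] + grand
    ms_res = (resid ** 2).sum() / ((s - 1) * (q - 1))
    return {
        "system": max((ms_sys - ms_res) / q, 0.0),
        "query": max((ms_qry - ms_res) / s, 0.0),
        "residual": ms_res,
    }
```

A large query component relative to the system component suggests that observed differences between systems may owe more to the particular query sample than to the systems themselves, which is precisely the kind of inference the framework above is designed to support.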

Our second objective was to examine which characteristics of a test collection allow new, effective retrieval algorithms to be trained. If certain characteristics are significant for training, even across different retrieval environments, then one could ensure that training document collections always exhibit these characteristics, even if the application domain differs from the training domain. We discovered that (a) selecting query-document pairs to label with a particular methodology can lead to much more effective ranking functions than current practice; and (b) optimizing for a new evaluation measure we developed can also lead to more effective ranking functions than the state of the art.

Finally, our third objective was to extend the current evaluation paradigm by incorporating richer information about the searcher's activity. First, we developed two new test collections. The development of these collections was sponsored by the U.S. National Institute of Standards and Technology (NIST) through the Text REtrieval Conference (TREC) initiative. The two collections go beyond the single query-response paradigm, which tests how well a retrieval system responds on a per-query basis and is unrealistic because it does not reflect how people search in practice. Instead, our new approach allowed us to model entire sequences of user-system interactions with the retrieved results (called sessions), with the goal of developing retrieval systems that optimize performance across the entire user experience, not just a single query-response. We also developed measures that can capture the quality of retrieval systems over entire sessions.
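One published family of session-level measures, session DCG (sDCG), illustrates the idea: graded relevance is discounted both by rank within each result list and by the query's position in the session, so gain found late in a session counts for less. A sketch of one common variant (the discount bases b and bq are free parameters; other variants of the rank discount exist):

```python
import math

def sdcg(session, b=2, bq=4):
    """Session DCG sketch: `session` is a list of result lists, one per query,
    each a list of graded relevance values in rank order. Gain is discounted
    by 1 + log_b(rank) within a list and by 1 + log_bq(position) across queries."""
    total = 0.0
    for qpos, relevances in enumerate(session, start=1):
        qdisc = 1 + math.log(qpos, bq)  # later queries in the session count less
        for rank, rel in enumerate(relevances, start=1):
            rdisc = 1 + math.log(rank, b)  # deeper ranks count less
            total += rel / (qdisc * rdisc)
    return total
```

Lowering bq penalizes gain from later queries more heavily, modelling a user whose effort grows as the session drags on.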

The main results achieved are:

(a) the introduction of mixed-effects models in the information retrieval literature and experimental methodology;
(b) the construction of low-cost test collections and the study of their generalizability and reusability; and
(c) the development and dissemination of two large test collections for researchers to study how to develop systems that can optimize for the whole-session user experience.

The test collections we developed have been used by many research teams all over the world: 10 and 13 research teams participated in the session-based retrieval track we organized through TREC in 2010 and 2011, respectively. We organized workshops to discuss further research directions, and gave tutorials and summer school lectures to disseminate the newest developments to students and researchers. We published more than 15 conference and journal articles in the highest-quality information retrieval venues (8 in SIGIR, the most prominent IR conference in the discipline).