Information Extraction for Everyone

Periodic Reporting for period 3 - iEXTRACT (Information Extraction for Everyone)

Reporting period: 2022-05-01 to 2023-10-31

The project aims to expose advanced text-mining and language-understanding techniques so that they can be used by domain experts who do not have a background in computer science, programming, linguistics or machine learning.
This effort is two-pronged: on the one hand, building the right software and abstraction layers so that existing state-of-the-art techniques become available without requiring a technical background; on the other hand, identifying missing pieces in current language-processing and automatic text-understanding techniques that can and should be improved, and improving them.

Providing domain experts (e.g. experts in biology, chemistry, political science, etc.) with access to text-mining techniques is of paramount importance to society: the amount of text in the world, especially in the scientific and legal domains, is growing at an incredible pace, and it is currently nearly impossible for such experts to stay on top of the literature beyond their narrow field of expertise. Text-mining and automatic text-processing tools can help these scientists, researchers and policy-makers tame this volume of text and extract useful information from it.
Since the beginning of the project, we have created and launched a search system called SPIKE, which enables a new form of search that we call "extractive search". Rather than searching for and returning a set of documents, extractive search allows the user to specify and extract "information nuggets" from within each matched document, and to aggregate them. The tool has been used by NLP experts for various purposes, and we are now exploring its use with several non-CS academic groups.

The SPIKE tool is accessible at: https://spike.staging.apps.allenai.org/
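
To make the extractive-search idea concrete, the following sketch (not SPIKE's actual query language or API, just a hypothetical illustration) shows a query with a capture slot applied to individual sentences: instead of returning whole documents, the system returns the captured "nuggets" and an aggregate view over them.

```python
import re
from collections import Counter

# Hypothetical illustration of the extractive-search idea (not SPIKE's real
# query syntax): the query marks a capture slot, and the result is the set of
# captured "nuggets" plus an aggregation over all matches.

sentences = [
    "Water enters the plant through the roots.",
    "Nutrients enter the plant through the roots.",
    "Oxygen enters the leaf through the stomata.",
]

# Query with a capture slot: "<subject> enters the ... through the <nugget>"
pattern = re.compile(r"(?P<subject>\w+) enters? the \w+ through the (?P<nugget>\w+)")

extractions = []
for sent in sentences:
    match = pattern.search(sent)
    if match:
        extractions.append({"sentence": sent,
                            "subject": match.group("subject"),
                            "nugget": match.group("nugget")})

# Per-sentence extractions ...
for e in extractions:
    print(e["subject"], "->", e["nugget"])

# ... and an aggregate view over the captured nuggets.
print(Counter(e["nugget"] for e in extractions))
```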

Beyond the SPIKE tool and its innovative interface and query languages, we also identified a core area in which current language-understanding systems are lacking: the processing of implicit and underspecified information. For example, in a sentence like "Water enters the plant through the roots", it is not stated explicitly that "the roots" are "of the plant", yet human readers immediately infer this. Automatic systems, on the other hand, struggle with such inferences. We now lead a research line that formally defines the kinds of such implicit information and creates annotated training datasets covering these phenomena. The availability of the formal definitions and datasets will both allow us to develop better automatic language-understanding modules and encourage the wider research community to tackle this area.
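
As a rough illustration of what such an annotation might record (a hypothetical format, not the project's actual dataset schema), an example could spell out the implicit link that the sentence leaves unstated:

```python
# Hypothetical annotation record (not the project's actual dataset schema)
# making an implicit, recoverable relation explicit for training and
# evaluation: the unstated link between "the roots" and "the plant".

example = {
    "sentence": "Water enters the plant through the roots",
    "anchor": {"text": "the roots", "char_span": [31, 40]},
    "implicit_relation": "part-of",  # the roots are part of ...
    "recovered_argument": {"text": "the plant", "char_span": [13, 22]},
    "paraphrase_with_explicit_info":
        "Water enters the plant through the roots of the plant",
}
```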

Finally, over the past few years, a new family of modeling techniques (transformer-based pre-trained large language models) has emerged as very promising, and we are looking into its capabilities, its limitations, and how it can be used in the context of our project going forward.
We created the SPIKE extractive search system, which represents a new paradigm in search. The system is already showing utility in the hands of domain experts and is continually improving. We are now working on extending it with further capabilities, based on new advances in deep learning and NLP, and on improving the understanding of implicit language elements, which will also be integrated into the system.