Periodic Reporting for period 3 - iEXTRACT (Information Extraction for Everyone)
Reporting period: 2022-05-01 to 2023-10-31
This effort is two-pronged: on the one hand, building the right software and abstraction layers so that existing state-of-the-art techniques become usable without a technical background; on the other hand, identifying missing pieces in current language-processing and automatic text-understanding techniques that can and should be improved, and improving them.
Providing domain experts (in biology, chemistry, political science, and so on) with access to text-mining techniques is of paramount importance to society: the amount of text in the world, especially in the scientific and legal domains, is growing at an incredible pace, and it is currently nearly impossible for such experts to stay on top of the literature beyond their narrow field of expertise. Text-mining and automatic text-processing tools can help these scientists, researchers and policy-makers tame these amounts of text and extract useful information from them.
The SPIKE tool is accessible at: https://spike.staging.apps.allenai.org/
Beyond the SPIKE tool and its innovative interface and query languages, we also identified a core area in which current language-understanding systems are lacking: the processing of implicit and underspecified information. For example, in a sentence like "Water enters the plant through the roots", it is not stated that "the roots" are "of the plant", yet human readers immediately infer this, while automatic systems struggle to do so. We now lead a research line that formally defines kinds of such implicit information and creates annotated training datasets covering these phenomena. The availability of the formal definitions and datasets will both allow us to develop better automatic language-understanding modules and encourage the wider research community to tackle this area.
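To make the annotation effort concrete, the sketch below shows one way an implicit-information annotation could be represented in code. The schema, field names, and relation label are hypothetical illustrations, not the project's actual dataset format.

```python
from dataclasses import dataclass

@dataclass
class ImplicitArgument:
    """One implicit (unstated) relation a human reader infers from a sentence.

    This schema is a hypothetical illustration, not the project's
    actual dataset format.
    """
    sentence: str        # the original text
    anchor: str          # the phrase whose argument is left implicit
    relation: str        # the kind of implicit link (e.g. part-of)
    inferred_value: str  # what a human reader fills in

# The bridging inference from the example sentence above:
# "the roots" are understood to be "of the plant".
example = ImplicitArgument(
    sentence="Water enters the plant through the roots",
    anchor="the roots",
    relation="part-of",
    inferred_value="the plant",
)
```

Annotations of this shape pair naturally with the formal definitions: each record names the phenomenon involved and records the inference a system should be able to reproduce.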
Finally, over the past few years a new family of modeling techniques ("transformer-based pre-trained large language models") has emerged as very promising, and we are looking into its capabilities, its limitations, and how it can be used in the context of our project going forward.
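As a small illustration of how such capabilities can be probed, the sketch below asks a pre-trained masked language model to recover the implicit relation from the example sentence above. The use of the HuggingFace transformers library and the bert-base-uncased model are our assumptions for illustration, not a description of the project's actual experimental setup.

```python
# A minimal probing sketch, assuming the HuggingFace `transformers`
# library and the `bert-base-uncased` model (both illustrative choices,
# not the project's actual setup).
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# Ask the model to recover the implicit "of the plant" relation:
prompt = "Water enters the plant through the roots of the [MASK]."
for prediction in fill_mask(prompt, top_k=3):
    print(f"{prediction['token_str']!r}  (score={prediction['score']:.3f})")
```

Whether the model ranks a completion like "plant" highly in such prompts is exactly the kind of capability-and-limitation question we are examining.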