Periodic Reporting for period 4 - iEXTRACT (Information Extraction for Everyone)
Reporting period: 2023-11-01 to 2025-04-30
This effort is two-pronged: on the one hand, building the right software and abstraction layers so that existing state-of-the-art techniques become available without requiring a technical background, and on the other hand, identifying missing pieces in current language-processing and automatic text-understanding techniques that can and should be improved, and improving them.
Providing domain experts (e.g. experts in biology, chemistry, political science, etc.) access to text-mining techniques is of paramount importance to society: the amount of text in the world, especially in the scientific and legal domains, is growing at an incredible pace, and it is currently nearly impossible for such experts to stay on top of the literature beyond their narrow field of expertise. Text-mining and automatic text-processing tools can help these scientists, researchers and policy-makers tame these volumes of text and extract useful information from them.
When this project started, systems like ChatGPT were not available, and the initial phases developed systems that worked in the pre-GPT world. These systems proved to be very effective also in the post-GPT world, obtaining results which AI chatbots cannot match. However, users' expectations (including those of expert users) have changed regarding how "intelligent" systems can and should behave, and especially how they as users should interact with these systems: while pre-GPT users were willing to invest in learning to operate specialized systems to solve their needs, post-GPT they expect to communicate freely via natural language and have the system "understand what they want and just work". GPT-like systems allow us to do much more in terms of language understanding and text-mining, but users have also evolved to demand much more, often more than what the technology can provide.
Thus, in the second half of the project duration, we adapted our objectives to investigating how we can accommodate the initial objective (exposing advanced text-mining techniques to domain experts) in a post-GPT world: how can we most effectively use the new technology to bridge the gap between general-purpose language-based AI systems and large, specialized text collections, while remaining accurate and trustworthy.
The SPIKE tool is accessible at: https://spike.staging.apps.allenai.org/
The SPIKE system was used by researchers in non-CS domains to curate specialized datasets in plant sciences, nanotechnology-based cancer treatment, multi-drug personalized cancer treatment, and uncovering the long tail of medical etiologies ("which medical conditions can cause this set of symptoms"). For the last item (medical etiologies), we specifically measured how the responses obtained by the SPIKE-based system compare to results obtained by advanced AI systems (GPT). We show that the coverage obtained by the system is significantly larger, including a large set of etiologies not found by GPT.
However, the availability of GPT-like systems changed users' expectations, and they became reluctant to learn a system like SPIKE. We therefore shifted our efforts towards investigating how to make the best use of large language models like GPT and the related set of technologies that come with them, such as semantic search. We investigated new ways to perform search, to be integrated into a SPIKE-like system, for example searching texts by descriptions of the entities contained in them (e.g. "Find me paragraphs in this collection that mention a lung disease together with a possible treatment", a query which is very hard to perform with traditional search engines). Here, we created a new search task and a benchmark for it, as well as text-retrieval models for performing the task. We also developed systems and methods that use large language models to read in, analyze and organize large text collections (for example, taking a thousand research papers on a topic, or all the papers published at some conference, and organizing them into a navigable topical hierarchy which is inferred from the data).
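To illustrate the flavor of description-based search described above, the following is a minimal sketch, not the project's actual benchmark or retrieval models, showing how a free-text description of the entities a paragraph should contain can be matched against a corpus with an off-the-shelf embedding model (the model name, library and toy paragraphs are illustrative assumptions).

```python
# Illustrative sketch only: description-based semantic search with an
# off-the-shelf sentence-embedding model (not the project's retrieval model).
from sentence_transformers import SentenceTransformer, util

# Hypothetical toy corpus standing in for a large scientific text collection.
paragraphs = [
    "Idiopathic pulmonary fibrosis progression was slowed by pirfenidone in the trial.",
    "The enzyme catalyzes the hydrolysis of cellulose into glucose monomers.",
    "Inhaled corticosteroids reduced exacerbation rates in patients with severe asthma.",
]

# Small general-purpose embedding model (an assumption, chosen for brevity).
model = SentenceTransformer("all-MiniLM-L6-v2")

# The query describes the *kind* of entities we want ("a lung disease together
# with a possible treatment") rather than naming any disease or drug explicitly.
query = "a paragraph mentioning a lung disease together with a possible treatment"

query_emb = model.encode(query, convert_to_tensor=True)
corpus_emb = model.encode(paragraphs, convert_to_tensor=True)

# Rank paragraphs by cosine similarity to the description query.
hits = util.semantic_search(query_emb, corpus_emb, top_k=3)[0]
for hit in hits:
    print(f"{hit['score']:.3f}  {paragraphs[hit['corpus_id']]}")
```

In this toy setting, the two paragraphs pairing a lung disease with a treatment should rank above the unrelated one, which is the behavior a keyword-based engine struggles to reproduce for such queries.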
In addition to these results, and in the process of working towards them, we also investigated the boundaries of what large language models can and cannot achieve reliably. For example, we studied the amount of text they can understand and use effectively in a single call, and investigated questions such as how well large language models integrate internal and external knowledge. These studies established that the promise of these models is considerably greater than what they currently deliver. This also ties into our work on establishing a framework for human-AI interaction, in which we establish what it means for a user to trust an AI system, what makes a system trustworthy, and the requirements that need to be met to achieve such systems.