Domain-centric Intelligent Automated Data Extraction Methodology

Final Report Summary - DIADEM (Domain-centric Intelligent Automated Data Extraction Methodology)

It is the age of big data ... Google, Amazon, Twitter, Walmart, all drive their business more and more through huge amounts of data about their customers, their competitors, or the market in general. But who has the ability to collect this kind of data apart from big, tech-savvy businesses? Isn’t it just another way for the established players to increase their competitive advantage?
We believe that it doesn’t have to be that way: You think you have an idea how to better match people and jobs? Or how to answer “what’s the best Italian restaurant in Oxford?” But you don’t know where to get the necessary data on current job offers, restaurant menus, or other product offers. That’s where DIADEM comes in: We are building a fully automated extraction engine that can quickly extract a highly structured database of all offers or goods you are interested in, whether they come from a few websites or hundreds of them. Such a database enables better search, better recommenders, and better analytics. Extracting this data from retailers all over the web not only benefits users by giving them a complete view of the market, but also avoids placing ever more power in the hands of a few monopolistic marketplaces. Rather than paying for access to these marketplaces, retailers can continue to publish their offers on their own websites, and DIADEM will automatically pick them up.
What Freebase or DBpedia are to Wikipedia, DIADEM is to the web of products, events, and other dynamic objects: Wikipedia has grown to be the most extensive source of common-sense knowledge (“who is the 54th prime minister of Italy”), and DBpedia and Freebase provide this information in a structured form for automatic processing. DIADEM does the same for the web of objects, turning hundreds of thousands of web sites containing billions of offers into a structured, searchable database. But how does DIADEM go about extracting this data from such a wide variety of web pages? Metadata (schemas, ontologies, or sample instances) is key to object extraction and search. This has always been a credo in web data extraction, yet it has never been tested, because manual collection of such metadata has been seen as prohibitively expensive. This has changed with DIADEM: We have developed an object extraction system that exploits extensive metadata about the relevant objects (in the form of both a schema and sample instances). With this approach we outperform existing semi-supervised and unsupervised approaches by a wide margin (more than 95% accuracy across a wide range of domains and sites) on all relevant tasks in fully automated data extraction: exploration, form understanding and filling, and analysis of the objects to be extracted.
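To make the idea of metadata-driven extraction more concrete, the following is a minimal toy sketch in Python of how a schema and a handful of sample instances might be used to label text snippets from an offer page. It is not DIADEM's implementation: the attribute names, the regular expressions, the sample values, and the functions `annotate` and `extract_record` are all assumptions made up for this illustration.

```python
import re

# Hypothetical schema for a real-estate offer: each attribute is paired
# with a pattern describing what its values look like (an assumption).
SCHEMA = {
    "price": re.compile(r"£\s?\d[\d,]*"),
    "bedrooms": re.compile(r"\b(\d+)\s+bed(room)?s?\b", re.IGNORECASE),
    "postcode": re.compile(r"\b[A-Z]{1,2}\d{1,2}\s?\d[A-Z]{2}\b"),
}

# Hypothetical sample instances: known values for attributes that are
# easier to recognise by lookup than by pattern.
SAMPLES = {
    "location": {"oxford", "summertown", "headington"},
    "property_type": {"flat", "house", "bungalow"},
}


def annotate(snippet):
    """Return {attribute: value} for every attribute recognised in a snippet."""
    found = {}
    # Schema-based recognition: match the value patterns.
    for attr, pattern in SCHEMA.items():
        match = pattern.search(snippet)
        if match:
            found[attr] = match.group(0)
    # Instance-based recognition: look words up in the sample values.
    words = re.findall(r"[a-z]+", snippet.lower())
    for attr, values in SAMPLES.items():
        for word in words:
            if word in values:
                found[attr] = word
                break
    return found


def extract_record(snippets):
    """Merge the annotations of all snippets into one record (first value wins)."""
    record = {}
    for snippet in snippets:
        for attr, value in annotate(snippet).items():
            record.setdefault(attr, value)
    return record


# Text snippets as they might appear on a single (invented) offer page.
snippets = ["3 bedroom house in Headington", "£425,000", "OX3 7AB"]
record = extract_record(snippets)
print(record)
```

The sketch only covers the labelling of values on a single page. The other tasks named above (exploration, form understanding and filling) are outside its scope, and a real system would also need to decide which page regions belong to the same object.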
DIADEM has achieved this combination of unprecedented automation and accuracy in a broad set of domains, from real estate through restaurants to electronics, and in multiple countries. It has been evaluated extensively by external evaluators under the auspices of a major US technology company, who have fully confirmed the accuracy and scalability of the approach.
This achievement was made possible by a combination of many individual scientific breakthroughs, including an extraction language, called OXPath, that is easy to use and requires significantly fewer resources than previous approaches, and an extraction engine driven by a novel, integrated knowledge base of facts about web pages and a novel, highly expressive, yet tractable reasoning paradigm based on Datalog±, the ontological knowledge representation language developed in DIADEM. These achievements have been published in top international journals and conferences and have been acknowledged by several awards and research grants from major Internet companies. In recent months, DIADEM has proven to be highly adaptable to new domains. In less than one month, a three-person team was able to set up the system to extract the locations of most US restaurants from hundreds of thousands of sources. Again, the data was highly accurate and contained fewer errors than a crowd-sourced comparison dataset.
The future is bright for DIADEM: Wrapidity Ltd, a recent spin-out of Oxford University, will bring this technology into commercial applications together with strong industry partners from many countries and sectors. At Oxford University, research on further automating data acquisition will continue, now with a focus on the entire pipeline, from data extraction through data cleaning and integration to quality assessment. If each component of the pipeline provides some "context" about its view of the data to the other components, each component can automatically adjust itself to produce even better data over time. This research will take place as part of the €7M+ VADA research grant led by DIADEM.