EXtreme-scale Analytics via Multimodal Ontology Discovery & Enhancement

Periodic Reporting for period 1 - EXA MODE (EXtreme-scale Analytics via Multimodal Ontology Discovery & Enhancement)

Reporting period: 2019-01-01 to 2020-06-30

Exascale volumes of heterogeneous, multimodal, structured and unstructured data are produced every day in many different domains. Healthcare data production is constantly increasing (2’000 exabytes/year in 2020), and these data stand out for their size, the knowledge they contain and their potential commercial value. Several obstacles prevent the extraction of knowledge and value from clinical data, such as:
• the difficulty of obtaining large cohorts of data annotated by specialized medical doctors;
• the heterogeneity of the data, due to the variability of clinical data acquisition workflows;
• the variability of diagnostic approaches.
ExaMode aims to solve the problems mentioned above by enabling easy, fast, weakly supervised knowledge discovery on exascale heterogeneous data, even in highly specific domains such as the medical sciences. The weakly supervised knowledge discovery system is designed to reduce the expert interaction required for data annotation, to support new methods for handling data heterogeneity, and to link text concepts with visual patterns. The overall ExaMode objective is pursued via three measurable objectives towards which the consortium is advancing:
1. Weakly-supervised knowledge discovery for exascale medical data.
2. Development of extreme scale analytics tools for heterogeneous, multimodal and multimedia exascale data.
3. Adoption of extreme-scale analysis and prediction tools by healthcare and industry decision-makers.
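To illustrate the weakly supervised setting targeted by objective 1, the sketch below (a hypothetical toy example, not project code) trains a classifier from slide-level labels only, with no patch annotations. It uses embedding-level multiple-instance learning: each slide ("bag") is represented by the mean of its patch features, and logistic regression is fitted on the bag-level diagnosis labels. All function and variable names are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_weakly_supervised(bags, labels, lr=1.0, epochs=500):
    """Fit a logistic regression using only bag-level (slide-level) labels.

    bags   : list of (n_patches, d) arrays of patch features
    labels : 0/1 diagnosis label per bag (no per-patch annotation needed)
    Patches are pooled by their mean (embedding-level MIL).
    """
    X = np.stack([bag.mean(axis=0) for bag in bags])  # (n_bags, d)
    y = np.asarray(labels, dtype=float)
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        p = sigmoid(X @ w + b)
        g = p - y                      # gradient of the log-loss
        w -= lr * X.T @ g / len(y)
        b -= lr * g.mean()
    return w, b

def predict_bag(bag, w, b):
    """Probability that a whole slide (bag of patches) is positive."""
    return float(sigmoid(bag.mean(axis=0) @ w + b))
```

Mean pooling is only one pooling choice; max pooling or attention pooling are common alternatives when a single small lesion should drive the slide-level prediction.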
Thanks to the fruitful collaboration of the partners involved, ExaMode advanced rapidly during its first 18 months, achieving the planned results (including the multimodal data, publicly available software and product prototypes).
The ExaMode consortium comprises seven partners that collaborate actively and harmoniously towards the project objectives by sharing ideas, plans and resources. The partners can be divided into three groups plus a provider of High-Performance Computing (HPC) competences and resources. Clinical partners (Azienda Ospedaliera per le Emergenze Cannizzaro - IT, and Radboud University Medical Center - NL) focused on providing clinical data and annotations, on dealing with ethics and privacy requirements, and on providing clinical knowledge (to guide the academic partners in research and the companies in developing useful products). Academic partners (University of Applied Sciences Western Switzerland - CH, University of Padova - IT, and Radboud University Medical Center - NL) worked in close collaboration on tools for knowledge discovery and utilization (for instance, to extract text entities from diagnostic reports and scientific literature in several languages, to reduce the variability between images recorded in different centers, or to learn visual representations of concepts from few associated samples). Industrial partners (MicroscopeIT - PL, and SIRMA AI - BG) focused on understanding the needs of the clinical partners, on ideating new products (based on clinical requirements and on the research by the academic partners), and on defining and developing early prototypes. The HPC partner (SURFSARA - NL) focused on providing the consortium with access to cloud, storage and supercomputing infrastructures, and on actively collaborating in the development of the research tools and product prototypes.
The work performed by the consortium during the first 18 months materialized in numerous technical achievements and valuable scientific publications, providing solid foundations for the future of the project and its exploitation.
The main technical achievements obtained by ExaMode can be summarized as follows:
1. the ExaMode multimodal datasets & annotations;
2. the multimodal knowledge discovery tools;
3. the multimodal knowledge exploitation product prototypes.
The datasets include over 30'000 images and associated text reports, targeting the four ExaMode use cases: colon cancer, uterine cervix cancer, lung cancer and celiac disease. Over 10'000 images are high-resolution microscopy images from clinical practice, associated with anonymized clinical reports; the remaining data are selected from publicly available sources and from the scientific biomedical literature. The multimodal knowledge discovery tools include software to relate text entities with ontologies, a new ontology for colon cancer, early tools to handle color and scale variability and to separate multi-panel images, a tool to safely and automatically transfer data from hospitals to data centers, algorithms for semantic segmentation and detection in histopathology images, and self-attention models together with weakly supervised and active learning workflows. Publicly available code is listed in a dedicated section of the project website.
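As a minimal illustration of relating text entities with ontologies, the sketch below links mentions in a pathology report to concept identifiers by normalized n-gram matching against concept labels and synonyms. The ontology fragment, the `exa:` identifiers, and all names are hypothetical examples, not the actual ExaMode ontology or software.

```python
import re

# Toy ontology fragment (hypothetical): concept ID -> label and synonyms.
ONTOLOGY = {
    "exa:Adenocarcinoma": ["adenocarcinoma"],
    "exa:TubularAdenoma": ["tubular adenoma", "adenoma, tubular"],
    "exa:HighGradeDysplasia": ["high grade dysplasia", "high-grade dysplasia"],
}

def normalize(text):
    # Lowercase and strip punctuation so "high-grade" matches "high grade".
    return re.sub(r"[^a-z0-9 ]", " ", text.lower()).split()

def build_index(ontology):
    # Map each normalized label/synonym (as a token tuple) to its concept ID.
    index = {}
    for concept_id, labels in ontology.items():
        for label in labels:
            index[tuple(normalize(label))] = concept_id
    return index

def link_entities(report, ontology=ONTOLOGY, max_len=4):
    """Return (mention, concept ID) pairs found in the report text."""
    index = build_index(ontology)
    tokens = normalize(report)
    hits = []
    for n in range(max_len, 0, -1):           # scan n-grams, longest first
        for i in range(len(tokens) - n + 1):
            key = tuple(tokens[i:i + n])
            if key in index:
                hits.append((" ".join(key), index[key]))
    return hits
```

A production system would add abbreviation expansion, fuzzy matching and multilingual support, but the lookup structure is the same idea.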
Two main multimodal knowledge exploitation prototypes were developed. VIRTUM ALBUM, by MicroscopeIT, is a cloud-based software product to manage, store and annotate histopathological slide collections, targeting individual and institutional users. The prototype was tested by pathologists and proved a useful tool for advancing the project.
SIRMA AI developed an open-access demonstrator of the main components and services of the adaptive system prototype, which also integrates the colon ontologies developed within ExaMode, and released a video recording of the demo. The consortium produced a large number of scientific publications (5 journal papers and 18 conference papers published, 5 more accepted, 9 submitted).
The difficulty of obtaining data annotations from specialists and the variability of acquisition workflows and diagnostic approaches prevent companies from releasing ground-breaking products and medical doctors from seeing their workload reduced. Moreover, due to their size, sharing, managing, accessing and transferring medical data require extremely efficient methods; novel approaches are thus needed for their efficient management.
ExaMode addresses these problems by developing a knowledge discovery system to extract, link and retrieve multimodal information from highly heterogeneous and unstructured data, in this case medical data. During the project, the consortium is developing a whole set of new methods and concepts for extreme-scale analytics. Open-source software is being created, improved, released and tested to perform knowledge discovery in exascale digital pathology data.
The potential impact of the project includes new methods to train deep neural networks on heterogeneous data faster, and systems that can handle extreme scales of multimodal data with less effort, increasing data throughput and access speed. ExaMode aims at the adoption of its results in industry (starting with the project partners), leading to a potential positive impact on society (starting with the application of the prediction tools in decision making by the clinical partners).
The weakly supervised knowledge discovery system and the multimodal and multimedia ontologies can empower decision-support tools for digital pathology (a large and growing market, expected to be worth USD 760 million in 2024) and scientific research, and can be applied to other diagnostic imaging sectors, allowing the EU to reach a leadership position in these fields.