
Probabilistic modelling of electronic health records

Periodic Reporting for period 2 - PMOHR (Probabilistic modelling of electronic health records)

Reporting period: 2018-10-01 to 2019-09-30

Machine learning is a field of computer science that aims to discover statistical patterns in large datasets and to make predictions using these patterns. While there is growing interest in methods that can provide accurate predictions on unseen data, comparatively less effort is invested in designing interpretable machine learning models. Interpretable models have the benefit of providing human-understandable patterns that can be interpreted by domain experts. This enables knowledge discovery in the scientific disciplines, and it also allows reasoning about causality and making counterfactual predictions.

In probabilistic machine learning, we first encode our assumptions about the data structure in the form of a model that has latent variables, which represent the hidden patterns. We then learn the latent variables using an inference algorithm. The results of the inference procedure can be used to make predictions and explore the data collection. Crucially, we can impose an interpretable structure in the model design phase. Finally, in probabilistic machine learning we can also apply model testing approaches to verify the properties of the posited model and to check whether it fails to capture certain properties of the data.
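
As a minimal illustration of this pipeline, the sketch below posits a Beta-Bernoulli model, in which a latent probability `theta` plays the role of the hidden pattern, infers its posterior in closed form, and uses the result to predict a new observation. The model, the function names and the toy data are assumptions made for illustration only; they are not taken from PMOHR.

```python
def infer_posterior(observations, prior_a=1.0, prior_b=1.0):
    """Inference step: exact posterior Beta(a, b) over the latent probability theta.

    Model (assumed for illustration): theta ~ Beta(prior_a, prior_b),
    and each binary observation x ~ Bernoulli(theta).
    """
    successes = sum(observations)
    failures = len(observations) - successes
    return prior_a + successes, prior_b + failures


def predict_next(posterior_a, posterior_b):
    """Prediction step: posterior predictive probability that the next x equals 1."""
    return posterior_a / (posterior_a + posterior_b)


# Hypothetical toy data: 1 means a diagnosis code is present in a visit.
visits = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
a, b = infer_posterior(visits)
print(f"posterior: Beta({a}, {b}), predictive probability: {predict_next(a, b):.3f}")
```

Because the prior is conjugate, inference here is exact; the models described in this report generally require approximate inference instead.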

PMOHR is an interdisciplinary project focused on the design of interpretable models through probabilistic machine learning, with the ultimate goal of modelling Electronic Health Records (EHRs). Applying machine learning tools to EHR data can not only help design clinical support systems, but may also help uncover unknown patterns in the data and even form causal theories.

However, medical data, and EHRs in particular, present several challenges that prevent us from applying existing probabilistic modelling tools directly, because the datasets are large and heterogeneous. The objectives of PMOHR are to develop both probabilistic models and inference algorithms that are suitable for modelling EHR data. This new set of tools can then be applied to make predictions and to analyse medical datasets. In particular, PMOHR uses both publicly available EHR data and EHR data from the New York Presbyterian Hospital.
PMOHR comprises the three main steps of the probabilistic modelling pipeline: model design, scalable inference algorithms, and model testing. The project has achieved a number of results towards its objectives. These results are described below.

In terms of modelling, PMOHR has achieved two main results. The first one is a new class of models, called exponential family embeddings, which can capture co-occurrence patterns of objects in a dataset. When applied to EHRs, exponential family embeddings can capture how medical diagnoses relate to each other and can also find meaningful features of the diagnoses, which are useful for downstream analyses. Exponential family embeddings can be applied to high-dimensional discrete data, such as medical text or medical diagnoses, as well as to real-valued data, such as neural activity.
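
To make the idea concrete, the sketch below computes the conditional probability used by a Bernoulli embedding, one member of the exponential family embedding class. It assumes the formulation in which the probability that a diagnosis is present, given the other diagnoses in the same visit, is the logistic function of the inner product between the diagnosis's embedding vector and the sum of the context vectors of the co-occurring diagnoses. The diagnosis names and the two-dimensional vectors are hypothetical values chosen by hand, not parameters learned from EHR data.

```python
import math


def sigmoid(z):
    """Logistic function, mapping a natural parameter to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def presence_probability(target, context, embedding, context_vectors):
    """P(target is present | the diagnoses in `context` are present)."""
    dims = len(embedding[target])
    # Sum the context vectors of the co-occurring diagnoses.
    context_sum = [sum(context_vectors[c][d] for c in context) for d in range(dims)]
    # Natural parameter: inner product of the target embedding and the context sum.
    eta = sum(embedding[target][d] * context_sum[d] for d in range(dims))
    return sigmoid(eta)


# Hypothetical hand-picked vectors (a trained model would learn these).
embedding = {"asthma": [1.5, 0.0], "bronchitis": [1.2, 0.1], "fracture": [0.0, 1.4]}
context_vectors = {"asthma": [1.0, 0.0], "bronchitis": [1.1, 0.0], "fracture": [0.0, 1.0]}

p_resp = presence_probability("asthma", ["bronchitis"], embedding, context_vectors)
p_other = presence_probability("asthma", ["fracture"], embedding, context_vectors)
print(f"P(asthma | bronchitis) = {p_resp:.3f}, P(asthma | fracture) = {p_other:.3f}")
```

Diagnoses whose vectors point in similar directions raise each other's conditional probability, which is how co-occurrence structure ends up encoded in the embedding space.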

See the attached figures for an example of the results of exponential family embeddings when applied to the diagnoses in the publicly available MIMIC-III dataset. The figures show a projection of the features found by the model. In the first figure, each dot corresponds to a disease, and colours indicate the categorisation of the diseases. Exponential family embeddings identify clusters of similar diseases based on their co-occurrence patterns, even though the ontology information was not provided to the algorithm. The second figure zooms in on a particular region of the first figure. Exponential family embeddings find a cluster of related respiratory conditions, even when they belong to several groups. The embedding representation found by the model can be used both for making predictions and as features for further analyses.

The second modelling result is a new model for the analysis of longitudinal datasets, based on Bayesian non-parametric techniques. This model allows us to model the dynamics of time series (such as the latent health of a patient over time). Importantly, the complexity of the model adapts automatically to the available data, following the structure posited in the model design phase.
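
This summary does not specify which Bayesian non-parametric construction is used. As a generic illustration of how such priors let model complexity grow with the data, the sketch below uses the Chinese restaurant process (a swapped-in textbook example, not necessarily the project's model) and computes the expected number of clusters after `n` observations. The concentration parameter `alpha` and the sample sizes are arbitrary assumptions.

```python
def expected_clusters(n, alpha=1.0):
    """Expected number of clusters under a Chinese restaurant process prior.

    Observation i + 1 starts a new cluster with probability alpha / (alpha + i),
    so the expectation is the sum of these probabilities.
    """
    return sum(alpha / (alpha + i) for i in range(n))


# The expected number of clusters keeps growing, slowly, as more data arrive.
for n in (10, 100, 1000):
    print(n, round(expected_clusters(n), 2))
```

The number of clusters is not fixed in advance: it grows roughly logarithmically with the number of observations, so larger datasets can support richer structure.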

In terms of inference, PMOHR has extended state-of-the-art inference algorithms to obtain faster and more scalable algorithms that are suitable for virtually any probabilistic model. This is crucial for inference to scale to complex models and large EHR datasets. The developed inference methods are black-box, in the sense that they are general-purpose algorithms that can be applied off-the-shelf.
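
As a rough illustration of what "black-box" means here, the sketch below fits a Gaussian approximation to a posterior using only evaluations of the model's log joint density, so the same routine could be reused with a different `log_joint`. It uses reparameterisation gradients with a finite-difference derivative and a toy conjugate Gaussian model; these choices, along with all names, step sizes and data, are illustrative assumptions and not the algorithms developed in PMOHR.

```python
import math
import random


def log_joint(theta, data):
    """Unnormalised log p(data, theta) for theta ~ N(0, 1), x ~ N(theta, 1)."""
    log_prior = -0.5 * theta ** 2
    log_lik = sum(-0.5 * (x - theta) ** 2 for x in data)
    return log_prior + log_lik


def fit_gaussian_vi(log_p, data, steps=3000, samples=10, lr=0.02, seed=0):
    """Fit q(theta) = N(mean, std^2) by stochastic gradient ascent on the ELBO.

    The model is touched only through calls to `log_p`, which is what makes
    the routine black-box.
    """
    rng = random.Random(seed)
    mean, log_std = 0.0, 0.0
    h = 1e-4  # finite-difference step for d log_p / d theta
    for _ in range(steps):
        std = math.exp(log_std)
        grad_mean, grad_log_std = 0.0, 0.0
        for _ in range(samples):
            eps = rng.gauss(0.0, 1.0)
            theta = mean + std * eps  # reparameterised sample from q
            dlogp = (log_p(theta + h, data) - log_p(theta - h, data)) / (2 * h)
            grad_mean += dlogp / samples
            grad_log_std += dlogp * std * eps / samples
        grad_log_std += 1.0  # gradient of the Gaussian entropy w.r.t. log_std
        mean += lr * grad_mean
        log_std += lr * grad_log_std
    return mean, math.exp(log_std)


# Hypothetical toy data.
data = [1.2, 0.8, 1.1, 0.9, 1.0]
post_mean, post_std = fit_gaussian_vi(log_joint, data)
print(f"approximate posterior: mean={post_mean:.2f}, std={post_std:.2f}")
```

For this toy model the exact posterior is Gaussian with mean 5/6 and standard deviation sqrt(1/6), so the estimates can be checked against a known answer.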

In terms of model testing, PMOHR has developed a novel method for testing probabilistic causal models. The analysis of EHRs requires positing causal theories, many of which are fundamentally not testable. However, there are some other modelling assumptions for which the model testing phase can still provide insights about possible model mismatch. PMOHR extends the tools of Bayesian model testing to study the validity of probabilistic causal models.
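
One standard tool of Bayesian model testing is the posterior predictive check: simulate replicated datasets from the fitted model and compare a summary statistic with its observed value. The sketch below applies it to a deliberately simple, non-causal Gamma-Poisson model with hypothetical count data; it illustrates the generic idea only and is not the causal-model test developed in PMOHR.

```python
import math
import random


def sample_poisson(rate, rng):
    """Draw a Poisson variate with Knuth's multiplication method (fine for small rates)."""
    threshold = math.exp(-rate)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def variance(values):
    """Population variance, used here as the test statistic."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def posterior_predictive_p_value(data, draws=2000, prior_shape=1.0, prior_rate=1.0, seed=0):
    """Fraction of replicated datasets whose variance is at least the observed one.

    Assumed model: rate ~ Gamma(prior_shape, prior_rate), counts ~ Poisson(rate).
    """
    rng = random.Random(seed)
    shape = prior_shape + sum(data)  # conjugate posterior shape
    rate = prior_rate + len(data)    # conjugate posterior rate
    observed = variance(data)
    exceed = 0
    for _ in range(draws):
        lam = rng.gammavariate(shape, 1.0 / rate)  # second argument is the scale
        replicated = [sample_poisson(lam, rng) for _ in data]
        if variance(replicated) >= observed:
            exceed += 1
    return exceed / draws


# Hypothetical overdispersed counts that a single Poisson rate cannot explain well.
counts = [0, 0, 0, 0, 0, 0, 0, 0, 10, 12]
p_value = posterior_predictive_p_value(counts)
print(f"posterior predictive p-value for the variance: {p_value:.3f}")
# A p-value near 0 or 1 signals that the model fails to reproduce this statistic.
```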
So far, PMOHR has arguably advanced the state of the art of probabilistic modelling through the development of general complex models that can handle different types of data. Moreover, PMOHR has also advanced the field of Bayesian inference through the development of algorithms that speed up existing methods. It has also contributed to the field of probabilistic causal modelling through the development of scalable models and testing tools.

The long-term goal of PMOHR involves improving health systems through the design of clinical support systems. Health systems play a central role in modern societies. Besides being a value in itself, health is also a precondition for economic prosperity: people's health influences economic outcomes in terms of productivity, labour supply, human capital and public spending. Furthermore, the healthcare sector has major economic significance, since it represents 10% of the EU's GDP and healthcare personnel account for nearly 10% of all jobs in the EU. In this context, personalised medicine and clinical support systems have the potential to improve overall health while providing significant savings. Given the typical time lag between research and deployment of a final product, we believe that PMOHR is a timely project and that its advances bring us one step closer to these long-term goals, since it will contribute to machine learning and the statistical sciences in the short term, and to healthcare and the economy in the long term.
plot2.jpg
plot1.jpg