Periodic Reporting for period 1 - DeepGeo (Deep Gaussian Processes for Geostatistical Data Analysis)
Reporting period: 2018-07-23 to 2020-07-22
During the project, we showed how to push current state-of-the-art methods to their full potential. We also developed new methods that better understand when they are uncertain about a prediction, as well as methods that can discover new types of pollution by exploiting correlations with known types of pollution. Unfortunately, we also discovered that soil pollution cannot be modelled accurately with the soil samples being collected today under current regulatory requirements: the samples are taken too far apart, leaving blind spots between them in which pollution hotspots can go undetected.
We also discovered that most AI methods assume that the input we give them is precisely known. This is not always the case - for instance, we may not know exactly where on an old industrial site a soil sample was collected - and not taking this uncertainty into account can lead to overly confident predictions of pollution. By developing a new way to include this uncertainty, we created a new method that has a much better understanding of how confident it should be about a certain prediction of pollution.
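As an illustration of why input uncertainty matters, the sketch below marginalises an uncertain sample location out of an ordinary Gaussian-process prediction by simple Monte Carlo. It is a minimal toy example, assuming a one-dimensional transect and a squared-exponential kernel, and is not the project's actual method:

```python
import numpy as np

def rbf(a, b, lengthscale=1.0):
    """Squared-exponential kernel between 1-D location arrays a and b."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / lengthscale) ** 2)

def gp_predict(x_train, y_train, x_test, noise=1e-4):
    """Standard GP posterior mean and variance at x_test."""
    K = rbf(x_train, x_train) + noise * np.eye(len(x_train))
    Ks = rbf(x_train, x_test)
    mean = Ks.T @ np.linalg.solve(K, y_train)
    var = np.diag(rbf(x_test, x_test) - Ks.T @ np.linalg.solve(K, Ks))
    return mean, var

def gp_predict_uncertain_input(x_train, y_train, x_loc, x_std,
                               n_samples=2000, seed=0):
    """Marginalise over an uncertain input location x ~ N(x_loc, x_std^2)
    by Monte Carlo -- a crude stand-in for an analytic treatment."""
    rng = np.random.default_rng(seed)
    xs = rng.normal(x_loc, x_std, size=n_samples)
    means, vars_ = gp_predict(x_train, y_train, xs)
    # Law of total variance: E[var(y|x)] + var(E[y|x])
    return means.mean(), vars_.mean() + means.var()

# Toy data: a pollutant concentration varying along a transect
x = np.linspace(0, 5, 8)
y = np.sin(x)

_, v_exact = gp_predict(x, y, np.array([2.5]))
_, v_unc = gp_predict_uncertain_input(x, y, x_loc=2.5, x_std=0.5)

# Accounting for location uncertainty inflates the predictive variance,
# so the model no longer claims more confidence than the data supports.
assert v_exact[0] < v_unc
```

The key point is the law-of-total-variance line: even a crude Monte Carlo over the unknown sample location widens the predictive uncertainty, which is exactly the behaviour an overconfident model lacks.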
We then looked at how we could get an AI system not only to model pollution by many different chemicals at once but also to learn how the amounts of these chemicals relate to each other. By combining two popular AI methods, we developed a new hybrid method that can predict the amount of a chemical much more accurately when other chemicals have been measured nearby.
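The idea of borrowing strength across chemicals can be sketched with an intrinsic coregionalisation model, where one cross-chemical covariance matrix B couples a shared spatial kernel across outputs. This is a minimal stand-in for the hybrid method described above; the correlation value in B and the toy measurements are invented for illustration:

```python
import numpy as np

def rbf(a, b, ls=1.0):
    """Squared-exponential spatial kernel (1-D locations)."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / ls) ** 2)

def icm_kernel(x1, t1, x2, t2, B, ls=1.0):
    """Intrinsic coregionalisation: cov = B[task_i, task_j] * k(x_i, x_j)."""
    return B[np.ix_(t1, t2)] * rbf(x1, x2, ls)

def gp_posterior(x_tr, t_tr, y_tr, x_s, t_s, B, noise=1e-4):
    """Posterior mean and variance of a multi-output GP at (x_s, t_s)."""
    K = icm_kernel(x_tr, t_tr, x_tr, t_tr, B) + noise * np.eye(len(x_tr))
    Ks = icm_kernel(x_tr, t_tr, x_s, t_s, B)
    Kss = icm_kernel(x_s, t_s, x_s, t_s, B)
    mean = Ks.T @ np.linalg.solve(K, y_tr)
    var = np.diag(Kss - Ks.T @ np.linalg.solve(K, Ks))
    return mean, var

# Task 0 = a previously measured chemical, task 1 = the chemical of
# interest; B encodes a strong (made-up) correlation between them.
B = np.array([[1.0, 0.9],
              [0.9, 1.0]])

# Chemical 1 measured only at the ends of a transect...
x_b = np.array([0.0, 4.0]); t_b = np.array([1, 1]); y_b = np.array([0.2, 0.8])
# ...while chemical 0 was also measured in the middle,
# right where we want to predict chemical 1.
x_a = np.array([2.0]);      t_a = np.array([0]);    y_a = np.array([0.5])

x_s = np.array([2.0]); t_s = np.array([1])
_, var_alone = gp_posterior(x_b, t_b, y_b, x_s, t_s, B)
_, var_joint = gp_posterior(np.concatenate([x_b, x_a]),
                            np.concatenate([t_b, t_a]),
                            np.concatenate([y_b, y_a]),
                            x_s, t_s, B)

# A nearby measurement of the correlated chemical sharply reduces
# the uncertainty about the chemical that was not measured there.
assert var_joint[0] < var_alone[0]
```

The final comparison shows the mechanism in miniature: the cross-chemical entry of B lets a measurement of one chemical tighten the prediction for another at the same location.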
Finally, we discovered that predicting soil pollution using AI or any other statistical method is not possible with today's data on soil pollution. This is because soil samples are collected too far apart at polluted sites, leading to blind spots between the samples that could, potentially, be heavily polluted. The required distance between soil samples is determined by law, so we will draft new proposals for these regulations and inform policymakers about the issue.
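To see how sample spacing creates blind spots, the toy simulation below estimates how often a circular hotspot falls entirely between the points of a square sampling grid. The grid spacing and hotspot size are invented for illustration and are not taken from any actual regulation:

```python
import random

def miss_probability(spacing, hotspot_radius, trials=20000, seed=1):
    """Monte Carlo estimate of the chance that a circular hotspot,
    dropped uniformly into one cell of a square sampling grid,
    contains none of the four surrounding sample points.
    Illustrative geometry only."""
    rng = random.Random(seed)
    corners = [(0.0, 0.0), (spacing, 0.0), (0.0, spacing), (spacing, spacing)]
    misses = 0
    for _ in range(trials):
        # Random hotspot centre within one grid cell
        cx = rng.uniform(0, spacing)
        cy = rng.uniform(0, spacing)
        hit = any((cx - px) ** 2 + (cy - py) ** 2 <= hotspot_radius ** 2
                  for px, py in corners)
        misses += not hit
    return misses / trials

# A hotspot 10 m across is almost always missed by a 30 m sampling grid
p = miss_probability(spacing=30.0, hotspot_radius=5.0)
assert p > 0.8
```

Analytically, for a hotspot radius r no larger than half the spacing d, the hit probability is just the covered area fraction pi*r^2/d^2 (about 9% here), which the simulation reproduces; denser grids or wider hotspots are the only ways to shrink the blind spots.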
Our work on including uncertainty in AI methods is also very general and can be used in many different situations. When it comes to pollution, this means that methods will have a much better understanding of how confident they should be about their predictions. We can, therefore, avoid having methods that claim that there is only very little pollution, when in fact there is a lot, just because they did not consider how precisely we knew their inputs.
Finally, we developed a new method that can not only model many different types of pollution at once but also find, and take advantage of, much more complicated correlations between them than previous methods could. For example, we can train the method to find correlations between previously measured pollution and new types of pollution. We can then predict how much of the new pollution there is anywhere we have measurements of the old type, without having to go out and obtain new, expensive samples.
We also discovered, unfortunately, that it is not possible to use AI or any other statistical method to predict soil pollution based on the samples we are collecting today. We are simply taking soil samples too far apart, which has large societal implications, as we may miss hotspots of pollution. By drafting new guidelines for soil sampling based on the findings in this project, we hope to draw policymakers' attention to this problem.
It may not be possible to use AI to predict soil pollution because of the current regulations, but the methods we developed are completely general and can be used for many other types of data. For instance, they can be used to predict air pollution or to help in modelling climate change.