
Artificial Intelligence for Air Quality

Periodic Reporting for period 2 - IntelliAQ (Artificial Intelligence for Air Quality)

Reporting period: 2020-04-01 to 2021-09-30

Status 31 May 2021

Artificial Intelligence (AI) is experiencing a wave of enthusiasm, since ground-breaking results have been published on cognitive problems such as image and speech recognition, automated language translation, robotics, and strategic games. This has become possible because of recent advances in massive data processing capabilities (“Big Data”) and the development of deep learning (DL) machine learning architectures, which “learn” millions of parameters. Even though machine learning in general has been in use for many years, the uptake of DL in environmental science has been slow and IntelliAQ is one of the first projects in the atmospheric sciences to fully embrace the potential of modern big data processing and DL. IntelliAQ aims at shifting the analysis of global air pollutant observations to a new level and will provide a basis for the future development of innovative air quality services with robust scientific underpinning.
Air quality is determined by several factors including air pollutant emissions, chemical transformations, transport processes, and weather. To analyse and understand air quality data and assess changes in air pollution levels, all of these factors must be taken into account. Air pollutant concentrations exhibit complex, time-dependent spatial patterns. Therefore, complex DL architectures and comprehensive datasets are needed when we want to use AI for the analysis of air quality and build air quality predictions based on modern machine learning. Specifically, IntelliAQ has three main objectives (see also Figure 1):
1. to develop novel spatial and temporal interpolation methods using deep neural networks in order to expand the coverage of historic and recent data while preserving fine-scale structures down to the street level,
2. to develop an innovative air quality forecasting concept based on deep learning,
3. to explore the use of deep neural networks to assess the quality of air pollution data and establish new, robust techniques for automated outlier detection and data screening.

Figure 1: The data fusion and machine learning concept of the IntelliAQ project (from Schultz et al., 2020).

In addition to the main objectives, the project aims at:
- Expanding the collection of global surface air pollution measurements, especially ozone, thereby extending the well-recognized database of the Tropospheric Ozone Assessment Report (TOAR), which is hosted at the Jülich Supercomputing Centre,
- Contributing to the development of new deep learning techniques through the application and analysis of strengths and weaknesses of current methods and neural network architectures.
IntelliAQ combines fundamental research on deep learning and atmospheric science with the development of cutting-edge technology and software with a strong commitment to Open Science and Open Data. The project is close to delivering a unique data infrastructure which allows the seamless combination of local air pollution measurements with high-resolution earth observation data of the land surface and with many years of weather data from leading meteorological centres. The critical evaluation of the deep learning results constitutes an important cornerstone of IntelliAQ and lays a foundation for trustworthy artificial intelligence for environmental data analysis.

The major achievements in the IntelliAQ project fall into two categories: (1) building a new data infrastructure and (2) exploring various deep learning techniques and model architectures.

The IntelliAQ data infrastructure is based on the well-recognized database and web services of the Tropospheric Ozone Assessment Report (TOAR; Schultz et al., 2017). Work in the project has so far:
• extended the database to include the storage of air pollution concentrations other than ozone and meteorological data,
• redesigned the database structure to enhance the database FAIRness and allow for better documentation and provenance tracking (Schröder et al., 2019, 2020; Mozaffari et al., 2020; Schröder et al., manuscript in preparation),
• investigated the performance of the TOAR database and explored various options to improve the performance of online analytics (Betancourt et al., 2020, 2021a),
• created comprehensive documentation of the database and the associated web services with the goal to obtain a Core Trust Seal certification (submission of documents planned for August 2021),
• created web services for interfacing deep learning applications with the TOAR database,
• developed a web service for automated calculation of ozone-vegetation fluxes, which determine plant damage, based on the TOAR database,
• built a suite of web-based data services for various geospatial datasets (example in Figure 2) which are used to characterize the location and environment of air quality monitoring stations globally in a homogeneous manner (Schultz et al., 2017),
• developed and implemented largely automated workflows to ingest new data into the TOAR database, in particular the large collection of air quality data from the OpenAQ initiative (a subcontract was given to OpenAQ to enhance their web services and metadata for this purpose),
• developed novel modular software for automated quality control of atmospheric data based on state-of-the-art statistical methods (Kaffashzadeh et al., 2019a, 2019b, 2020).

Figure 2: Image of nighttime lights from the U.S. Defense Meteorological Satellite Program (DMSP) as retrieved from the geospatial data service implemented in the context of IntelliAQ and TOAR.

Concerning the exploration of deep learning methods and architectures the work has so far focused on three main lines of research:
• forecasting of air quality time series with deep neural networks based on inception blocks (Schultz, 2019; Kleinert et al., 2021); two doctoral and two master's theses on this topic are in preparation, investigating the role of air mass transport in forecast quality (doctoral thesis by F. Kleinert), the potential of applying time filters to better capture slow and fast variations in time series forecasts (doctoral thesis by L. Leufen), the generalisation of forecasts across variables, i.e. multi-purpose learning (master's thesis by F. Weichselbaum), and the use of classifiers to better capture extreme events in air quality forecasts (master's thesis by V. Gramlich). Preliminary work on extreme values was presented by Gong et al. (2019). A user-friendly, adaptable software tool for time series forecasting through deep learning was developed and published together with the source code (Leufen et al., 2021).
• forecasting of meteorological data with video prediction methods (Gong et al., 2020; and several manuscripts in preparation); to circumvent the problem of missing data values and to become familiar with different state-of-the-art deep learning techniques for video prediction, we decided to look at 2-metre temperature predictions from numerical weather model fields as a first target. This involved the development of terabyte-scale data handling routines and parallel deep learning codes which can run efficiently on the Jülich supercomputer systems (Kesselheim et al., 2021). Since the start of the project, considerable progress has been made in the field of video prediction, and we decided to adopt Stochastic Adversarial Video Prediction (SAVP; Lee et al., 2018) as the main target architecture due to its superior performance. SAVP results outperform the forecast results from other methods such as Convolutional LSTM, variational auto-encoders, or generative adversarial networks. Evaluation of the forecast results has to follow standards set by the meteorological community, which differ from the quality scores used in the video prediction community. A master's thesis has been completed on the topic of video prediction for meteorological data (Hußmann, 2019).
• interpolation of air quality data with guidance by geospatial data (Betancourt et al., 2021b); in contrast to classical statistical and geospatial interpolation approaches (e.g. kriging, inverse distance weighting or radial basis functions), we are exploiting the information from a variety of geophysical data sources (e.g. climatic zone, population density, land cover type, etc.), which determines a large part of the air quality statistics at a given location. Classical machine learning methods have been employed to map ozone concentration statistics from the thousands of measurement sites in the TOAR database to arbitrary locations on Earth. Explainable AI methods developed by Meyer and Pebesma (2020) have been used to ensure the robustness of results and determine the area of applicability. The data and methods used in this activity have been provided as an open-source benchmark (Betancourt et al., 2021b), thereby enabling machine learning scientists to develop their skills on a well-defined problem set with environmental data. C. Betancourt will finish her doctoral thesis on this topic in 2022.
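
For reference, inverse distance weighting, one of the classical baselines named above, can be sketched in a few lines. This is a minimal illustration and not IntelliAQ code: the function name `idw_interpolate`, the planar coordinates and the station values are all hypothetical. The estimate depends only on the distances to the stations; no geophysical covariates enter, which is the limitation that the machine learning mapping is meant to address.

```python
import math


def idw_interpolate(stations, target, power=2.0):
    """Estimate a value at `target` (x, y) from (x, y, value) station tuples
    by inverse distance weighting."""
    weighted_sum = 0.0
    weight_total = 0.0
    for x, y, value in stations:
        distance = math.hypot(x - target[0], y - target[1])
        if distance == 0.0:
            return value  # target coincides with a station
        weight = 1.0 / distance ** power
        weighted_sum += weight * value
        weight_total += weight
    return weighted_sum / weight_total


# Hypothetical ozone values (ppb) at three made-up station coordinates.
stations = [(0.0, 0.0, 30.0), (2.0, 0.0, 50.0), (0.0, 2.0, 40.0)]
estimate = idw_interpolate(stations, (1.0, 0.0))
print(round(estimate, 6))
```

With these made-up inputs, the two nearest stations dominate the estimate and the more distant third station contributes only a small weight.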

Examples from the work described above are reproduced in Figure 3 below.

Figure 3: Examples of deep learning results on air quality and meteorological data. Left: Seasonal distributions of original near-surface ozone measurement values and neural network forecasts over 1-4 days. The forecasts have only small biases, but tend to converge toward the climatological mean with longer lead times (from Kleinert et al., 2021). Center: Mean square error of 2-metre temperature predictions over Europe for a simple persistence model and two deep learning architectures from video prediction. The more complex SAVP architecture consistently outperforms the two reference models (from Gong et al., manuscript in preparation). Right: Area of applicability for the spatial mapping of annual ozone concentrations from thousands of measurement sites around the world (from Betancourt et al., manuscript in preparation).
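
The persistence reference in the centre panel can be illustrated with a short sketch. It assumes the usual definition of persistence (the forecast for a later time equals the most recent observed value) and uses a made-up temperature series; the function names are hypothetical and this is not the evaluation code used in the project.

```python
def persistence_forecast(series, lead):
    """Forecast that the value `lead` steps ahead equals the current value."""
    if lead < 1:
        raise ValueError("lead must be at least 1")
    return series[:-lead]


def mean_square_error(predictions, observations):
    """Average squared difference between paired predictions and observations."""
    pairs = list(zip(predictions, observations))
    return sum((p - o) ** 2 for p, o in pairs) / len(pairs)


# Made-up 2-metre temperatures (K) at consecutive time steps.
temperatures = [280.0, 281.0, 283.0, 282.0, 284.0]
lead = 1
predicted = persistence_forecast(temperatures, lead)
observed = temperatures[lead:]
mse = mean_square_error(predicted, observed)
print(mse)
```

A learned model is only useful at a given lead time if its error is lower than that of this trivial baseline.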

Recently, a new doctoral student has joined the team to explore the potential of physics-informed neural networks (e.g. Raissi et al., 2019) for meteorological and air quality forecasting. This concept reduces the need for excessively large data samples and holds some promise of increasing the robustness of the deep learning results. Work on deep learning methods for data quality control has not started yet, as this requires the successful completion of the interpolation and forecasting tasks.

It is worth pointing out that IntelliAQ commenced in a newly founded research group of 4 people. In the meantime, the team has grown to a diverse group of 20 people (7 of whom are paid through the project), because the initial work in the project has opened up several new funding opportunities. Two major national grants (DeepRain and KI:STE) and a new H2020 EU grant (MAELSTROM) were secured as a result of IntelliAQ. These projects provide synergies with IntelliAQ in terms of the development of software and deep learning methods, while covering different but related aspects of environmental science.

IntelliAQ work has been presented at several conferences and featured several times as a highlight presentation or invited talk. An overview article about the project appeared in The Project Repository Journal (Schultz, 2020). The thorough investigation of deep learning challenges in the context of environmental data has led to a thought-provoking article in Philosophical Transactions (Schultz et al., 2021), which has been downloaded over 8000 times so far.

As one of the first projects tackling air quality issues with deep learning methods, IntelliAQ had to overcome a number of initial challenges. A review of the machine learning literature published until 2017 revealed that many over-optimistic results had been published because of subtle flaws in the applied methods and evaluation metrics. In particular, aspects of data preparation, which typically assume independent and identically distributed (i.i.d.) data in common machine learning applications, needed to be revisited, as environmental data is almost always auto-correlated and therefore not suitable for random sampling (Kleinert et al., 2021; Schultz et al., 2021). The development of proper data preparation methods and suitable evaluation scores adopted from the meteorological community in itself represents considerable progress beyond the state of the art. Furthermore, Kleinert et al. (2021) demonstrated improved generalisation skills of their neural network compared to other attempts at time series forecasting, which we attribute to the use of a more complex network and a much larger dataset for training and evaluation. The application of video frame prediction methods to meteorological data was also a novelty when we first tried it. While similar studies were then conducted by other groups, primarily in China, we have been the first to apply the recent SAVP architecture to meteorological forecasting. The preparations for this took some time due to the need to develop efficient parallel data preparation and training software. In the interpolation task, too, IntelliAQ has moved the boundaries of research, as AQ-Bench (Betancourt et al., 2021b) was the first study to successfully relate a set of geophysical variables to air quality statistics with the help of machine learning. The adoption of explainable AI concepts from the field of remote sensing analysis is also a novelty in air quality research.
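
To illustrate the data preparation issue, the sketch below keeps training and test samples in contiguous time blocks instead of drawing them at random. It is a generic illustration under stated assumptions, not the procedure of Kleinert et al. (2021): the function name `chronological_split`, the split fraction and the optional `gap` (samples discarded between the blocks so that neighbouring, correlated values do not end up on both sides) are hypothetical choices.

```python
def chronological_split(samples, train_fraction=0.75, gap=0):
    """Split time-ordered samples into a contiguous training block followed
    by a contiguous test block, discarding `gap` samples in between."""
    split_index = int(len(samples) * train_fraction)
    train = samples[:split_index]
    test = samples[split_index + gap:]
    return train, test


# Stand-in for 100 consecutive hourly measurements (values are just indices).
hourly_values = list(range(100))
train, test = chronological_split(hourly_values, train_fraction=0.75, gap=5)
print(train[-1], test[0], len(test))
```

With a random split, a test sample would usually have near-identical neighbours in the training set, which inflates the apparent skill of a model on auto-correlated data.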

The IntelliAQ data infrastructure is quite unique in its quest to actually implement a fully FAIR architecture of a database and associated web services. While there are many data archives around which provide FAIR access to data in the form of archived files, there are only very few attempts to provide FAIR services and data analytics directly from the database. Progress beyond state-of-the-art is achieved through a unique combination of several state-of-the-art concepts and through the development of custom-tailored services specifically for the TOAR community. Online calculation of air quality metrics and ozone flux diagnostics and an automated data quality control tool which will be made available to users as a service are perhaps the most prominent novel and unique developments of our data infrastructure. Finally, a strong focus on good software development practices and the full dedication to Open Science and Open Data have led to reusable tools and datasets which set new standards in the field of environmental machine learning.

One expected outcome of the project is the provision of enhanced data analysis tools and services to the TOAR community. This work is well on track and we are confident of fulfilling all project objectives. The next steps with regard to the data infrastructure development include the consolidation of the new data infrastructure, including release and publication. Regarding the expected outcomes of deep learning as the main focus of the project, we are optimistic that we will reach most project objectives. The time series forecasting will be integrated with the geospatial mapping (interpolation) work to produce time-resolved interpolated maps of air pollutant concentrations. While the focus so far has been on ozone as the main air pollutant addressed in TOAR, we also plan to apply our methods to other air pollutants (particulate matter and nitrogen oxides). The video prediction methods are currently under evaluation, and it is not yet clear how well they can be used to predict air pollutant concentrations or whether it will be possible to deal with the missing value problem when using measurement data from air quality stations. Through cross-fertilisation with the other research projects in our group, we expect to make further significant progress in the application of deep learning to air quality data (and atmospheric data in general), but due to the rapid developments in this research area it is very hard to predict where this will take us and what exactly will be possible by the end of IntelliAQ. As discussed in Schultz et al. (2021), the application of machine learning to environmental data poses a number of specific challenges which are not at the top of the agenda of the key groups developing new machine learning methods. The Covid-19 pandemic has made it very difficult to connect to such groups, so our impact on deep learning method development may be smaller than originally expected.
On the other hand, IntelliAQ has placed our group among the top players in environmental machine learning and the new collaborations established for example in the MAELSTROM project might provide an additional boost to our activities and results. In any case we are confident that the main project objectives can be accomplished and that we will have a much better understanding of the potential and limitations of deep learning in air quality research and management.

Betancourt, C., Schröder, S., Hagemeier, B., Schultz, M.G. (2020): Performance analysis and optimization of a TByte-scale atmospheric observation database [virtual display EGU2020-13637]. European Geosciences Union General Assembly 2020, online event, 04-08 May 2020.
Betancourt, C., Hagemeier, B., Schröder, S., Schultz, M.G. (2021a): Context aware benchmarking and tuning of a TByte-scale air quality database and web service, Earth Sci. Inform.
Betancourt, C., Stomberg, T., Stadtler, S., Roscher, R., and Schultz, M.G. (2021b): AQ-Bench: A Benchmark Dataset for Machine Learning on Global Air Quality Metrics, Earth Syst. Sci. Data Discuss. [preprint], in press.
Gong, B., Schultz, M.G., and Kleinert, F. (2019): Prediction of daily maximum ozone threshold exceedances by preprocessing and ensemble artificial intelligence techniques [oral EGU2019-10543]. European Geosciences Union General Assembly 2019, Vienna, Austria, 07-12 April 2019.
Gong, B., Hußmann, S., Mozaffari, A., Vogelsang, J., and Schultz, M. (2020): Deep learning for short-term temperature forecasts with video prediction methods [virtual display EGU2020-17748]. European Geosciences Union General Assembly 2020, online event, 04-08 May 2020.
Hußmann, S. (2019): Deep Learning for Future Frame Prediction of Weather Maps, Master thesis, Humboldt University Berlin.
Kaffashzadeh, N., Schröder, S., and Schultz, M.G. (2019a): A novel concept for Automated Quality Control of Atmospheric Time Series [poster EGU2019-14392]. European Geosciences Union General Assembly 2019, Vienna, Austria, 07-12 April 2019.
Kaffashzadeh, N., Kleinert, F., and Schultz, M.G. (2019b): A New Tool for Automated Quality Control of Environmental Time Series (AutoQC4Env) in Open Web Services. In: Abramowicz W., Corchuelo R. (eds) Business Information Systems Workshops. BIS 2019. Lecture Notes in Business Information Processing, vol 373. Springer, Cham.
Kaffashzadeh, N., Chang, K.L., Schröder, S., and Schultz, M.G. (2020): A Statistical Model for Automated Quality Assessment of the TOAR-II [virtual display EGU2020-13357]. European Geosciences Union General Assembly 2020, online event, 04-08 May 2020.
Kesselheim, S., Herten, A., Krajsek, K., Ebert, J., Jitsev, J., Cherti, M., Langguth, M., Gong, B., Stadtler, S., Mozaffari, A., Cavallaro, G., Sedona, R., Schug, A., Strube, A., Kamath, R., Schultz, M.G., Riedel, M., Lippert, T. (2021): JUWELS Booster: A Supercomputer for Large-Scale AI Research. ISC High-Performance Conference Digital, 24 June – 02 July, 2021 (accepted paper).
Kleinert, F., Gong, B., Götz, M., and Schultz, M.G. (2019): Near Surface Ozone Predictions Based on Multiple Artificial Neural Network Architectures [oral, EGU2019-12541]. European Geosciences Union General Assembly 2019, Vienna, Austria, 07-12 April 2019.
Kleinert, F., Leufen, L.H., and Schultz, M.G. (2021): IntelliO3-ts v1.0: a neural network approach to predict near-surface ozone concentrations in Germany, Geosci. Model Dev., 14, 1–25.
Lee, A.X., Zhang, R., Ebert, F., Abbeel, P., Finn, C., Levine, S. (2018): Stochastic adversarial video prediction.
Leufen, L.H., Kleinert, F., and Schultz, M.G. (2021): MLAir (v1.0) – a tool to enable fast and flexible machine learning on air data time series, Geosci. Model Dev., 14, 1553–1574.
Meyer, H. and Pebesma, E. (2020): Predicting into unknown space? Estimating the area of applicability of spatial prediction models, arXiv:2005.07939v1.
Mozaffari, A., Schröder, S., Apweiler, S., Saini, R., Hagemeier, B., Schultz, M. (2020): FAIRness in the multi-services data infrastructure of the Tropospheric Ozone Assessment Report (TOAR) and Artificial Intelligence for Air Quality (IntelliAQ) project. RDA 15th Plenary Meeting, Melbourne, Australia and virtual, 18-20 March 2020. Last accessed: 28 May 2021.
Raissi, M., Perdikaris, P., and Karniadakis, G.E. (2019): Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations, J. Comp. Phys. 378, 686-707.
Schröder, S., Apweiler, S., Saini, R., Hagemeier, B., Schultz, M.G. (2019): Enhancing FAIRness of global air quality data: The Tropospheric Ozone Assessment Report database [oral]. GeoMünster conference, Münster, Germany, 22-25 Sept 2019.
Schröder, S., Mozaffari, A., Apweiler, S., Saini, R., Hagemeier, B., Schultz, M.G. (2020): FAIRness in the multi-service data infrastructure of the Tropospheric Ozone Assessment Report (TOAR) and Artificial Intelligence for Air Quality (IntelliAQ) project [poster]. RDA Germany Conference, Potsdam, Germany, 25-27 February 2020.
Schultz, M.G. and 96 co-authors (2017): Tropospheric Ozone Assessment Report: Database and Metrics Data of Global Surface Ozone Observations. Elem. Sci. Anth., 5:58.
Schultz, M.G. (2019): IntelliAQ and DeepRain: Using Deep Learning Approaches in Weather and Air Quality Forecasts [oral]. Workshop on Machine Learning in Weather and Climate Research, Oxford, UK, 02-05 Sept 2019.
Schultz, M. (2020): Artificial Intelligence for Air Quality, The Project Repository Journal (PRj), Vol. 6, p. 30-32, June 2020. Last accessed: 29 May 2021.
Schultz, M.G., Betancourt, C., Gong, B., Kleinert, F., Langguth, M., Leufen, L.H., Mozaffari, A., and Stadtler, S. (2021): Can deep learning beat numerical weather prediction? Phil. Trans. R. Soc. A, 379, 20200097.