Periodic Reporting for period 1 - AUTOCD (Building the First Automated Causal Discovery Platform)
Reporting period: 2022-09-01 to 2024-02-29
Data may contain patterns, correlations, and associations that suggest causal relationships. Typically, causality is established by controlled experiments, interventions, treatments, or A/B tests, but these are often difficult to perform, costly, and time-consuming. The field of causal discovery strives to distinguish causal relations from statistical correlations without controlled experiments, or with only a limited number of them. While there are numerous algorithms and a very rich theory for inducing causal relations and causal models from data, these are quite esoteric. Their correct and optimal use requires deep theoretical knowledge, hands-on expertise with these technologies, and the knowledge to interpret the results correctly.
The goal of the AutoCD project is to create the theory, algorithms, and technology, and to embed them in a software library, so that an analyst without expertise in causal discovery can correctly and optimally apply causal discovery algorithms in practice; induce causal models and relations; visualize, understand, and interpret the results; and finally use the resulting causal models to infer how to intervene in a system to achieve certain outcomes. AutoCD aims to bring causal discovery closer to any practitioner by automating several parts of the causal discovery and causal inference process.
AutoCD has already proven valuable in applied projects that use commercial 5G telecommunication data to derive control policies for optimizing the download throughput of the network. More applied projects are planned or underway, with interest shown by industry as well as by academic and research collaborators.
AutoCD is currently released as an open-source software library. Several scientific publications describing its architecture, engineering, algorithms, and applications are underway. Our team is continuously working to improve and enrich AutoCD so that it reaches a high level of technological maturity and covers an increasingly broad scope of application scenarios, types of data, and functionalities. Eventually, the objective is for AutoCD to become a commercial product while its code base remains open source, and to create a user community that contributes to the development of its free functionalities. In addition, the objective is for AutoCD to be applied to several real-world problems and datasets to induce important causal knowledge that will benefit both individuals and society.
1) Literature review of the available causal discovery algorithms for several different settings, scenarios, types of data (e.g. cross-sectional, time-series, continuous, discrete, mixed) and causal assumptions (causal sufficiency, no causal sufficiency, etc.). Algorithms for causal discovery, modeling, inference, utility functions, visualizations, and all other required subsystems and functionalities were researched.
2) Testing of the publicly available algorithms to explore their limits, functionalities, computational scalability, and other properties. This is a necessary step before these methods are incorporated into our system. Numerous bugs and problems in publicly available algorithms and libraries were discovered during this step.
3) Design of the architecture of AutoCD, deciding the main modules, classes, interconnections, and functionalities to prioritize.
4) Implementation of the functionalities of AutoCD. Implementation included designing, inventing, and implementing new algorithms that tie together different components, as well as incorporating existing components and algorithms from the literature.
5) Implementation and testing of a synthetic data generator capable of producing realistic synthetic data from a known causal model. Such a generator is important for testing the capabilities of causal discovery algorithms on realistic data stemming from known gold-standard causal models.
6) Engineering, preprocessing, transforming, and experimenting with real data from a 5G commercial network to make the data ML-ready.
7) Testing and application of AutoCD on synthetic data; testing and application of AutoCD on the real industrial dataset. Induction of new causal knowledge, interpretation of the results for non-expert users, and knowledge transfer to the industrial associate.
8) Packaging of AutoCD as a publicly available, open-source library that includes the synthetic data, the synthetic data generator, tutorials, and documentation.
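To illustrate the idea behind step 5, the following is a minimal sketch of sampling data from a known causal model. It is not AutoCD's generator: it assumes a linear model with Gaussian noise, which is far simpler than the realistic data the project targets, and the variable names, coefficients, and function names (`CAUSAL_MODEL`, `topological_order`, `sample`) are hypothetical.

```python
import random

# Hypothetical gold-standard causal model (not taken from AutoCD): each variable
# maps to its direct causes and the linear coefficient of each cause.
CAUSAL_MODEL = {
    "X": {},                      # no causes: pure noise
    "Z": {"X": 0.8},              # X -> Z
    "Y": {"X": 0.5, "Z": -1.2},   # X -> Y and Z -> Y
}


def topological_order(model):
    """Order variables so that every cause precedes its effects."""
    order, done, active = [], set(), set()

    def visit(node):
        if node in done:
            return
        if node in active:
            raise ValueError("causal model contains a cycle")
        active.add(node)
        for parent in model[node]:
            visit(parent)
        active.discard(node)
        done.add(node)
        order.append(node)

    for node in model:
        visit(node)
    return order


def sample(model, n_samples, noise_sd=1.0, seed=0):
    """Draw rows by evaluating each variable after its causes, plus Gaussian noise."""
    rng = random.Random(seed)
    order = topological_order(model)
    rows = []
    for _ in range(n_samples):
        row = {}
        for var in order:
            signal = sum(coef * row[parent] for parent, coef in model[var].items())
            row[var] = signal + rng.gauss(0.0, noise_sd)
        rows.append(row)
    return rows


data = sample(CAUSAL_MODEL, n_samples=1000)
```

Because the generating graph is known, the output of any causal discovery algorithm run on `data` can be compared edge by edge with `CAUSAL_MODEL`.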
The main achievement of the project is the solution of numerous engineering, technological, and algorithmic problems, which led to the implementation of a software library that automates several steps of causal discovery. In addition, the library has been successfully applied to an important and challenging industrial problem.
While there are numerous algorithms for handling different aspects of causal discovery and causal inference, as well as multiple libraries of algorithms, any causal analysis requires extensive knowledge of the theory of the algorithms, the details of the implementation, and the interpretation of the results. AutoCD is not another causal discovery algorithm but a system that ties together all aspects of causal discovery and inference to fully automate the process for the non-expert analyst. AutoCD performs automated predictive modeling with concurrent feature selection for an outcome of interest, which reduces dimensionality and makes the problem amenable to causal modeling. It automatically tries numerous causal discovery algorithms and tunes their hyper-parameters to find the configuration that produces the optimal causal model (and set of equivalent causal models) for the problem at hand. It then provides functionalities for visualizing, interpreting, and applying the causal model to control the outcome and carry out the optimal intervention policy. There are just a handful of other such systems available, and they do not provide functionalities for automatically tuning the choice of causal discovery algorithm and its hyper-parameters. The system is released as open source to instigate research in the field.
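The selection step described above can be pictured as a search over (algorithm, hyper-parameter) configurations, each scored by some criterion. The sketch below shows that general pattern only; it does not reproduce AutoCD's tuning method, which this report does not detail. All names are hypothetical, the "learner" is a toy correlation-threshold stand-in rather than a real causal discovery algorithm, and the score (F1 against a known reference skeleton) is an assumption that fits only a synthetic benchmark where the true graph is available.

```python
import random
from itertools import combinations, product


def pearson(xs, ys):
    """Plain Pearson correlation between two equally long lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def threshold_skeleton(data, threshold):
    """Toy stand-in learner: link two variables when |correlation| > threshold."""
    return {
        frozenset((a, b))
        for a, b in combinations(sorted(data), 2)
        if abs(pearson(data[a], data[b])) > threshold
    }


def f1_against_reference(edges, reference):
    """Assumed score: F1 of the learned edges against a known reference skeleton."""
    true_positives = len(edges & reference)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(edges)
    recall = true_positives / len(reference)
    return 2 * precision * recall / (precision + recall)


def enumerate_configurations(search_space):
    """Expand {algorithm: {param: [values]}} into (algorithm, params) pairs."""
    configs = []
    for algorithm, grid in search_space.items():
        names = sorted(grid)
        for values in product(*(grid[name] for name in names)):
            configs.append((algorithm, dict(zip(names, values))))
    return configs


def select_best(search_space, learners, score, data):
    """Run every configuration and keep the first one with the highest score."""
    best = None
    for algorithm, params in enumerate_configurations(search_space):
        model = learners[algorithm](data, **params)
        value = score(model)
        if best is None or value > best["score"]:
            best = {"score": value, "algorithm": algorithm,
                    "params": params, "model": model}
    return best


# Synthetic example: A influences B, while C is independent of both.
rng = random.Random(0)
a = [rng.gauss(0, 1) for _ in range(500)]
data = {"A": a,
        "B": [x + rng.gauss(0, 1) for x in a],
        "C": [rng.gauss(0, 1) for _ in range(500)]}
reference = {frozenset(("A", "B"))}

search_space = {"threshold_skeleton": {"threshold": [0.0, 0.3, 0.95]}}
learners = {"threshold_skeleton": threshold_skeleton}
best = select_best(search_space, learners,
                   lambda edges: f1_against_reference(edges, reference), data)
```

On real data no reference graph exists, so the scoring function would have to be replaced by a criterion that does not rely on ground truth; the search loop itself stays the same.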
The engineering of the system, its design, and its validation are being prepared for multiple scientific publications.
AutoCD has been applied to 5G telecommunication data to induce plausible causal relationships that can be employed to optimize the output of the network. In addition, several other applications of AutoCD are under development, including applications to causal influences on the outcome of traffic accidents and the severity of the injuries, to causal influences of gas and oil pipelines, and others.
AutoCD is meant to lead to a full-fledged product, the creation of a community that could add, modify, and contribute to its development, and its commercial exploitation.