Machine learning improves the estimation of causal effects in regression discontinuity designs (RDD)

An EU-funded project has extended the range of methodological tools available to researchers working with regression discontinuity designs by drawing on modern machine learning techniques.

Fundamental Research

A valuable methodological tool for studying cause-and-effect relationships, the regression discontinuity design (RDD) plays a significant role in a growing number of research programmes. To refine RDD further and make its analysis more accurate, the RD-ADVANCE project examined and developed new methods for studies that use this evaluation design. “RDDs allow researchers to learn about causal relationships in certain settings where randomised controlled trials are not feasible and observational data has to be used instead,” explains Christoph Rothe, professor of economics and RD-ADVANCE project coordinator. “The newly developed methods will allow researchers in areas such as economics, education and public health to better quantify the causal impact of different types of policy measures,” adds Rothe.

Developing new methods for RDD

RD-ADVANCE, coordinated by the University of Mannheim in Germany, was divided into three parts, each evaluating a different element of RDD. In the first part, team members developed methods to incorporate covariates (independent variables that can influence the outcome of a given statistical trial) into the analysis of regression discontinuity designs. This was carried out with the help of AI to allow for more precise conclusions. “Specifically, machine learning was used to extract information from a potentially large number of covariates, which in turn is then used to reduce the variance of regression discontinuity estimates of causal effects,” notes Rothe. With the expanded techniques, Rothe says, researchers will be able to reduce the uncertainty that results from limited data and, therefore, give better policy advice.

In the second part of the project, the team studied the confidence intervals (CIs) commonly used in RDD analysis. These CIs are based on standard errors clustered by the running variable, the variable that determines who receives the treatment. Their aim is to quantify the uncertainty around the treatment effects being studied. However, the researchers found that when the running variable is discrete, these commonly used CIs may perform poorly and may not accurately represent the true uncertainty around the estimated treatment effects. To address this issue and provide a more reliable approach, the project team developed two new CIs that can help researchers make more accurate assessments of causal effects.
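To make the covariate idea concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the project's published estimator: the data are simulated, plain least squares stands in for a machine learning model, and the function names (`rd_estimate`, `cross_fit_predictions`, `simulate`) are hypothetical. The outcome is first predicted from the covariates using cross-fitting, the prediction is subtracted, and an ordinary RDD estimate is then computed on what remains.

```python
import numpy as np

def rd_estimate(x, y, bandwidth=0.5):
    """Sharp RDD estimate at cutoff 0: jump between two local linear fits."""
    keep = np.abs(x) <= bandwidth
    xs, ys = x[keep], y[keep]
    sw = np.sqrt(1.0 - np.abs(xs) / bandwidth)  # sqrt of triangular kernel weights
    d = (xs >= 0).astype(float)                 # treatment indicator
    X = np.column_stack([np.ones_like(xs), d, xs, d * xs])
    beta, *_ = np.linalg.lstsq(X * sw[:, None], ys * sw, rcond=None)
    return beta[1]                              # coefficient on the indicator

def cross_fit_predictions(Z, y, n_folds=5, seed=0):
    """Predict y from covariates Z, fitting each fold's model on the other folds.

    Least squares is an assumed stand-in for a machine learning method.
    """
    rng = np.random.default_rng(seed)
    folds = rng.permutation(len(y)) % n_folds
    Z1 = np.column_stack([np.ones(len(y)), Z])
    pred = np.empty(len(y))
    for k in range(n_folds):
        held_out = folds == k
        coef, *_ = np.linalg.lstsq(Z1[~held_out], y[~held_out], rcond=None)
        pred[held_out] = Z1[held_out] @ coef
    return pred

def simulate(seed, n=1000, tau=2.0):
    """One simulated data set (assumed design): returns (raw, adjusted) estimates."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-1, 1, n)                   # running variable, cutoff at 0
    Z = rng.normal(size=(n, 5))                 # covariates unaffected by treatment
    gamma = np.array([1.0, -1.0, 0.5, 0.0, 0.0])
    y = 1 + 0.5 * x + tau * (x >= 0) + Z @ gamma + rng.normal(0, 0.3, n)
    raw = rd_estimate(x, y)
    adjusted = rd_estimate(x, y - cross_fit_predictions(Z, y, seed=seed))
    return raw, adjusted

# Repeat the simulation to compare the spread of the two estimators.
estimates = np.array([simulate(s) for s in range(100)])
raw_sd, adj_sd = estimates.std(axis=0)
print(f"spread without covariates: {raw_sd:.3f}")
print(f"spread with covariate adjustment: {adj_sd:.3f}")
```

In this simulated setting the covariates explain much of the outcome but are balanced around the cutoff, so removing their contribution leaves the estimated jump centred on the same value while shrinking its spread across repeated samples.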

Accuracy ensured

The RDD approach works with two groups of individuals: a treatment group that receives the intervention and a control group that does not. Participants are assigned to a group according to whether their value of the running variable falls above or below a cutoff point. Comparing outcomes on either side of the cutoff point helps infer the treatment's causal effect. However, in RDD, there is a risk that individuals might deliberately change certain aspects of themselves or their behaviour in order to influence which group they are assigned to. This could undermine the credibility and validity of the RDD approach.

Therefore, to overcome this potential problem and ensure the accuracy of RDD analysis, the third part of the project was dedicated to developing methods for estimation and inference that can account for manipulation in RDD studies. The team established a broad framework to address the issue of manipulation, using nonparametric statistical methods, while also considering additional aspects of the manipulation scenario. For more details on the methods developed, three publications are available on the results page of the project.
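The cutoff comparison described in this section can be sketched in a few lines of Python. This is a textbook-style illustration on simulated data, not one of the methods developed by the project: the function name `rdd_local_linear`, the bandwidth, the triangular kernel and the simulated effect size of 2.0 are all assumptions chosen for the example.

```python
import numpy as np

def rdd_local_linear(x, y, cutoff=0.0, bandwidth=0.5):
    """Estimate the jump in y at the cutoff from separate local linear fits."""
    x = np.asarray(x, dtype=float) - cutoff     # centre the running variable
    y = np.asarray(y, dtype=float)
    keep = np.abs(x) <= bandwidth               # use observations near the cutoff
    x, y = x[keep], y[keep]
    w = 1.0 - np.abs(x) / bandwidth             # triangular kernel weights
    d = (x >= 0).astype(float)                  # 1 = treated side of the cutoff
    # Intercept, jump at the cutoff, and a separate slope on each side.
    X = np.column_stack([np.ones_like(x), d, x, d * x])
    sw = np.sqrt(w)
    beta, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    return beta[1]

# Simulated example: units with x >= 0 are treated, true effect is 2.0.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 2000)
treated = x >= 0
y = 1 + 0.5 * x + 2.0 * treated + rng.normal(0, 0.3, 2000)

estimate = rdd_local_linear(x, y)
print(f"estimated effect at the cutoff: {estimate:.2f}")
```

The estimate is only credible if individuals cannot sort themselves onto one side of the cutoff, which is the manipulation problem the third part of the project addresses.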