Add support for debiased machine learning estimators?

Dear OHDSI community,

As mentioned in my introduction post, I would like to propose adding support for one or more debiased machine learning estimators.

My apologies if someone is already working on this; I did my best to search the documentation and couldn’t find any mention of these estimators. They go by different names across the literature, so it’s possible I overlooked them.

For those unfamiliar with the field of Causal Machine Learning (also known as Debiased Machine Learning), I’d like to offer a brief overview based on my current understanding.

Limitations of Traditional Statistical Methods

Traditional statistical approaches have already proven themselves for biomedical research, but certain limitations remain:

  • Model sensitivity: Many estimators rely heavily on correct model specification. Methods like Lasso for variable selection, for example, can exclude variables that appear uninformative due to collinearity, potentially leading to model misspecification and biased causal estimates.
  • Model inflexibility: Even generalized linear models assume a specific functional form (e.g., logit, probit), which may not align well with real-world data structures.

Limitations of Traditional Machine Learning

Machine learning has been revolutionary for predictive tasks but is not directly suited for causal inference:

  • Lack of counterfactual reasoning: We rarely observe counterfactual outcomes (what would have happened under a different treatment) for the same individuals. Unless treatment is randomized, the treated and untreated populations are not comparable, and ML models are not tuned to correct for this: they will overrepresent one group in their predictions, leading to biased outcome estimates. In short, machine learning excels at predicting outcomes, not at estimating the effect of an intervention or treatment.
  • Interpretability and inference: ML models often lack interpretable coefficients and do not provide standard errors or confidence intervals out-of-the-box.

Why Causal Machine Learning?

Causal machine learning offers an exciting synthesis of ideas from classical statistics, Bayesian reasoning, and machine learning. It is gaining traction in economics and is increasingly being applied in biomedical research. The primary goal is to estimate causal treatment effects from observational or randomized data in a way that is less sensitive to model misspecification and more robust to complex data structures.

One of the key innovations is the use of influence functions (from nonparametric theory), which correct for the plug-in bias introduced when machine learning models are used to estimate nuisance parameters such as the treatment assignment (propensity score) or outcome models. An influence function can be understood as a first-order Taylor approximation of how an estimator reacts to an infinitesimal perturbation of an individual observation. This correction is especially valuable in regions of poor overlap between treated and untreated groups (though a minimum of overlap remains necessary).
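
As a sketch of what this correction looks like for the average treatment effect: writing ψ = E[Y(1) − Y(0)], π(X) for the propensity score, and μₐ(X) = E[Y | A = a, X] for the outcome model, the (well-known) efficient influence function is:

```latex
\varphi(O) \;=\; \frac{A}{\pi(X)}\bigl(Y - \mu_1(X)\bigr)
\;-\; \frac{1-A}{1-\pi(X)}\bigl(Y - \mu_0(X)\bigr)
\;+\; \mu_1(X) - \mu_0(X) \;-\; \psi
```

The first two terms reweight the outcome-model residuals by the inverse propensity score; this is exactly the piece that removes the plug-in bias of a naive outcome-model-only estimate.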

Assumptions

While causal ML provides model-agnostic tools, it still relies on a few key assumptions, which need to be judged plausible in each application:

  • Consistency
  • Exchangeability (or conditional exchangeability)
  • Positivity
  • Non-interference (no spillover between units)
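
Together, consistency, (conditional) exchangeability, and positivity identify the counterfactual mean from observed data via the g-formula:

```latex
\mathbb{E}\bigl[Y(a)\bigr] \;=\; \mathbb{E}_X\Bigl[\,\mathbb{E}\bigl[Y \mid A = a,\, X\bigr]\Bigr]
```

So the average treatment effect E[Y(1)] − E[Y(0)] can in principle be computed from observed data alone; the estimators below are different ways of estimating this quantity well.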

Double Robust / Debiased Estimators of (Marginal or Conditional) Average Treatment Effects

Several debiased estimators have emerged from this framework, including:

  • TMLE (Targeted Maximum Likelihood Estimation)
  • AIPW (Augmented Inverse Probability Weighting)

These estimators are doubly robust: if either the propensity score model or the outcome model is correctly specified, the resulting causal effect estimate remains consistent. In addition, they yield valid standard errors and confidence intervals, making them valuable for both exploratory and confirmatory analyses.
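
To make this concrete, here is a minimal AIPW sketch on simulated data, assuming scikit-learn is available. Everything here (the data-generating process, the choice of gradient boosting for the nuisance models, the clipping threshold) is illustrative, not a recommendation:

```python
# Minimal AIPW (augmented inverse probability weighting) sketch on
# simulated confounded data -- illustration only, not production code.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 3))
# Treatment depends on X (confounding); the true ATE is 1.0 by construction.
A = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0])))
Y = 1.0 * A + X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=n)

# Propensity model, cross-fitted via out-of-fold predictions.
ps = cross_val_predict(GradientBoostingClassifier(), X, A, cv=5,
                       method="predict_proba")[:, 1]
ps = np.clip(ps, 0.01, 0.99)  # guard against extreme weights (positivity)

# Outcome models, one per treatment arm.
mu1 = GradientBoostingRegressor().fit(X[A == 1], Y[A == 1]).predict(X)
mu0 = GradientBoostingRegressor().fit(X[A == 0], Y[A == 0]).predict(X)

# AIPW score per observation; its mean is the ATE estimate, and its
# standard deviation gives a valid standard error.
phi = (A / ps) * (Y - mu1) - ((1 - A) / (1 - ps)) * (Y - mu0) + mu1 - mu0
ate = phi.mean()
se = phi.std(ddof=1) / np.sqrt(n)
print(f"ATE estimate: {ate:.2f} (SE {se:.2f})")
```

Note the double robustness in the score itself: if the outcome models are right, the residual terms vanish in expectation; if the propensity model is right, the weighted residuals correct any outcome-model error.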

Learning Treatment Effects: R-Learner vs. DR-Learner

Causal machine learning also provides powerful methods for understanding treatment heterogeneity:

  • The R-Learner is particularly useful when the goal is to predict individual treatment effects (ITEs), allowing us to answer the question: “What is the difference in expected outcome for a person with this specific set of covariates if they were treated versus not treated?” This is highly relevant for personalized treatment decisions.
  • The DR-Learner (Double Robust Learner) is well suited for exploring how treatment effects vary with respect to specific covariates, helping answer questions like: “How does the effect of this drug change across age groups or comorbidity levels?” This is valuable for population-level stratified insights.
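
A DR-Learner can be sketched in two stages: build AIPW pseudo-outcomes from the nuisance models, then regress them on the covariates to obtain a CATE function. The example below uses simulated data with a known heterogeneous effect; all modeling choices are again illustrative:

```python
# DR-learner sketch: regress AIPW pseudo-outcomes on covariates to
# estimate a conditional average treatment effect (CATE) function.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(1)
n = 4000
X = rng.normal(size=(n, 2))
tau = 1.0 + X[:, 0]                       # true CATE varies with covariate 1
A = rng.binomial(1, 1 / (1 + np.exp(-0.5 * X[:, 0])))
Y = tau * A + X[:, 0] + rng.normal(size=n)

# Stage 1: nuisance estimates (cross-fitted propensity + per-arm outcomes).
ps = np.clip(cross_val_predict(GradientBoostingClassifier(), X, A, cv=5,
                               method="predict_proba")[:, 1], 0.01, 0.99)
mu1 = GradientBoostingRegressor().fit(X[A == 1], Y[A == 1]).predict(X)
mu0 = GradientBoostingRegressor().fit(X[A == 0], Y[A == 0]).predict(X)

# Stage 2: AIPW pseudo-outcome, then an ordinary regression of it on X.
pseudo = ((A / ps) * (Y - mu1) - ((1 - A) / (1 - ps)) * (Y - mu0)
          + mu1 - mu0)
cate_model = GradientBoostingRegressor().fit(X, pseudo)

# The fitted CATE should track the true tau = 1 + X1.
corr = np.corrcoef(cate_model.predict(X), tau)[0, 1]
print(f"corr(estimated CATE, true CATE) = {corr:.2f}")
```

The final regression can be any interpretable or flexible learner, which is what makes the DR-Learner convenient for exploring how effects vary across subgroups.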

Mediation Analysis

Another important application of causal machine learning is mediation analysis, which seeks to understand the pathways through which a treatment or intervention affects an outcome. For instance, rather than simply estimating whether a drug improves recovery time, mediation analysis helps disentangle how much of that effect is direct and how much is mediated through intermediate variables like biomarker changes or patient behavior. Causal ML methods are particularly useful here because they allow for flexible modeling of complex relationships without assuming strict parametric forms, and can handle high-dimensional mediators using machine learning tools while still retaining valid causal interpretations under appropriate assumptions.


I would like feedback from the community on implementing one or more of these techniques. Although libraries exist in Python, the most mature implementations appear to be written in R.

Honestly, at this moment I wouldn’t know where to start; I still have to learn the basics of OHDSI, and I have just started reading the Book of OHDSI. But at the very least I would like to know whether this would be an interesting feature to implement. If so, I look forward to collaborating to make it a reality.

Kind regards,
Kenny Hombroeckx
