When observational data includes a treatment indicator and some possible confounders, it is very tempting to simply regress the outcome on all features (confounders and treatment alike), extract the coefficient associated with the treatment indicator, and proudly proclaim that “we have controlled for confounders and estimated the treatment effect”.
This approach is wrong. Very wrong. At least as wrong as that DIY electrical job you did last week: it looks all good and neat but you’ve made a critical mistake and there’s no way you can find out without killing yourself.
I can’t fully explain why this is wrong (I’m not sure I understand it myself), but I can show you some examples demonstrating that it is. We’ll work through a few cases, comparing the results of a traditional regression with those of a couple of legit causal inference libraries. Since we use simulated data, we’ll also be able to compare with the “true” treatment effect.
In all the following examples, $X$ will be an $N \times 10$ matrix of random covariates; $W$ will be a random binary treatment indicator (which may or may not depend on $X$); $T$ will be the treatment effect; $E$ will be the main effects; and $Y$ will be the outcome variable ($Y^{(0)}$ if untreated, $Y^{(1)}$ if treated), so that $Y \sim \mathcal{N}(TW + E, 1)$. Our task is to estimate $\mathbb{E}(Y^{(1)} - Y^{(0)})$, the conditional average treatment effect (CATE), and $\mathbb{E}(Y^{(1)} - Y^{(0)} \mid W = 1)$, the conditional average treatment effect on the treated (CATT).
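Note that with this data-generating process the individual treatment effect is exactly $T$: writing $Y^{(w)} = Tw + E + \varepsilon$ with $\varepsilon \sim \mathcal{N}(0, 1)$, we have

$$Y^{(1)} - Y^{(0)} = T, \qquad \text{CATE} = \mathbb{E}(T), \qquad \text{CATT} = \mathbb{E}(T \mid W = 1),$$

which is why, below, the “empirical” CATE and CATT are computed directly as mean(T) and mean(T[W]).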
For each example we’ll estimate the treatment effects using:
- the causal forests from the grf package;
- the double machine learning approach from the dmlmt package (the idea is sketched right after the setup code below);
- a random forest, using the ranger package, trained on the entire dataset;
- two random forests (again from ranger) trained separately on the treated units and on the untreated units;
- a linear regression.
We’ll begin with the examples given in the grf package’s documentation, but first we load some required packages and set the size of the problem:
library(grf)
devtools::install_github(repo = "MCKnaus/dmlmt")
library(dmlmt)
library(ranger)
library(purrr) # for rbernoulli, which returns a logical vector
N <- 20000 # number of observations
P <- 10 # number of covariates
set.seed(1984)
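Before diving into the cases, here is a rough sketch of what the “double machine learning” entry in that list does. This is not dmlmt’s actual implementation (the package has its own nuisance estimators and inference); it’s just a minimal illustration of the cross-fitted, doubly robust (AIPW) idea, using ranger for the nuisance models, and the function dml_ate and everything inside it are my own names:

dml_ate <- function(X, W, Y, folds = 2) {
  # Cross-fitted AIPW estimate of the average treatment effect.
  n <- length(Y)
  fold <- sample(rep_len(seq_len(folds), n)) # random fold assignment
  psi <- numeric(n)
  d_e <- data.frame(X, W = as.numeric(W)) # data for the propensity model
  d_y <- data.frame(X, Y)                 # data for the outcome models
  for (k in seq_len(folds)) {
    tr <- fold != k # training folds
    te <- fold == k # held-out fold
    # Nuisance models are fit on the training folds only:
    e_model <- ranger(W ~ ., data = d_e[tr, ])  # propensity e(x) = P(W = 1 | X = x)
    m1 <- ranger(Y ~ ., data = d_y[tr & W, ])   # outcome model on the treated
    m0 <- ranger(Y ~ ., data = d_y[tr & !W, ])  # outcome model on the untreated
    e <- predict(e_model, d_e[te, ])$predictions
    e <- pmin(pmax(e, 0.01), 0.99) # trim propensities to avoid exploding weights
    mu1 <- predict(m1, d_y[te, ])$predictions
    mu0 <- predict(m0, d_y[te, ])$predictions
    w <- as.numeric(W[te])
    y <- Y[te]
    # Doubly robust score; its mean estimates the treatment effect:
    psi[te] <- mu1 - mu0 + w * (y - mu1) / e - (1 - w) * (y - mu0) / (1 - e)
  }
  mean(psi)
}

The real packages add a lot on top of this (better nuisance estimators, valid standard errors, and so on), but the cross-fitting and the doubly robust score are the common core.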
Case 1: example from causal_forest’s help page
This first example is the one given in causal_forest’s help page, where the treatment assignment $W$ is completely randomized and the outcome only depends on $X_1$, $X_2$, and $X_3$:
X <- matrix(
  rnorm(N * P),
  nrow = N,
  ncol = P,
  dimnames = list(NULL, paste0('X', 1:P))
)
W <- rbernoulli(N, p = 0.5)
T <- pmax(X[, 1], 0)
E <- X[, 2] + pmin(X[, 3], 0)
Y <- T * W + E + rnorm(N)
The theoretical CATE is identical to the theoretical CATT, and both are given by the expected positive part of a standard normal variable, which is just $1 / \sqrt{2\pi}$.
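To see where that constant comes from: if $Z \sim \mathcal{N}(0, 1)$ with density $\varphi$, then, since $\varphi'(z) = -z\,\varphi(z)$,

$$\mathbb{E}[\max(Z, 0)] = \int_0^\infty z\,\varphi(z)\,dz = \big[-\varphi(z)\big]_0^\infty = \varphi(0) = \frac{1}{\sqrt{2\pi}}.$$

In R: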
1 / (sqrt(2 * pi))
## [1] 0.3989423
The empirical CATE and CATT agree very well with the theoretical value:
mean(T) # empirical CATE
## [1] 0.4012137
mean(T[W]) # empirical CATT
## [1] 0.3995248
Let’s see now how well the causal models recover the treatment effects. First the causal forest:
c.forest <- causal_forest(X, Y, W)
average_treatment_effect(c.forest, target.sample = 'all')
## estimate std.err
## 0.39445852 0.01440154
average_treatment_effect(c.forest, target.sample = 'treated')
## estimate std.err
## 0.39432334 0.01440661
Pretty good. Next the Double Machine Learning approach:
invisible(dmlmt(X, W, Y))
##
## Binary treatment
##
##
##
## Potential outcomes:
## PO SE
## Treatment 0 -0.3999623 0.0131
## Treatment 1 -0.0077989 0.0139
##
## Average effects
## TE SE t p
## T1 - T0 0.392163 0.015554 25.214 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## # of obs on / off support: 19997 / 3
Also very good. Let’s see now how a traditional regressor performs. There are two approaches. In the first one, we train a single model $\hat{f}$ on the entire dataset, such that $\hat{f}(X, W)$ estimates the outcome for covariates $X$ given treatment assignment $W$; the CATE is then estimated by averaging $\hat{f}(X, 1) - \hat{f}(X, 0)$ over all units.
ranger.model <- ranger(Y ~ ., data = data.frame(X, W, Y))
ranger_cate <- function(ranger.model, X) {
  # Predict each unit's outcome with W forced to 0 and then to 1;
  # the average difference is the estimated CATE.
  data_untreated <- data.frame(X, W = 0)
  data_treated <- data.frame(X, W = 1)
  mean(predict(ranger.model, data_treated)$predictions -
         predict(ranger.model, data_untreated)$predictions)
}
ranger_cate(ranger.model, X)
## [1] 0.3480178
Really bad. Let’s see if another approach might work better. In this second approach, we train two models: $\hat{f}_{(1)}$ on the treated units and $\hat{f}_{(0)}$ on the untreated units. The CATE is then estimated by averaging $\hat{f}_{(1)}(X) - \hat{f}_{(0)}(X)$ over all units:
model_treated <- ranger(Y ~ ., data = data.frame(X, W, Y)[W, ])
model_untreated <- ranger(Y ~ ., data = data.frame(X, W, Y)[!W, ])
ranger_cate_two_models <- function(model_treated, model_untreated, X) {
  # Predict everyone's treated outcome with the treated-units model and
  # everyone's untreated outcome with the untreated-units model.
  data_untreated <- data.frame(X, W = 0)
  data_treated <- data.frame(X, W = 1)
  mean(predict(model_treated, data_treated)$predictions -
         predict(model_untreated, data_untreated)$predictions)
}
ranger_cate_two_models(model_treated, model_untreated, X)
## [1] 0.3945442
Much better. And finally a simple linear regression:
lm(Y ~ ., data.frame(X, W, Y))
##
## Call:
## lm(formula = Y ~ ., data = data.frame(X, W, Y))
##
## Coefficients:
## (Intercept) X1 X2 X3 X4 X5
## -0.396914 0.252459 0.991225 0.501298 0.004950 0.002065
## X6 X7 X8 X9 X10 WTRUE
## -0.002804 -0.011666 0.004563 0.010666 0.007237 0.392153
Also a very good estimate, which is what you’d expect on what is essentially a randomized trial.
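In fact, since assignment is completely random here, even a raw difference in means gets the right answer (a point that comes up again in the comments below):

mean(Y[W]) - mean(Y[!W])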
So on this first dataset, with no confounding, the “proper” causal models and the linear regression all do well, while a traditional random forest only keeps up if we train two separate models on the treated and the untreated units. Let’s now look at the other examples.
Case 2: example with confounding from causal_forest’s home page
The home page of the grf package has an example slightly different from the one above, in which the treatment assignment $W$ depends on $X_1$:
X <- matrix(
  rnorm(N * P),
  nrow = N,
  ncol = P,
  dimnames = list(NULL, paste0('X', 1:P))
)
W <- rbernoulli(N, 0.4 + 0.2 * (X[, 1] > 0))
T <- pmax(X[, 1], 0)
E <- X[, 2] + pmin(X[, 3], 0)
Y <- T * W + E + rnorm(N)
The theoretical CATE is the same as above, but the theoretical CATT is slightly higher, since units with $X_1 > 0$ (and hence a positive treatment effect) are more likely to be treated. Here is a short derivation of the resulting $6/5$ factor.
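With propensity $e(x_1) = \mathbb{P}(W = 1 \mid X_1 = x_1) = 0.4 + 0.2 \cdot \mathbb{1}(x_1 > 0)$ we have $\mathbb{E}[e(X_1)] = 0.5$, and since $T = \max(X_1, 0)$ vanishes unless $X_1 > 0$ (where $e = 0.6$),

$$\text{CATT} = \mathbb{E}(T \mid W = 1) = \frac{\mathbb{E}[T \, e(X_1)]}{\mathbb{E}[e(X_1)]} = \frac{0.6 \cdot \mathbb{E}[\max(X_1, 0)]}{0.5} = \frac{6}{5} \cdot \frac{1}{\sqrt{2\pi}}.$$

Numerically: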
# Theoretical CATE
1 / sqrt(2*pi)
## [1] 0.3989423
# Empirical CATE
mean(T)
## [1] 0.3933349
# Theoretical CATT
1 / sqrt(2*pi) * 6 / 5
## [1] 0.4787307
# Empirical CATT
mean(T[W])
## [1] 0.4700207
As above, let’s run the five models. First the causal forest:
c.forest <- causal_forest(X, Y, W)
average_treatment_effect(c.forest, target.sample = 'all')
## estimate std.err
## 0.38286619 0.01456181
average_treatment_effect(c.forest, target.sample = 'treated')
## estimate std.err
## 0.45825280 0.01486095
Excellent agreement with the ground truth. Next dmlmt:
invisible(dmlmt(X, W, Y))
##
## Binary treatment
##
##
##
## Potential outcomes:
## PO SE
## Treatment 0 -0.400063 0.0132
## Treatment 1 -0.012876 0.0139
##
## Average effects
## TE SE t p
## T1 - T0 0.387187 0.015584 24.846 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## # of obs on / off support: 19998 / 2
Also a very good agreement; note however that dmlmt can only estimate the CATE, not the CATT.
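(If you really want a CATT from this kind of estimator, the doubly robust score can be reweighted: in my dml_ate sketch above, replacing the psi line with

psi[te] <- (w * (y - mu0) - (1 - w) * e / (1 - e) * (y - mu0)) / mean(as.numeric(W))

gives the standard doubly robust estimator of the effect on the treated; whether dmlmt exposes something similar, I don’t know.) Let’s see next the one-model regressor: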
ranger.model <- ranger(Y ~ ., data = data.frame(X, W, Y))
ranger_cate(ranger.model, X)
## [1] 0.3420549
Just as in the previous case, this is not too good. Next the two-model regressors:
model_treated <- ranger(Y ~ ., data = data.frame(X, W, Y)[W, ])
model_untreated <- ranger(Y ~ ., data = data.frame(X, W, Y)[!W, ])
ranger_cate_two_models(model_treated, model_untreated, X)
## [1] 0.3885038
And the linear regression:
lm(Y ~ ., data.frame(X, W, Y))
##
## Call:
## lm(formula = Y ~ ., data = data.frame(X, W, Y))
##
## Coefficients:
## (Intercept) X1 X2 X3 X4 X5
## -0.3553287 0.2582300 0.9965371 0.5002037 0.0002953 -0.0066890
## X6 X7 X8 X9 X10 WTRUE
## 0.0058448 0.0001589 -0.0039690 -0.0037644 0.0059526 0.3778513
This, again, is not too bad, especially compared with the one-model regressor.
The previous two examples had relatively simple treatment and main effects. How well do these models perform in more complex situations? To see this I’m going to run them through some of the stress-tests given in section 6 of Nie and Wager (2020).
Case 3: no confounding, non-trivial main effects
In this case the treatment assignment is random and we’re essentially running a randomized trial, but with complex main effects:
X <- matrix(
  rnorm(N * P),
  nrow = N,
  ncol = P,
  dimnames = list(NULL, paste0('X', 1:P))
)
W <- rbernoulli(N, p = 0.5)
T <- X[, 1] + log(1 + exp(X[, 2]))
E <- pmax(X[, 1] + X[, 2], X[, 3], 0) + pmax(X[, 4] + X[, 5], 0)
Y <- T * W + E + rnorm(N)
The empirical CATE and CATT are very close, since there’s no confounding:
mean(T)
## [1] 0.804748
mean(T[W])
## [1] 0.8047635
Here are the estimates using a causal forest:
c.forest <- causal_forest(X, Y, W)
average_treatment_effect(c.forest, target.sample = 'all')
## estimate std.err
## 0.80974803 0.01518978
average_treatment_effect(c.forest, target.sample = 'treated')
## estimate std.err
## 0.80902777 0.01520083
Next dmlmt:
invisible(dmlmt(X, W, Y))
##
## Binary treatment
##
##
##
## Potential outcomes:
## PO SE
## Treatment 0 1.3958 0.0139
## Treatment 1 2.2089 0.0179
##
## Average effects
## TE SE t p
## T1 - T0 0.81307 0.01876 43.34 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## # of obs on / off support: 19998 / 2
Next the one-model regressor:
ranger.model <- ranger(Y ~ ., data = data.frame(X, W, Y))
ranger_cate(ranger.model, X)
## [1] 0.7359536
Next the two-model regressors:
model_treated <- ranger(Y ~ ., data = data.frame(X, W, Y)[W, ])
model_untreated <- ranger(Y ~ ., data = data.frame(X, W, Y)[!W, ])
ranger_cate_two_models(model_treated, model_untreated, X)
## [1] 0.8168369
And finally the linear regression:
lm(Y ~ ., data.frame(X, W, Y))
##
## Call:
## lm(formula = Y ~ ., data = data.frame(X, W, Y))
##
## Coefficients:
## (Intercept) X1 X2 X3 X4 X5
## 1.396381 0.911487 0.674177 0.356589 0.494779 0.513076
## X6 X7 X8 X9 X10 WTRUE
## -0.001450 -0.003947 0.006237 0.001400 0.010172 0.812840
All models, except the one-model random forest, perform rather well on this dataset with no confounders.
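One possible explanation for the one-model forest’s persistent underestimation, though I wouldn’t swear by it: the forest has to discover the effect of W by splitting on it, and when the main effects dominate, it may rarely do so, smoothing the treatment effect away. ranger’s impurity-based variable importance gives a quick way to check how much weight the forest puts on W:

# Re-train the single forest, this time recording variable importance
ranger.imp <- ranger(Y ~ ., data = data.frame(X, W, Y), importance = 'impurity')
sort(ranger.imp$variable.importance, decreasing = TRUE)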
Case 4: confounding, non-trivial main effects
In this example, also taken from the Nie and Wager stress tests, the covariates are uniformly rather than normally distributed, and the treatment assignment depends on $X_1$ and $X_2$ through the same term that appears in the main effects:
X <- matrix(
  runif(N * P), # note the uniform distribution
  nrow = N,
  ncol = P,
  dimnames = list(NULL, paste0('X', 1:P))
)
trim <- function(x, eta) pmax(eta, pmin(x, 1 - eta))
W <- rbernoulli(N, trim(sinpi(X[, 1] * X[, 2]), 0.1))
T <- (X[, 1] + X[, 2]) / 2
E <- sinpi(X[, 1] * X[, 2]) + 2 * (X[, 3] - 0.5)^2 + X[, 4] + 0.5 * X[, 5]
Y <- T * W + E + rnorm(N)
Here are the empirical CATE and CATT:
mean(T)
## [1] 0.5006082
mean(T[W])
## [1] 0.5969957
Here are the estimates using a causal forest:
c.forest <- causal_forest(X, Y, W)
average_treatment_effect(c.forest, target.sample = 'all')
## estimate std.err
## 0.53728132 0.01858506
average_treatment_effect(c.forest, target.sample = 'treated')
## estimate std.err
## 0.63199708 0.02226546
Not too bad; perhaps a bit biased on the CATT estimate. Next dmlmt:
invisible(dmlmt(X, W, Y))
##
## Binary treatment
##
##
##
## Potential outcomes:
## PO SE
## Treatment 0 1.2626 0.0182
## Treatment 1 1.9945 0.0130
##
## Average effects
## TE SE t p
## T1 - T0 0.731896 0.022071 33.161 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## # of obs on / off support: 19958 / 42
Here the estimate is way too high. Let’s see now the one-model regressor:
ranger.model <- ranger(Y ~ ., data = data.frame(X, W, Y))
ranger_cate(ranger.model, X)
## [1] 0.5871552
Next the two-model regressors:
model_treated <- ranger(Y ~ ., data = data.frame(X, W, Y)[W, ])
model_untreated <- ranger(Y ~ ., data = data.frame(X, W, Y)[!W, ])
ranger_cate_two_models(model_treated, model_untreated, X)
## [1] 0.6317095
And finally the linear regression:
lm(Y ~ ., data.frame(X, W, Y))
##
## Call:
## lm(formula = Y ~ ., data = data.frame(X, W, Y))
##
## Coefficients:
## (Intercept) X1 X2 X3 X4 X5
## -0.193808 0.805531 0.822270 -0.053563 1.031979 0.512983
## X6 X7 X8 X9 X10 WTRUE
## -0.004783 -0.041369 0.019418 0.044691 0.010070 0.705682
Except for the causal forest, all models tend to overestimate the treatment effects.
In the final example, we have complex confounding and non-trivial main effects, but a trivial treatment effect.
Case 5: confounding, trivial treatment effect, non-trivial main effects
X <- matrix(
  rnorm(N * P),
  nrow = N,
  ncol = P,
  dimnames = list(NULL, paste0('X', 1:P))
)
W <- rbernoulli(N, 1 / (1 + exp(X[, 2] + X[, 3])))
T <- 1
E <- 2 * log(1 + exp(X[, 1] + X[, 2] + X[, 3]))
Y <- T * W + E + rnorm(N)
Since the treatment effect is constant, we have CATE = CATT = 1. Let’s see how the causal forest performs:
c.forest <- causal_forest(X, Y, W)
average_treatment_effect(c.forest, target.sample = 'all')
## estimate std.err
## 0.94210993 0.01783883
average_treatment_effect(c.forest, target.sample = 'treated')
## estimate std.err
## 0.94966382 0.02062717
Pretty good. Next dmlmt:
invisible(dmlmt(X, W, Y))
##
## Binary treatment
##
##
##
## Potential outcomes:
## PO SE
## Treatment 0 1.9719 0.0254
## Treatment 1 2.9588 0.0273
##
## Average effects
## TE SE t p
## T1 - T0 0.986893 0.032839 30.053 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## # of obs on / off support: 19855 / 145
Very good estimates too. How about the one-model regressor?
ranger.model <- ranger(Y ~ ., data = data.frame(X, W, Y))
ranger_cate(ranger.model, X)
## [1] 0.613593
Oops, not that good. Perhaps the two-model regressor does better?
model_treated <- ranger(Y ~ ., data = data.frame(X, W, Y)[W, ])
model_untreated <- ranger(Y ~ ., data = data.frame(X, W, Y)[!W, ])
ranger_cate_two_models(model_treated, model_untreated, X)
## [1] 0.7121187
Disaster. And finally the linear regression:
lm(Y ~ ., data.frame(X, W, Y))
##
## Call:
## lm(formula = Y ~ ., data = data.frame(X, W, Y))
##
## Coefficients:
## (Intercept) X1 X2 X3 X4 X5
## 1.9846267 1.0057652 0.9917930 1.0018414 0.0025015 -0.0173148
## X6 X7 X8 X9 X10 WTRUE
## -0.0007228 -0.0001395 0.0019996 -0.0055046 -0.0025001 0.9904264
Much better.
Conclusion
The following table summarizes the empirical CATE in each case, together with the CATE estimated by each algorithm (RF1 is the single random forest, RF2 the pair of random forests trained separately on treated and untreated units, LR the linear regression):
Case | CATE | GRF | DMLMT | RF1 | RF2 | LR |
---|---|---|---|---|---|---|
1 | 0.40 | 0.39 | 0.39 | 0.35 | 0.39 | 0.39 |
2 | 0.39 | 0.38 | 0.39 | 0.34 | 0.39 | 0.38 |
3 | 0.80 | 0.81 | 0.81 | 0.74 | 0.82 | 0.81 |
4 | 0.50 | 0.54 | 0.73 | 0.59 | 0.63 | 0.71 |
5 | 1.00 | 0.94 | 0.99 | 0.61 | 0.71 | 0.99 |
The causal forest from the grf package is the only method that stays close to the truth in all five cases; like I said at the beginning, I’m not entirely sure why. But this quick study should alert you to the fact that causal studies are hard, because unlike traditional regression problems, here you have no ground truth against which to cross-validate your model. Your regression runs just fine, you can p-value your results all the way to statistical hell, but the fact remains that you just don’t know what you are doing.
I would love to hear from anyone who can explain in simple terms why we see such a variety of estimation accuracies.
Comments
Why not try linear regression?
```
Call:
lm(formula = Y ~ ., data = data.frame(X, W, Y))
Residuals:
Min 1Q Median 3Q Max
-3.8638 -0.7315 -0.0027 0.7357 4.7721
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.396914 0.010989 -36.118 <2e-16 ***
X1 0.252459 0.007790 32.409 <2e-16 ***
X2 0.991225 0.007825 126.681 <2e-16 ***
X3 0.501298 0.007791 64.342 <2e-16 ***
X4 0.004950 0.007792 0.635 0.525
X5 0.002065 0.007732 0.267 0.789
X6 -0.002804 0.007757 -0.361 0.718
X7 -0.011666 0.007726 -1.510 0.131
X8 0.004563 0.007748 0.589 0.556
X9 0.010666 0.007661 1.392 0.164
X10 0.007237 0.007786 0.929 0.353
WTRUE 0.392153 0.015558 25.206 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.1 on 19988 degrees of freedom
Multiple R-squared: 0.5241, Adjusted R-squared: 0.5238
F-statistic: 2001 on 11 and 19988 DF, p-value: < 2.2e-16
```
Linear regression will indeed work… sometimes. It works fine in cases 1, 3, and 5, but not in cases 2 and 4. I’ll update the table with the results from a linear regression.
I suspect it works fine in cases 1 and 3 because there’s no confounding: these are essentially randomized trials. In fact, you can just as well get the CATE with
mean(Y[W]) - mean(Y[!W])
But my point is that in cases 2 and 4 your results will be wrong, even though you have “controlled” for all confounders and even though there’s nothing obviously wrong with your regression.