Causal Effect Estimation¶
This notebook serves as a demonstration of the CausalEffectEstimation
module for estimating causal effects on both generated and real datasets.
import pyAgrum as gum
import pyAgrum.lib.discretizer as disc
import pyAgrum.lib.notebook as gnb
import pyAgrum.lib.explain as gexpl
import pyAgrum.causal as csl
import pyAgrum.causal.notebook as cslnb
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
1. ACE estimations from generated RCT data¶
To estimate the Average Causal Effect (ACE) from data generated under conditions resembling a Randomized Controlled Trial (RCT), we define two generative models.
1.1 Dataset¶
Consider two generative models:
A linear generative model described by the equation: $$ Y = 3X_1 + 2X_2 -2X_3 -0.8X_4 + T(2X_1 + 5X_3 +3X_4) $$
And a non-linear generative model described by the equation: $$ Y = 3X_1 + 2X_2^2 -2X_3 -0.8X_4 +10T $$
where $ (X_1,X_2,X_3,X_4) \sim \mathcal{N}_4((1,1,1,1), I_4) $, $T \sim \text{Bernoulli}(1/2)$, and $ (X_1,X_2,X_3,X_4,T) $ are jointly independent in both models.
Data from the models can be obtained using the functions given below.
def linear_simulation(n: int = 100000, sigma: float = 1) -> pd.DataFrame:
    """
    Returns n observations from the linear model with normally distributed
    noise with expected value 0 and standard deviation sigma.
    """
    X1 = np.random.normal(1, 1, n)
    X2 = np.random.normal(1, 1, n)
    X3 = np.random.normal(1, 1, n)
    X4 = np.random.normal(1, 1, n)
    epsilon = np.random.normal(0, sigma, n)
    T = np.random.binomial(1, 0.5, n)
    Y = 3*X1 + 2*X2 - 2*X3 - 0.8*X4 + T*(2*X1 + 5*X3 + 3*X4) + epsilon
    d = np.array([T, X1, X2, X3, X4, Y])
    df_data = pd.DataFrame(data=d.T, columns=['T', 'X1', 'X2', 'X3', 'X4', 'Y'])
    df_data["T"] = df_data["T"].astype(int)
    return df_data
def non_linear_simulation(n: int = 100000, sigma: float = 1) -> pd.DataFrame:
    """
    Returns n observations from the non-linear model with normally distributed
    noise with expected value 0 and standard deviation sigma.
    """
    X1 = np.random.normal(1, 1, n)
    X2 = np.random.normal(1, 1, n)
    X3 = np.random.normal(1, 1, n)
    X4 = np.random.normal(1, 1, n)
    epsilon = np.random.normal(0, sigma, n)
    T = np.random.binomial(1, 0.5, n)
    Y = 3*X1 + 2*X2**2 - 2*X3 - 0.8*X4 + 10*T + epsilon
    d = np.array([T, X1, X2, X3, X4, Y])
    df_data = pd.DataFrame(data=d.T, columns=['T', 'X1', 'X2', 'X3', 'X4', 'Y'])
    df_data["T"] = df_data["T"].astype(int)
    return df_data
Furthermore, the expected values of $Y(0)$ and $Y(1)$ can be calculated explicitly, giving a theoretical ACE of $\tau = 10$ that will serve as the point of reference for the estimations.
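Indeed, in the linear model the treatment multiplies $2X_1 + 5X_3 + 3X_4$ and every covariate has unit mean, so
$$ \tau = \mathbb{E}[Y(1) - Y(0)] = \mathbb{E}[2X_1 + 5X_3 + 3X_4] = 2 + 5 + 3 = 10, $$
while in the non-linear model the treatment enters additively as $10T$, giving $\tau = 10$ as well.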
We will explore how the CausalEffectEstimation
module can estimate the causal effect of $T$ on $Y$ in both of the generated datasets.
1.2 Setup¶
First, generate the data using the previously defined functions, and specify the causal graph of the variables. A single graph applies to both datasets, as they share the same variables and causal structure.
linear_data = linear_simulation()
non_linear_data = non_linear_simulation()
bn = gum.fastBN("Y<-T; Y<-X1; Y<-X2; Y<-X3; Y<-X4")
causal_model = csl.CausalModel(bn)
cslnb.showCausalModel(causal_model, size="10")
If the CausalBNEstimator
estimation method is not employed, the model generated using gum.fastBN
will suffice, as only the graph structure of the Causal Bayesian Network is used for causal identification.
However, if the Causal Bayesian Network estimator is used, a csl.CausalModel
object with an appropriate discretization must be provided, since the Conditional Probability Tables of the model are used for estimation. Here we use the discretizer
module to perform this task; the arcs are added manually.
Selecting an optimal discretization is crucial: a coarse discretization may lead to poor estimation due to its inability to capture fine variations in the distribution, while an overly fine discretization may result in too many parameters, making it difficult for the parameter learning algorithm to accurately estimate the distribution. Therefore, the discretization should strike a balance, being neither too coarse nor too fine.
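For instance, one could keep a coarse default while refining only the wide-ranged outcome variable. A minimal sketch (the name tuned_discretizer and the 40-bin choice are illustrative; setDiscretizationParameters is the per-variable override also used in Section 2.2; the notebook itself keeps the uniform 20-bin default below):
tuned_discretizer = disc.Discretizer(
    defaultDiscretizationMethod="uniform",
    defaultNumberOfBins=20
)
# illustrative per-variable override: a finer grid for the wide-ranged outcome Y
tuned_discretizer.setDiscretizationParameters("Y", "uniform", 40)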
discretizer = disc.Discretizer(
    defaultDiscretizationMethod="uniform",
    defaultNumberOfBins=20
)
disc_bn = discretizer.discretizedTemplate(linear_data)
disc_bn.beginTopologyTransformation()
for node in {"T", "X1", "X2", "X3", "X4"}:
    disc_bn.addArc(node, "Y")
disc_bn.endTopologyTransformation()
disc_causal_model = csl.CausalModel(disc_bn)
for id, _ in disc_bn:
    print(disc_bn.variable(id))
T:Range([0,1])
X1:Discretized(<(-3.14718;-2.72853[,[-2.72853;-2.30989[,[-2.30989;-1.89124[,[-1.89124;-1.4726[,[-1.4726;-1.05395[,[-1.05395;-0.635307[,[-0.635307;-0.216662[,[-0.216662;0.201983[,[0.201983;0.620628[,[0.620628;1.03927[,[1.03927;1.45792[,[1.45792;1.87656[,[1.87656;2.29521[,[2.29521;2.71385[,[2.71385;3.1325[,[3.1325;3.55114[,[3.55114;3.96979[,[3.96979;4.38843[,[4.38843;4.80708[,[4.80708;5.22572)>)
X2:Discretized(<(-3.81577;-3.37564[,[-3.37564;-2.9355[,[-2.9355;-2.49537[,[-2.49537;-2.05523[,[-2.05523;-1.6151[,[-1.6151;-1.17496[,[-1.17496;-0.734826[,[-0.734826;-0.294691[,[-0.294691;0.145445[,[0.145445;0.58558[,[0.58558;1.02572[,[1.02572;1.46585[,[1.46585;1.90599[,[1.90599;2.34612[,[2.34612;2.78626[,[2.78626;3.22639[,[3.22639;3.66653[,[3.66653;4.10666[,[4.10666;4.5468[,[4.5468;4.98693)>)
X3:Discretized(<(-3.60739;-3.14829[,[-3.14829;-2.68918[,[-2.68918;-2.23008[,[-2.23008;-1.77098[,[-1.77098;-1.31187[,[-1.31187;-0.852771[,[-0.852771;-0.393668[,[-0.393668;0.0654355[,[0.0654355;0.524539[,[0.524539;0.983642[,[0.983642;1.44275[,[1.44275;1.90185[,[1.90185;2.36095[,[2.36095;2.82006[,[2.82006;3.27916[,[3.27916;3.73826[,[3.73826;4.19737[,[4.19737;4.65647[,[4.65647;5.11557[,[5.11557;5.57468)>)
X4:Discretized(<(-3.05233;-2.62422[,[-2.62422;-2.1961[,[-2.1961;-1.76799[,[-1.76799;-1.33987[,[-1.33987;-0.91176[,[-0.91176;-0.483645[,[-0.483645;-0.0555308[,[-0.0555308;0.372584[,[0.372584;0.800698[,[0.800698;1.22881[,[1.22881;1.65693[,[1.65693;2.08504[,[2.08504;2.51316[,[2.51316;2.94127[,[2.94127;3.36939[,[3.36939;3.7975[,[3.7975;4.22562[,[4.22562;4.65373[,[4.65373;5.08184[,[5.08184;5.50996)>)
Y:Discretized(<(-15.324;-12.5206[,[-12.5206;-9.71726[,[-9.71726;-6.91389[,[-6.91389;-4.11051[,[-4.11051;-1.30714[,[-1.30714;1.49624[,[1.49624;4.29961[,[4.29961;7.10299[,[7.10299;9.90637[,[9.90637;12.7097[,[12.7097;15.5131[,[15.5131;18.3165[,[18.3165;21.1199[,[21.1199;23.9232[,[23.9232;26.7266[,[26.7266;29.53[,[29.53;32.3334[,[32.3334;35.1367[,[35.1367;37.9401[,[37.9401;40.7435)>)
We are now prepared to instantiate the CausalEffectEstimation
object for both datasets.
linear_cee = csl.CausalEffectEstimation(linear_data, disc_causal_model)
print(linear_cee)
<pyAgrum.causal.causalEffectEstimation._CausalEffectEstimation.CausalEffectEstimation object at 0x1a91b6a8a50>
Dataframe : <pandas.core.frame.DataFrame object at 0x1a9191eab10>
- shape : (100000, 6)
- columns : Index(['T', 'X1', 'X2', 'X3', 'X4', 'Y'], dtype='object')
- memory usage : 4.800132 MB
Causal Model : <pyAgrum.causal._CausalModel.CausalModel object at 0x000001A91B694090>
- names : {0: 'T', 1: 'X1', 2: 'X2', 3: 'X3', 4: 'X4', 5: 'Y'}
- causal BN : BN{nodes: 6, arcs: 5, domainSize: 10^6.80618, dim: 6080077, mem: 48Mo 848Ko 656o}
- observ. BN : BN{nodes: 6, arcs: 5, domainSize: 10^6.80618, dim: 6080077, mem: 48Mo 848Ko 656o}
non_linear_cee = csl.CausalEffectEstimation(non_linear_data, disc_causal_model)
print(non_linear_cee)
<pyAgrum.causal.causalEffectEstimation._CausalEffectEstimation.CausalEffectEstimation object at 0x1a91b6cc250>
Dataframe : <pandas.core.frame.DataFrame object at 0x1a918810590>
- shape : (100000, 6)
- columns : Index(['T', 'X1', 'X2', 'X3', 'X4', 'Y'], dtype='object')
- memory usage : 4.800132 MB
Causal Model : <pyAgrum.causal._CausalModel.CausalModel object at 0x000001A91B694090>
- names : {0: 'T', 1: 'X1', 2: 'X2', 3: 'X3', 4: 'X4', 5: 'Y'}
- causal BN : BN{nodes: 6, arcs: 5, domainSize: 10^6.80618, dim: 6080077, mem: 48Mo 848Ko 656o}
- observ. BN : BN{nodes: 6, arcs: 5, domainSize: 10^6.80618, dim: 6080077, mem: 48Mo 848Ko 656o}
1.3 Causal Identification¶
The subsequent step involves identifying the causal criterion to be used for estimation. This is crucial, as most estimators rely on strong assumptions regarding the underlying causal structure of the data-generating process. Incorrect specification of the adjustment may compromise the guarantee of asymptotic normality.
linear_cee.identifyAdjustmentSet(intervention="T", outcome="Y")
non_linear_cee.identifyAdjustmentSet(intervention="T", outcome="Y", verbose=False)
Randomized Controlled Trial adjustment found.
Supported estimators include:
- CausalModelEstimator
- DM
If the outcome variable is a cause of other covariates in the causal graph, Backdoor estimators may also be used.
'Randomized Controlled Trial'
Consistent with the data generation, the identified adjustment is the Randomized Controlled Trial adjustment. The identification also lists the estimators supported under this adjustment.
1.4 Causal Effect Estimation¶
Once the adjustment is identified, we can proceed to estimation using the supported estimators. First, the estimator must be fitted to the data.
linear_cee.fitDM()
non_linear_cee.fitDM()
The estimation is obtained by calling the estimateCausalEffect
method.
linear_tau_hat = linear_cee.estimateCausalEffect()
non_linear_tau_hat = non_linear_cee.estimateCausalEffect()
print(f"ACE linear = {linear_tau_hat}, MAPE = {abs((linear_tau_hat-10)*10)} %")
print(f"ACE non linear = {non_linear_tau_hat}, MAPE = {abs((non_linear_tau_hat-10)*10)} %")
ACE linear = 10.022196692897303, MAPE = 0.22196692897303194 %
ACE non linear = 10.047761608442126, MAPE = 0.477616084421264 %
The difference-in-means estimator, which is the simplest estimator for the ACE, yields mostly accurate results. This is expected in an RCT environment where the treatment is independent of confounders, making intervention equivalent to observation.
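For intuition, the DM estimate can be reproduced by hand as the raw difference of the group means (a sketch, assuming fitDM implements the classical difference-in-means on the treatment column):
# difference-in-means computed directly from the simulated data
dm_by_hand = (linear_data.loc[linear_data["T"] == 1, "Y"].mean()
              - linear_data.loc[linear_data["T"] == 0, "Y"].mean())
print(f"DM by hand = {dm_by_hand}")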
1.5 User specified adjustment¶
If the user wishes to apply an alternative adjustment, they may specify their own set of variables for each component of the adjustment. However, note that such custom adjustments do not guarantee unbiased estimation and may not ensure an error-free estimation process.
linear_cee.useBackdoorAdjustment(intervention="T", outcome="Y", confounders={"X1", "X2", "X3", "X4"})
non_linear_cee.useBackdoorAdjustment(intervention="T", outcome="Y", confounders={"X1", "X2", "X3", "X4"})
linear_cee.fitSLearner()
non_linear_cee.fitSLearner()
linear_tau_hat = linear_cee.estimateCausalEffect()
non_linear_tau_hat = non_linear_cee.estimateCausalEffect()
print(f"ACE linear = {linear_tau_hat}, MAPE = {abs((linear_tau_hat-10)*10)} %")
print(f"ACE non linear = {non_linear_tau_hat}, MAPE = {abs((non_linear_tau_hat-10)*10)} %")
ACE linear = 9.993396025658598, MAPE = 0.06603974341402363 %
ACE non linear = 10.010557707228799, MAPE = 0.10557707228798563 %
In this case, the RCT adjustment is also compatible with backdoor estimators, so we again obtain accurate estimates. Let's see how the estimation behaves if we specify an incompatible adjustment.
1.6 Incorrectly specified adjustment¶
We will use the frontdoor adjustment to illustrate the behaviour of an incorrectly specified adjustment.
linear_cee.useFrontdoorAdjustment(intervention="T", outcome="Y", mediators={"X1", "X2", "X3", "X4"})
non_linear_cee.useFrontdoorAdjustment(intervention="T", outcome="Y", mediators={"X1", "X2", "X3", "X4"})
linear_cee.fitSimplePlugIn()
non_linear_cee.fitSimplePlugIn()
linear_tau_hat = linear_cee.estimateCausalEffect()
non_linear_tau_hat = non_linear_cee.estimateCausalEffect()
print(f"ACE linear = {linear_tau_hat}, MAPE = {abs((linear_tau_hat-10)*10)} %")
print(f"ACE non linear = {non_linear_tau_hat}, MAPE = {abs((non_linear_tau_hat-10)*10)} %")
ACE linear = 0.027356079062413947, MAPE = 99.72643920937585 %
ACE non linear = 0.03652571470181629, MAPE = 99.63474285298183 %
As anticipated, using an incorrect adjustment set results in a heavily biased estimation. In this case, the ACE is close to zero, indicating that the estimator incorrectly predicts no causal effect of $T$ on $Y$.
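For contrast, here is a graph on which the frontdoor criterion would genuinely apply: $T$ affects $Y$ only through a mediator $M$, and an unobserved $U$ confounds $T$ and $Y$. This is a hedged sketch; we assume csl.CausalModel accepts a latent-variable descriptor of the form shown.
# frontdoor structure: T -> M -> Y, with a latent confounder U of T and Y
fd_bn = gum.fastBN("T->M->Y")
fd_model = csl.CausalModel(fd_bn, [("U", ["T", "Y"])], False)
cslnb.showCausalModel(fd_model, size="5")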
1.7 Causal Bayesian Network Estimation¶
To fully utilize the causal model within the estimation process, we will now use the Conditional Probability Tables of the Causal Bayesian Network via the CausalBNEstimator
, rather than relying solely on the underlying causal graph. The procedure will follow the same methodology as previously applied.
linear_cee.identifyAdjustmentSet(intervention="T", outcome="Y", verbose=False)
non_linear_cee.identifyAdjustmentSet(intervention="T", outcome="Y", verbose=False)
'Randomized Controlled Trial'
linear_cee.fitCausalBNEstimator()
non_linear_cee.fitCausalBNEstimator()
linear_tau_hat = linear_cee.estimateCausalEffect()
non_linear_tau_hat = non_linear_cee.estimateCausalEffect()
print(f"ACE linear = {linear_tau_hat}, MAPE = {abs((linear_tau_hat-10)*10)} %")
print(f"ACE non linear = {non_linear_tau_hat}, MAPE = {abs((non_linear_tau_hat-10)*10)} %")
ACE linear = 9.04886401302389, MAPE = 9.511359869761105 %
ACE non linear = 9.023563116890791, MAPE = 9.764368831092085 %
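Both estimates land around 9, roughly 10% below the theoretical ACE. As discussed in the remarks of Section 3.7, the discretization and parameter learning tend to smooth the conditional distributions, which pulls Bayesian-network-based estimates toward zero.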
2. ACE estimations from real RCT data¶
In this section, we will estimate the ACE under real-world RCT conditions.
2.1 Dataset¶
The data used in this notebook come from the Tennessee Student/Teacher Achievement Ratio (STAR) trial. This randomized controlled trial was designed to assess the effect of smaller class sizes in primary schools (T) on students' academic performance (Y).
The covariates in this study include:
- gender
- age
- g1freelunch: the number of lunches provided to the child per day
- g1surban: the location of the school (inner city, suburban, rural, or urban)
- ethnicity
# Preprocessing
# Load the data: read everything as strings, then cast
star_df = pd.read_csv("./res/STAR_data.csv", sep=",", dtype=str)
star_df = star_df.rename(columns={"race": "ethnicity"})
# Fill missing values
star_df = star_df.fillna({"g1freelunch": 0, "g1surban": 0})
drop_star_l = ["g1tlistss", "g1treadss", "g1tmathss", "g1classtype",
               "birthyear", "birthmonth", "birthday", "gender",
               "ethnicity", "g1freelunch", "g1surban"]
star_df = star_df.dropna(subset=drop_star_l, how='any')
# Cast value types before processing
star_df["gender"] = star_df["gender"].astype(int)
star_df["ethnicity"] = star_df["ethnicity"].astype(int)
star_df["g1freelunch"] = star_df["g1freelunch"].astype(int)
star_df["g1surban"] = star_df["g1surban"].astype(int)
star_df["g1classtype"] = star_df["g1classtype"].astype(int)
# Keep only class types 1 and 2 (the initial trial assigned three
# class types; the third was a regular-size class with a teaching assistant)
star_df = star_df[~(star_df["g1classtype"] == 3)].reset_index(drop=True)
# Compute the outcome
star_df["Y"] = (star_df["g1tlistss"].astype(int) +
                star_df["g1treadss"].astype(int) +
                star_df["g1tmathss"].astype(int)) / 3
# Compute the treatment
star_df["T"] = star_df["g1classtype"].apply(lambda x: 0 if x == 2 else 1)
# Transform the birth date into an age (note: if any field is NaN, the date is NaT)
star_df["date"] = pd.to_datetime(star_df["birthyear"] + "/"
                                 + star_df["birthmonth"] + "/"
                                 + star_df["birthday"], yearfirst=True, errors="coerce")
star_df["age"] = (np.datetime64("1985-01-01") - star_df["date"])
star_df["age"] = star_df["age"].dt.days / 365.25
# Keep only the covariates we consider predictive of the outcome
star_covariates_l = ["gender", "ethnicity", "age",
                     "g1freelunch", "g1surban"]
star_df = star_df[["Y", "T"] + star_covariates_l]
# Map numerical codes to categories
star_df["gender"] = star_df["gender"].apply(lambda x: "Girl" if x == 2 else "Boy").astype("category")
star_df["ethnicity"] = star_df["ethnicity"].map(
    {1: "White", 2: "Black", 3: "Asian",
     4: "Hispanic", 5: "Nat_American", 6: "Other"}).astype("category")
star_df["g1surban"] = star_df["g1surban"].map(
    {1: "Inner_city", 2: "Suburban",
     3: "Rural", 4: "Urban"}).astype("category")
star_df.describe()
| | Y | T | age | g1freelunch |
|---|---|---|---|---|
| count | 4215.000000 | 4215.000000 | 4215.000000 | 4215.000000 |
| mean | 540.095848 | 0.428233 | 4.879872 | 1.471886 |
| std | 39.267221 | 0.494881 | 0.465104 | 0.534171 |
| min | 439.333333 | 0.000000 | 3.129363 | 0.000000 |
| 25% | 511.333333 | 0.000000 | 4.525667 | 1.000000 |
| 50% | 537.333333 | 0.000000 | 4.818617 | 1.000000 |
| 75% | 566.000000 | 1.000000 | 5.111567 | 2.000000 |
| max | 670.666667 | 1.000000 | 7.225188 | 2.000000 |
There are more units in the control group; however, the control and treatment groups appear similar in distribution, suggesting that the ignorability assumption is likely satisfied.
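This similarity can be checked quickly by comparing covariate summaries across the two arms (an illustrative check, not part of the estimation pipeline):
# average of the numeric covariates per treatment arm
print(star_df.groupby("T")[["age", "g1freelunch"]].mean())
# share of each gender within each arm
print(pd.crosstab(star_df["T"], star_df["gender"], normalize="index"))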
We will explore how the CausalEffectEstimation
module can estimate the causal effect of $T$ on $Y$ in this dataset.
2.2 Structure Learning and Setup¶
In the absence of a predefined causal structure, structure learning is utilized to uncover the underlying relationships between the variables in the dataset. To facilitate this process, a slice order will be imposed on the variables. This approach will serve as the foundation for deriving the necessary causal structure for subsequent analysis.
To enable the application of structure learning algorithms, the variables will first be discretized using the discretizer
module. Following this, the causal structure will be derived using gum.BNLearner
.
discretizer = disc.Discretizer(defaultDiscretizationMethod='uniform')
discretizer.setDiscretizationParameters("age", 'uniform', 24)
discretizer.setDiscretizationParameters("Y", 'uniform', 30)
template = discretizer.discretizedTemplate(star_df)
learner = gum.BNLearner(star_df, template)
learner.useNMLCorrection()
learner.useSmoothingPrior(1e-6)
learner.setSliceOrder([["T", "ethnicity", "gender", "age"],
                       ["g1surban", "g1freelunch"], ["Y"]])
bn = learner.learnBN()
print(learner)
gnb.sideBySide(gexpl.getInformation(bn, size="50"),
               gnb.getInference(bn, size="50"))
Filename : C:\Users\phw\AppData\Local\Temp\tmpu9hb6hu9.csv
Size : (4215,7)
Variables : Y[30], T[2], gender[2], ethnicity[6], age[24], g1freelunch[3], g1surban[4]
Induced types : False
Missing values : False
Algorithm : MIIC
Correction : NML
Prior : Smoothing
Prior weight : 0.000001
Constraint Slice Order : {ethnicity:0, T:0, g1surban:1, age:0, gender:0, g1freelunch:1, Y:2}
This initial approach appears promising, as the inferred causal relationships are broadly consistent with what a non-expert might expect.
Now, given the causal structure, we are ready to instantiate the CausalEffectEstimation
class to perform estimation.
causal_model = csl.CausalModel(bn)
cee = csl.CausalEffectEstimation(star_df, causal_model)
2.3 Causal Identification¶
The next step involves formal causal identification. As expected, we identify the RCT adjustment, consistent with the experimental design.
cee.identifyAdjustmentSet(intervention="T", outcome="Y")
Randomized Controlled Trial adjustment found.
Supported estimators include:
- CausalModelEstimator
- DM
If the outcome variable is a cause of other covariates in the causal graph, Backdoor estimators may also be used.
'Randomized Controlled Trial'
2.4 Causal Effect Estimation¶
Once the adjustment is identified, we can use the appropriate estimators for the estimation.
cee.fitDM()
tau_hat = cee.estimateCausalEffect()
print(f"ACE = {tau_hat}")
ACE = 12.814738911047016
cee.fitCausalBNEstimator()
tau_hat = cee.estimateCausalEffect()
print(f"ACE = {tau_hat}")
ACE = 11.515235748777876
Let's evaluate how the backdoor adjustment estimators compare to the previously obtained estimates. In this analysis, we control for the g1freelunch
variable.
cee.useBackdoorAdjustment(intervention="T", outcome="Y", confounders={"g1freelunch"})
cee.fitSLearner()
tau_hat = cee.estimateCausalEffect()
print(f"ACE = {tau_hat}")
ACE = 11.616979201549725
cee.fitTLearner()
tau_hat = cee.estimateCausalEffect()
print(f"ACE = {tau_hat}")
ACE = 11.616705516924473
cee.fitIPW()
tau_hat = cee.estimateCausalEffect()
print(f"ACE = {tau_hat}")
ACE = 11.77382443916551
cee.fitPStratification()
tau_hat = cee.estimateCausalEffect()
print(f"ACE = {tau_hat}")
ACE = 10.920977442188093
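The backdoor estimators agree closely with the causal-BN estimate, placing the effect of smaller classes at roughly 11 to 12 points, slightly below the difference-in-means estimate of about 12.8. The positive effect of smaller class sizes therefore appears robust across adjustment strategies.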
3. ACE estimations from generated observational data¶
Next, we will estimate the ACE using generated data that does not adhere to RCT conditions. The generative model follows the specification of Lunceford & Davidian (2004).
3.1 Dataset¶
Consider the following generative model:
- The outcome variable $Y$ is generated according to the following equation: $$ \begin{align*} Y & = - X_1 + X_2 - X_3 +2 T -V_1 + V_2 + V_3 \\ & = \langle \nu , \boldsymbol{Z} \rangle + \langle \xi, \boldsymbol{V} \rangle \end{align*}$$
Where $\nu = (0, -1, 1, -1, 2)^\intercal$, $\boldsymbol{Z} = (1, X_1, X_2, X_3, T)^\intercal$, $\xi = (-1,1,1)^\intercal$ and $\boldsymbol{V} = (V_1, V_2, V_3)^\intercal$.
- The covariates are distributed as $X_3 \sim \text{Bernoulli}\left(0.2\right)$. Conditionally on $X_3$, the distribution of the other variables is defined as:
If $X_{3} = 0$, $V_3 \sim \text{Bernoulli}\left(0.25\right)$ and ${\left(X_{1}, V_{1}, X_{2}, V_{2}\right)}^{\intercal} \sim \mathcal{N}_4({\tau}_{0}, \Sigma)$
If $ X_{3} = 1$, $V_3 \sim \text{Bernoulli}\left(0.75\right)$ and ${\left(X_{1}, V_{1}, X_{2}, V_{2}\right)}^{\intercal} \sim \mathcal{N}_4({\tau}_{1}, \Sigma)$ with $$ {\tau}_{1} = \left(\begin{array}{c} 1 \\ 1 \\ -1\\ -1 \end{array}\right), {\tau}_{0} = \left(\begin{array}{c} -1 \\ -1 \\ 1\\ 1 \end{array}\right) \ \text{and} \ \Sigma = \left(\begin{array}{cccc} 1 & 0.5 & -0.5 & -0.5\\ 0.5 & 1 & -0.5 & -0.5 \\ -0.5 & -0.5 & 1 & 0.5 \\ -0.5 & -0.5 & 0.5 & 1 \end{array} \right)$$
- The treatment $T$ is generated as a Bernoulli of the propensity score: $$ \begin{align*} \mathbb{P}[T=1|X] &= e\left(X, \beta\right) \\ &= (1+\exp (-0.6 X_{1} +0.6 X_{2} - 0.6 X_{3}))^{-1} \\ &= \frac{1}{1+e^{-\langle \beta , \boldsymbol{X} \rangle}} \\ \mathbb{P}[T=0|X] &= 1-\mathbb{P}[T=1|X] \end{align*}$$ With $\beta = {\left(0, 0.6, -0.6, 0.6\right)}^{\intercal}$ and $\boldsymbol{X} = (1, X_1, X_2, X_3)^{\intercal}$.
# Model parameters
XI = np.array([-1, 1, 1])
NU = np.array([0, -1, 1, -1, 2])
BETA = np.array([0, 0.6, -0.6, 0.6])
TAU_0 = np.array([-1, -1, 1, 1])
TAU_1 = TAU_0 * -1
SIGMA = np.array([[1, 0.5, -0.5, -0.5],
                  [0.5, 1, -0.5, -0.5],
                  [-0.5, -0.5, 1, 0.5],
                  [-0.5, -0.5, 0.5, 1]], dtype=float)
def generate_lunceford(n=100000):
    # Generate X3 and, conditionally on it, V3
    x3 = np.random.binomial(1, 0.2, n)
    v3 = np.random.binomial(1, (0.75 * x3 + (0.25 * (1 - x3))), n)
    # If x3=0 draw from one Gaussian model, if x3=1 from the other
    x1v1x2v2_x3_0_matrix = np.random.multivariate_normal(TAU_0, SIGMA, size=n, check_valid='warn', tol=1e-8)
    x1v1x2v2_x3_1_matrix = np.random.multivariate_normal(TAU_1, SIGMA, size=n, check_valid='warn', tol=1e-8)
    x1v1x2v2_x3 = np.where(np.repeat(x3[:, np.newaxis], 4, axis=1) == 0,
                           x1v1x2v2_x3_0_matrix, x1v1x2v2_x3_1_matrix)
    # Concatenate values
    xv = np.concatenate([x1v1x2v2_x3, np.expand_dims(x3, axis=1), np.expand_dims(v3, axis=1)], axis=1)
    # Compute the propensity score e, the treatment a, and the outcome y
    x = xv[:, [0, 2, 4]]
    v = xv[:, [1, 3, 5]]
    e = np.power(1 + np.exp(- BETA[0] - x.dot(BETA[1:])), -1)
    a = np.random.binomial(1, e, n)
    y = x.dot(NU[1:-1]) + v.dot(XI) + a*NU[-1] + np.random.binomial(1, e, n) + np.random.normal(0, 1, n)
    # Create the final DataFrame
    synthetic_data_df = pd.DataFrame(
        np.concatenate([x, np.expand_dims(a, axis=1), v, np.expand_dims(y, axis=1)], axis=1),
        columns=["X1", "X2", "X3", "T", "V1", "V2", "V3", "Y"])
    synthetic_data_df["X3"] = synthetic_data_df["X3"].astype(int)
    synthetic_data_df["V3"] = synthetic_data_df["V3"].astype(int)
    synthetic_data_df["T"] = synthetic_data_df["T"].astype(int)
    return synthetic_data_df
df = generate_lunceford()
df.head()
| | X1 | X2 | X3 | T | V1 | V2 | V3 | Y |
|---|---|---|---|---|---|---|---|---|
| 0 | -1.049773 | 1.647873 | 0 | 0 | -1.848568 | 2.006223 | 0 | 6.131881 |
| 1 | -1.651112 | 1.396177 | 0 | 0 | -1.143154 | 0.800736 | 0 | 5.947488 |
| 2 | -1.838529 | 1.080599 | 0 | 1 | -0.937024 | 0.384680 | 1 | 7.247444 |
| 3 | -1.092876 | 0.609056 | 0 | 0 | -0.608222 | -0.221501 | 0 | 3.418656 |
| 4 | 0.241816 | -0.473207 | 0 | 0 | 0.990614 | 0.297986 | 0 | -0.809088 |
Here, the exact ACE can be calculated explicitly using the previously defined assumptions.
$$ \begin{align*} \mathbb{E}[Y(1) - Y(0)] &= \mathbb{E}_{X,V}[\mathbb{E}[Y \mid T=1, X, V] - \mathbb{E}[Y \mid T=0, X, V]]\\ &= \mathbb{E} [ (- X_1 + X_2 - X_3 + 2 \times 1 -V_1 + V_2 + V_3)] - \mathbb{E}[(- X_1 + X_2 - X_3 + 2 \times 0 -V_1 + V_2 + V_3 ) ] \\ &= 2 \end{align*} $$
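Unlike the RCT settings above, the treatment here depends on $(X_1, X_2, X_3)$, so the naive difference in group means no longer targets the ACE; the gap between this quantity and 2 is the confounding bias. An illustrative check:
# naive difference in means; biased here because T depends on X1, X2, X3
naive_dm = (df.loc[df["T"] == 1, "Y"].mean()
            - df.loc[df["T"] == 0, "Y"].mean())
print(f"Naive difference in means = {naive_dm}")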
3.2 Setup¶
Given that we know the data-generating mechanism, we can directly specify the causal structure by constructing a Causal Bayesian Network. To utilize the CausalBNEstimator, we will first employ the discretizer module to determine the domains of the variables.
discretizer = disc.Discretizer(
    defaultDiscretizationMethod='uniform',
    defaultNumberOfBins=10
)
discretizer.setDiscretizationParameters("Y", 'uniform', 60)
bn = discretizer.discretizedTemplate(df)
bn.beginTopologyTransformation()
for _, name in bn:
    if name != "Y":
        bn.addArc(name, "Y")
for X in ["X1", "X2", "X3"]:
    bn.addArc(X, "T")
for XV in ["X1", "V1", "X2", "V2"]:
    bn.addArc("X3", XV)
bn.addArc("X3", "V3")
bn.endTopologyTransformation()
causal_model = csl.CausalModel(bn)
cslnb.showCausalModel(causal_model, size="10")
cee = csl.CausalEffectEstimation(df, causal_model)
3.3 Causal Identification¶
cee.identifyAdjustmentSet(intervention="T", outcome="Y")
Backdoor adjustment found.
Supported estimators include:
- CausalModelEstimator
- SLearner
- TLearner
- XLearner
- PStratification
- IPW
'Backdoor'
3.4 Causal Estimation¶
cee.fitCausalBNEstimator()
tau_hat = cee.estimateCausalEffect()
print(f"ACE linear = {tau_hat}, MAPE = {abs((tau_hat-2)/2*100)} %")
ACE linear = 1.4049944870061684, MAPE = 29.75027564969158 %
cee.fitSLearner()
tau_hat = cee.estimateCausalEffect()
print(f"ACE linear = {tau_hat}, MAPE = {abs((tau_hat-2)/2)*100} %")
ACE linear = 2.003169942031878, MAPE = 0.15849710159390185 %
cee.fitTLearner()
tau_hat = cee.estimateCausalEffect()
print(f"ACE linear = {tau_hat}, MAPE = {abs((tau_hat-2)/2)*100} %")
ACE linear = 1.9978358695427927, MAPE = 0.1082065228603657 %
cee.fitIPW()
tau_hat = cee.estimateCausalEffect()
print(f"ACE linear = {tau_hat}, MAPE = {abs((tau_hat-2)/2)*100} %")
ACE linear = 2.0917446591374964, MAPE = 4.5872329568748205 %
3.5 Structure Learning with imposed slice order¶
We will assess the ability of the BNLearner
module to uncover the causal structure of the dataset. A slice order will be imposed on the variables to guide the learning process. Once the structure has been identified by the structure learning algorithm, we will estimate the causal effect based on the inferred structure.
template = discretizer.discretizedTemplate(df)
structure_learner = gum.BNLearner(df, template)
structure_learner.useNMLCorrection()
structure_learner.useSmoothingPrior(1e-9)
structure_learner.setSliceOrder([["X3","X1","X2","V1","V2","V3"], ["T"], ["Y"]])
learned_bn = structure_learner.learnBN()
learned_causal_model = csl.CausalModel(learned_bn)
print(structure_learner)
Filename : C:\Users\phw\AppData\Local\Temp\tmpsjhxalx8.csv
Size : (100000,8)
Variables : X1[10], X2[10], X3[2], T[2], V1[10], V2[10], V3[2], Y[60]
Induced types : False
Missing values : False
Algorithm : MIIC
Correction : NML
Prior : Smoothing
Prior weight : 0.000000
Constraint Slice Order : {T:1, X2:0, V3:0, V1:0, Y:2, X3:0, X1:0, V2:0}
gnb.sideBySide(
    gexpl.getInformation(learned_bn, size="50"),
    gnb.getInference(learned_bn, size="50")
)
Subsequently, we will reuse the previously established pipeline to estimate the causal effect based on the identified structure.
cee = csl.CausalEffectEstimation(df, learned_causal_model)
cee.identifyAdjustmentSet(intervention="T", outcome="Y")
Backdoor adjustment found.
Supported estimators include:
- CausalModelEstimator
- SLearner
- TLearner
- XLearner
- PStratification
- IPW
'Backdoor'
cee.fitCausalBNEstimator()
tau_hat = cee.estimateCausalEffect()
print(f"ACE linear = {tau_hat}, MAPE = {abs((tau_hat-2)/2*100)} %")
ACE linear = 1.2682577474049224, MAPE = 36.58711262975388 %
cee.fitSLearner()
tau_hat = cee.estimateCausalEffect()
print(f"ACE linear = {tau_hat}, MAPE = {abs((tau_hat-2)/2)*100} %")
ACE linear = 2.003169942031878, MAPE = 0.15849710159390185 %
cee.fitTLearner()
tau_hat = cee.estimateCausalEffect()
print(f"ACE linear = {tau_hat}, MAPE = {abs((tau_hat-2)/2)*100} %")
ACE linear = 1.9978358695427927, MAPE = 0.1082065228603657 %
cee.fitIPW()
tau_hat = cee.estimateCausalEffect()
print(f"ACE linear = {tau_hat}, MAPE = {abs((tau_hat-2)/2)*100} %")
ACE linear = 2.0917446591374964, MAPE = 4.5872329568748205 %
We observe that the results remain largely consistent when imposing a slice order on the variables. However, the CausalBNEstimator
appears to be the most sensitive to structural changes, as it relies on the entire graph structure for its estimations.
3.6 Structure Learning without imposed slice order¶
Next, we will evaluate the estimations produced when allowing the algorithm to perform structure learning autonomously, without imposing any slice order.
template = discretizer.discretizedTemplate(df)
structure_learner = gum.BNLearner(df, template)
structure_learner.useNMLCorrection()
structure_learner.useSmoothingPrior(1e-9)
learned_bn = structure_learner.learnBN()
learned_causal_model = csl.CausalModel(learned_bn)
print(structure_learner)
Filename : C:\Users\phw\AppData\Local\Temp\tmp87r_07ne.csv
Size : (100000,8)
Variables : X1[10], X2[10], X3[2], T[2], V1[10], V2[10], V3[2], Y[60]
Induced types : False
Missing values : False
Algorithm : MIIC
Correction : NML
Prior : Smoothing
Prior weight : 0.000000
gnb.sideBySide(
    gexpl.getInformation(learned_bn, size="50"),
    gnb.getInference(learned_bn, size="50")
)
We are encountering challenges at the outset, as numerous causal relationships are misspecified, with multiple covariates incorrectly identified as causes of the outcome. Despite this, the intervention variable is correctly identified as a cause of the outcome. With limited expectations, let us now examine the estimation results.
cee = csl.CausalEffectEstimation(df, learned_causal_model)
cee.identifyAdjustmentSet(intervention="T", outcome="Y")
Randomized Controlled Trial adjustment found.
Supported estimators include:
- CausalModelEstimator
- DM
If the outcome variable is a cause of other covariates in the causal graph, Backdoor estimators may also be used.
'Randomized Controlled Trial'
cee.fitCausalBNEstimator()
tau_hat = cee.estimateCausalEffect()
print(f"ACE linear = {tau_hat}, MAPE = {abs((tau_hat-2)/2*100)} %")
ACE linear = 0.8778137094311345, MAPE = 56.10931452844328 %
cee.fitDM()
tau_hat = cee.estimateCausalEffect()
print(f"ACE linear = {tau_hat}, MAPE = {abs((tau_hat-2)/2)*100} %")
ACE linear = -2.8266585405983644, MAPE = 241.3329270299182 %
The results are exceptionally poor. In an attempt to address this issue, we will investigate whether controlling for covariates that are identified as causes of the outcome in the causal graph improves the estimation.
cee.useBackdoorAdjustment(intervention="T", outcome="Y", confounders={"V1", "V2"})
cee.fitSLearner()
tau_hat = cee.estimateCausalEffect()
print(f"ACE linear = {tau_hat}, MAPE = {abs((tau_hat-2)/2)*100} %")
ACE linear = 1.0175205331546058, MAPE = 49.123973342269714 %
cee.fitTLearner()
tau_hat = cee.estimateCausalEffect()
print(f"ACE linear = {tau_hat}, MAPE = {abs((tau_hat-2)/2)*100} %")
ACE linear = 1.0490841845358503, MAPE = 47.545790773207486 %
One might assume that the estimation methods are not sufficiently sophisticated, given that the default meta-learners utilize Linear Regression as the base model. To address this, we will assess whether employing XGBRegressor
as the base learner within the XLearner
framework improves the estimation results.
cee.fitXLearner(learner="XGBRegressor")
tau_hat = cee.estimateCausalEffect()
print(f"ACE linear = {tau_hat}, MAPE = {abs((tau_hat-2)/2)*100} %")
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
File ~\scoop\apps\python\current\Lib\site-packages\pyAgrum\causal\causalEffectEstimation\_learners.py:85, in learnerFromString(learner_string)
     84 case "XGBRegressor":
---> 85     return xgboost.XGBRegressor()
     86 case "XGBClassifier":

NameError: name 'xgboost' is not defined

During handling of the above exception, another exception occurred:

NameError                                 Traceback (most recent call last)
Cell In[55], line 1
----> 1 cee.fitXLearner(learner="XGBRegressor")
      2 tau_hat = cee.estimateCausalEffect()
      4 print(f"ACE linear = {tau_hat}, MAPE = {abs((tau_hat-2)/2)*100} %")

File ~\scoop\apps\python\current\Lib\site-packages\pyAgrum\causal\causalEffectEstimation\_CausalEffectEstimation.py:683, in CausalEffectEstimation.fitXLearner(self, **estimator_params)
    680 if self._X is None or len(self._X) == 0:
    681     raise RCTError("XLearner")
--> 683 self._estimator = XLearner(**estimator_params)
    684 self._fitEstimator()

File ~\scoop\apps\python\current\Lib\site-packages\pyAgrum\causal\causalEffectEstimation\_backdoorEstimators.py:344, in XLearner.__init__(self, learner, control_outcome_learner, treatment_outcome_learner, control_effect_learner, treatment_effect_learner, propensity_score_learner)
    341 self.treatment_effect_learner = learnerFromString(
    342     "LinearRegression")
    343 elif isinstance(learner, str):
--> 344 self.control_outcome_learner = learnerFromString(learner)
    345 self.treatment_outcome_learner = learnerFromString(learner)
    346 self.control_effect_learner = learnerFromString(learner)

File ~\scoop\apps\python\current\Lib\site-packages\pyAgrum\causal\causalEffectEstimation\_learners.py:95, in learnerFromString(learner_string)
     93     raise sklearn_import_error
     94 if error.name == "xgboost" and not found_xgboost:
---> 95     raise xgboost_import_error

NameError: name 'xgboost_import_error' is not defined
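The traceback shows that the optional xgboost dependency is missing from this environment; installing it, for example with the cell below, lets the estimator run.
%pip install xgboost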
When xgboost is available, the results show only marginal improvement, insufficient to draw reliable conclusions about the true causal effect.
This underscores the critical role of domain knowledge in accurately specifying the underlying causal structure. Even the most sophisticated estimation methods can falter when the causal structure is incorrectly specified.
3.7 Remarks¶
Overall, we observe that Bayesian Network-based estimators consistently underestimate the ACE. This underestimation arises due to the discretization and learning processes, which tend to homogenize the data by averaging it out. Since the ACE is computed as the difference between two expectations, this induced homogeneity reduces the variance between the groups, leading to a smaller difference and consequently an underestimation of the ACE.
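One possible mitigation, under the assumption that the averaging effect shrinks as the outcome grid gets finer, is to re-run the pipeline of Section 3.2 with more bins for Y, at the cost of more parameters to learn (see the trade-off discussed in Section 1.2). A sketch:
# illustrative: double the outcome resolution used in Section 3.2
finer_discretizer = disc.Discretizer(defaultDiscretizationMethod='uniform',
                                     defaultNumberOfBins=10)
finer_discretizer.setDiscretizationParameters("Y", 'uniform', 120)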