
Causal Forest DML has very wide confidence interval #893

Open
silulyu opened this issue Jul 1, 2024 · 1 comment

Comments

@silulyu

silulyu commented Jul 1, 2024

Below is my code to estimate treatment effects. The Causal Forest DML model produces a much wider confidence interval for the ATT (roughly [-200k, 900k]) than the linear DML model does ([200k, 400k]). Are there any ways to make the Causal Forest DML CI narrower, and ideally statistically significant?

# Imports (rf_reg, xgb_class, df, X, Y, T, treatment_variable defined elsewhere)
from sklearn.model_selection import StratifiedKFold
from econml.dml import LinearDML, CausalForestDML
import pandas as pd

# Linear DML for the ATE
dml = LinearDML(
    model_y=rf_reg,
    model_t=xgb_class,
    discrete_treatment=True,
    random_state=0,
    cv=StratifiedKFold(5))

print('Fitting linear DML...')
dml.fit(Y=Y, T=T, W=X)
ate = dml.intercept_
ate_lb, ate_ub = dml.intercept__interval()

print('DML ATE:', round(ate, 2), 'CI [', round(ate_lb, 2), ',', round(ate_ub, 2), ']', '$')

# Causal Forest for the ITE
cf = CausalForestDML(
    model_y=rf_reg,
    model_t=xgb_class,
    discrete_treatment=True,
    cv=StratifiedKFold(5),
    random_state=0,
    n_estimators=300,
)

print('Fitting causal forest...')
cf.fit(Y=Y, T=T, X=X, cache_values=True)

# ITE estimates with lower and upper bounds
ite_estimates = cf.effect(X)
lb_estimates, ub_estimates = cf.effect_interval(X)

# DataFrame with individual ITE estimates and interval bounds
all_individual_effects_df = pd.DataFrame({
    'ITE': ite_estimates,
    'ITE_lb': lb_estimates,
    'ITE_ub': ub_estimates
}, index=df.index)

# Concatenate with other relevant data
all_ITEs = pd.concat([df[['sfdc_customer_id']], T, Y, all_individual_effects_df], axis=1)

# ATT for the treated group
# (note: averaging per-unit interval endpoints gives a rough summary,
# not a formal confidence interval for the ATT)
treated = all_ITEs[all_ITEs[treatment_variable] == 1]
att = treated['ITE'].mean()
att_lb = treated['ITE_lb'].mean()
att_ub = treated['ITE_ub'].mean()

print('CF ATT:', round(att, 2), 'CI [', round(att_lb, 2), ',', round(att_ub, 2), ']', '$')
@kbattocchi
Collaborator

One thing you can try is the tune method on the forest before fitting, which should help you set appropriate hyperparameters. You might also do some model selection for your first-stage models to ensure that you're getting the best possible first-stage fits.

However, in general you should expect confidence intervals from forest-based methods to be wider than those from linear regression: the linear model is much more restrictive and therefore easier to estimate. But keep in mind that a confidence interval is only valid if the model's assumptions are met, so if the true data-generating process is not linear, those tighter linear-DML bounds are not necessarily correct!
