Merge pull request #447 from uber/huigang/doc_update
Documentation Update on Instrument Variable Methods
paullo0106 authored Dec 30, 2021
2 parents 1365562 + 4677a52 commit 8ba7528
Showing 4 changed files with 157 additions and 14 deletions.
7 changes: 3 additions & 4 deletions docs/about.rst
@@ -21,11 +21,10 @@ The package currently supports the following methods:
- :ref:`T-learner`
- :ref:`X-learner`
- :ref:`R-learner`
- Doubly Robust (DR) learner
- TMLE learner
- :ref:`Doubly Robust (DR) learner`
- Instrumental variables algorithms
- 2-Stage Least Squares (2SLS)
- Doubly Robust (DR) IV
- :ref:`2-Stage Least Squares (2SLS)`
- :ref:`Doubly Robust Instrumental Variable (DRIV) learner`
- Neural network based algorithms
- CEVAE
- DragonNet
114 changes: 110 additions & 4 deletions docs/methodology.rst
@@ -110,6 +110,50 @@ Estimate treatment effects by minimising the R-loss, :math:`\hat{L}_n(\tau(x))`:
where :math:`\hat{e}^{(-i)}(X_i)`, etc. denote the out-of-fold held-out predictions made without using the :math:`i`-th training sample.

Doubly Robust (DR) learner
~~~~~~~~~~~~~~~~~~~~~~~~~~

The DR-learner :cite:`kennedy2020optimal` estimates the CATE by cross-fitting a doubly robust score function in two stages, as follows. We start by randomly splitting the data :math:`\{Y, X, W\}` into 3 partitions :math:`\{Y^i, X^i, W^i\}, i=\{1,2,3\}`.

**Stage 1**

Fit a propensity score model :math:`\hat{e}(x)` with machine learning using :math:`\{X^1, W^1\}`, and fit outcome regression models :math:`\hat{m}_0(x)` and :math:`\hat{m}_1(x)` for treated and untreated users with machine learning using :math:`\{Y^2, X^2, W^2\}`.

**Stage 2**

Use machine learning to fit the CATE model, :math:`\hat{\tau}(X)`, from the pseudo-outcome

.. math::
\phi = \frac{W-\hat{e}(X)}{\hat{e}(X)(1-\hat{e}(X))}\left(Y-\hat{m}_W(X)\right)+\hat{m}_1(X)-\hat{m}_0(X)
with :math:`\{Y^3, X^3, W^3\}`

**Stage 3**

Repeat Stages 1 and 2 twice more, rotating the roles of the partitions. First use :math:`\{Y^2, X^2, W^2\}`, :math:`\{Y^3, X^3, W^3\}`, and :math:`\{Y^1, X^1, W^1\}` for the propensity score model, the outcome models, and the CATE model, respectively. Then use :math:`\{Y^3, X^3, W^3\}`, :math:`\{Y^2, X^2, W^2\}`, and :math:`\{Y^1, X^1, W^1\}` in the same roles. The final CATE model is the average of the three CATE models.
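As an illustration, the two estimation stages can be sketched for a single partition assignment with generic scikit-learn models (a minimal sketch, not the causalml API; function and variable names are hypothetical):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingRegressor

def dr_pseudo_outcome(X1, W1, X2, W2, Y2, X3, W3, Y3):
    # Stage 1a: propensity score model on partition 1
    e = LogisticRegression().fit(X1, W1)
    # Stage 1b: outcome models on partition 2, one per treatment arm
    m0 = GradientBoostingRegressor().fit(X2[W2 == 0], Y2[W2 == 0])
    m1 = GradientBoostingRegressor().fit(X2[W2 == 1], Y2[W2 == 1])
    # Stage 2: doubly robust pseudo-outcome phi on partition 3
    e3 = np.clip(e.predict_proba(X3)[:, 1], 0.01, 0.99)
    m_w = np.where(W3 == 1, m1.predict(X3), m0.predict(X3))
    phi = (W3 - e3) / (e3 * (1 - e3)) * (Y3 - m_w) \
        + m1.predict(X3) - m0.predict(X3)
    # CATE model fitted to the pseudo-outcome
    return GradientBoostingRegressor().fit(X3, phi)
```

Averaging the CATE models from the three rotations of the partitions then gives the final estimate.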

Doubly Robust Instrumental Variable (DRIV) learner
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

We combine the idea of the DR-learner :cite:`kennedy2020optimal` with the doubly robust score function for the LATE described in :cite:`10.1111/ectj.12097` to estimate the conditional LATE. Toward that end, we start by randomly splitting the data :math:`\{Y, X, W, Z\}` into 3 partitions :math:`\{Y^i, X^i, W^i, Z^i\}, i=\{1,2,3\}`.

**Stage 1**

Fit propensity score models :math:`\hat{e}_0(x)` and :math:`\hat{e}_1(x)` for unassigned and assigned users with machine learning using :math:`\{X^1, W^1, Z^1\}`, and fit outcome regression models :math:`\hat{m}_0(x)` and :math:`\hat{m}_1(x)` for unassigned and assigned users with machine learning using :math:`\{Y^2, X^2, Z^2\}`. The assignment probability, :math:`p_Z`, can either be user-provided or come from a simple model, since in most use cases the assignment is random by design.

**Stage 2**

Use machine learning to fit the conditional :ref:`LATE` model, :math:`\hat{\tau}(X)`, by minimizing the following loss function

.. math::
L(\hat{\tau}(X)) = \hat{E} &\left[\left(\hat{m}_1(X)-\hat{m}_0(X)+\frac{Z(Y-\hat{m}_1(X))}{p_Z}-\frac{(1-Z)(Y-\hat{m}_0(X))}{1-p_Z} \right.\right.\\
&\left.\left.\quad -\Big(\hat{e}_1(X)-\hat{e}_0(X)+\frac{Z(W-\hat{e}_1(X))}{p_Z}-\frac{(1-Z)(W-\hat{e}_0(X))}{1-p_Z}\Big) \hat{\tau}(X) \right)^2\right]
with :math:`\{Y^3, X^3, W^3, Z^3\}`

**Stage 3**

Similar to the DR-learner, repeat Stages 1 and 2 twice more with different permutations of the partitions for estimation. The final conditional LATE model is the average of the three conditional LATE models.
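For intuition, the stage-2 minimization can be reduced to a weighted regression: writing the loss as :math:`\hat{E}[(A - B\,\hat{\tau}(X))^2]`, with :math:`A` and :math:`B` the two doubly robust scores in the display above, the minimizer is a regression of :math:`A/B` on :math:`X` with weights :math:`B^2`. A hypothetical sketch using scikit-learn, with the stage-1 predictions passed in as arrays (names are illustrative, not the causalml API):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def driv_stage2(X3, W3, Z3, Y3, m0x, m1x, e0x, e1x, p_z):
    # Doubly robust score for the outcome (first bracket in the loss)
    A = m1x - m0x + Z3 * (Y3 - m1x) / p_z - (1 - Z3) * (Y3 - m0x) / (1 - p_z)
    # Doubly robust score for treatment uptake (second bracket)
    B = e1x - e0x + Z3 * (W3 - e1x) / p_z - (1 - Z3) * (W3 - e0x) / (1 - p_z)
    # Guard against near-zero denominators before forming the ratio target
    B = np.where(B >= 0, np.maximum(B, 1e-3), np.minimum(B, -1e-3))
    # Minimizing E[(A - B*tau)^2] == regressing A/B on X with weights B^2
    return GradientBoostingRegressor().fit(X3, A / B, sample_weight=B ** 2)
```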

Tree-Based Algorithms
---------------------
@@ -251,12 +295,74 @@ In this way, the IPTW approach can be seen as creating an artificial population

One of the possible benefits of IPTW compared to matching is that less data may be discarded due to lack of overlap between treated and non-treated units. A known problem with the approach is that extreme propensity scores can generate highly variable estimators. Different methods have been proposed for trimming and normalizing the IPT weights (:cite:`https://doi.org/10.1111/1468-0262.00442`). An overview of the IPTW approach can be found in :cite:`https://doi.org/10.1002/sim.6607`.

Instrumental variables
~~~~~~~~~~~~~~~~~~~~~~
2-Stage Least Squares (2SLS)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

One of the basic requirements for identifying the treatment effect of :math:`W` on :math:`Y` is that :math:`W` is orthogonal to the potential outcome of :math:`Y`, conditional on the covariates :math:`X`. This may be violated if both :math:`W` and :math:`Y` are affected by an unobserved variable, the error term after removing the true effect of :math:`W` from :math:`Y`, that is not in :math:`X`. In this case, the instrumental variables approach attempts to estimate the effect of :math:`W` on :math:`Y` with the help of a third variable :math:`Z` that is correlated with :math:`W` but is uncorrelated with the error term. In other words, the instrument :math:`Z` is only related with :math:`Y` through the directed path that goes through :math:`W`. If these conditions are satisfied, in the case without covariates, the effect of :math:`W` on :math:`Y` can be estimated using the sample analog of:

.. math::
\frac{Cov(Y_i, Z_i)}{Cov(W_i, Z_i)}
The most common method for instrumental variables estimation is the two-stage least squares (2SLS). In this approach, the cause variable :math:`W` is first regressed on the instrument :math:`Z`. Then, in the second stage, the outcome of interest :math:`Y` is regressed on the predicted value from the first-stage model. Intuitively, the effect of :math:`W` on :math:`Y` is estimated by using only the proportion of variation in :math:`W` due to variation in :math:`Z`. Specifically, assume that we have the linear model

.. math::
Y = W \alpha + X \beta + u = \Xi \gamma + u
Here, for convenience, we let :math:`\Xi=[W, X]` and :math:`\gamma=[\alpha', \beta']'`. Assume that we have instrumental variables :math:`Z` with at least as many columns as :math:`W`, and let :math:`\Omega=[Z, X]`. The 2SLS estimator is then

.. math::
\hat{\gamma}_{2SLS} = \left[\Xi'\Omega (\Omega'\Omega)^{-1} \Omega' \Xi\right]^{-1}\left[\Xi'\Omega(\Omega'\Omega)^{-1}\Omega'Y\right].
See :cite:`10.1257/jep.15.4.69` for a detailed discussion of the method.
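The closed-form estimator above can be checked with a short NumPy simulation (the data-generating process below is invented for illustration; the two-stage projection form used here is algebraically equivalent to the bracketed expression above):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4000
u = rng.normal(size=n)                    # unobserved confounder
X = rng.normal(size=(n, 1))               # exogenous covariate
Z = rng.normal(size=(n, 1))               # instrument: shifts W, not Y directly
W = (0.8 * Z[:, 0] + 0.5 * u + rng.normal(size=n)).reshape(-1, 1)
y = 2.0 * W[:, 0] + 1.0 * X[:, 0] + u     # true alpha = 2, beta = 1

Xi = np.hstack([W, X])                    # Xi = [W, X]
Om = np.hstack([Z, X])                    # Omega = [Z, X]
# First stage: project Xi onto the column space of Omega
Xi_hat = Om @ np.linalg.solve(Om.T @ Om, Om.T @ Xi)
# Second stage: regress y on the projected regressors
gamma = np.linalg.solve(Xi_hat.T @ Xi, Xi_hat.T @ y)  # gamma approx [2, 1]
```

Naive OLS of :math:`Y` on :math:`W` would be biased by the confounder :math:`u`; the 2SLS estimate recovers the true coefficient on :math:`W`.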

LATE
~~~~

In many situations the treatment :math:`W` may depend on a user's own choice and cannot be administered directly in an experimental setting. However, one can randomly assign users into treatment/control groups so that users in the treatment group can be nudged to take the treatment. This is the case of noncompliance, where users may fail to comply with their assignment status, :math:`Z`, as to whether or not to take the treatment. Similar to the section on value optimization methods, there are in general 3 types of users in this situation:

* **Compliers** Those who will take the treatment if and only if they are assigned to the treatment group.
* **Always-Takers** Those who will take the treatment regardless of which group they are assigned to.
* **Never-Takers** Those who will not take the treatment regardless of which group they are assigned to.

For identification purposes, one assumes that there are no Defiers, i.e. those who would take the treatment only if they were assigned to the control group.

In this case one can measure the treatment effect of Compliers,

.. math::
\hat{\tau}_{Complier}=\frac{E[Y|Z=1]-E[Y|Z=0]}{E[W|Z=1]-E[W|Z=0]}
This is the Local Average Treatment Effect (LATE). The estimator is also equivalent to 2SLS if we take the assignment status, :math:`Z`, as an instrument.
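The sample analog of the estimator above is a ratio of two differences in means (a minimal sketch; the function name is hypothetical):

```python
import numpy as np

def wald_late(y, w, z):
    """Wald/LATE estimator: ITT effect on Y divided by the first-stage effect on W."""
    itt_y = y[z == 1].mean() - y[z == 0].mean()   # E[Y|Z=1] - E[Y|Z=0]
    itt_w = w[z == 1].mean() - w[z == 0].mean()   # E[W|Z=1] - E[W|Z=0]
    return itt_y / itt_w
```

For example, if half of the assigned group takes the treatment, nobody in the control group does, and compliers gain 3 in outcome, the estimator returns 3.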


Targeted maximum likelihood estimation (TMLE) for ATE
-----------------------------------------------------

Targeted maximum likelihood estimation (TMLE) :cite:`tmle` provides a doubly robust semiparametric method that "targets" the average treatment effect directly, with the aid of machine learning algorithms. Compared to other methods, including outcome regression and inverse probability of treatment weighting, TMLE usually gives better performance, especially when dealing with skewed treatments and outliers.

Given a binary treatment :math:`W`, covariates :math:`X`, and outcome :math:`Y`, the TMLE for the ATE proceeds in the following steps.

**Step 1**

Use cross-fitting to estimate the propensity score :math:`\hat{e}(x)`, the predicted outcome for the treated :math:`\hat{m}_1(x)`, and the predicted outcome for the control :math:`\hat{m}_0(x)` with machine learning.

**Step 2**

Scale :math:`Y` into :math:`\tilde{Y}=\frac{Y-\min Y}{\max Y - \min Y}` so that :math:`\tilde{Y} \in [0,1]`. Use the same scale function to transform :math:`\hat{m}_i(x)` into :math:`\tilde{m}_i(x)`, :math:`i=0,1`. Clip the scaled functions so that their values stay in the unit interval.

**Step 3**

Let :math:`Q=\log(\tilde{m}_W(X)/(1-\tilde{m}_W(X)))`. Maximize the following pseudo log-likelihood function

.. math::
\max_{h_0, h_1} -\frac{1}{N} \sum_i & \left[ \tilde{Y}_i \log \left(1+\exp\left(-Q_i-h_0 \frac{1-W_i}{1-\hat{e}(X_i)}-h_1 \frac{W_i}{\hat{e}(X_i)}\right) \right) \right. \\
&\quad\left.+(1-\tilde{Y}_i)\log\left(1+\exp\left(Q_i+h_0\frac{1-W_i}{1-\hat{e}(X_i)}+h_1\frac{W_i}{\hat{e}(X_i)}\right)\right)\right]
**Step 4**

Let

.. math::
\tilde{Q}_0 &= \frac{1}{1+\exp\left(-Q-h_0 \frac{1}{1-\hat{e}(X)}\right)},\\
\tilde{Q}_1 &= \frac{1}{1+\exp\left(-Q-h_1 \frac{1}{\hat{e}(X)}\right)}.
The ATE estimate is the sample average of the differences between :math:`\tilde{Q}_1` and :math:`\tilde{Q}_0`, rescaled back to the original range of :math:`Y`.
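Steps 2-4 can be sketched as follows, assuming the step-1 estimates :math:`\hat{e}`, :math:`\hat{m}_0`, :math:`\hat{m}_1` are given as arrays (a minimal illustration, not the causalml implementation; the function name is hypothetical, and :math:`Q` in step 4 is taken here to be evaluated at the counterfactual arm):

```python
import numpy as np
from scipy.optimize import minimize

def tmle_ate(y, w, e_hat, m0_hat, m1_hat, eps=1e-3):
    # Step 2: scale Y and the outcome predictions into [0, 1], clipping away 0/1
    lo, hi = y.min(), y.max()
    y_t = (y - lo) / (hi - lo)
    m0_t = np.clip((m0_hat - lo) / (hi - lo), eps, 1 - eps)
    m1_t = np.clip((m1_hat - lo) / (hi - lo), eps, 1 - eps)
    m_w = np.where(w == 1, m1_t, m0_t)
    q = np.log(m_w / (1 - m_w))
    h0_cov = (1 - w) / (1 - e_hat)          # "clever covariates"
    h1_cov = w / e_hat

    # Step 3: fluctuate the initial fit by maximizing the pseudo log-likelihood
    def negloglik(h):
        logit = q + h[0] * h0_cov + h[1] * h1_cov
        return np.mean(y_t * np.log1p(np.exp(-logit))
                       + (1 - y_t) * np.log1p(np.exp(logit)))

    h0, h1 = minimize(negloglik, x0=[0.0, 0.0]).x

    # Step 4: targeted predictions under W=0 and W=1, then rescale the ATE
    q0 = 1.0 / (1.0 + np.exp(-(np.log(m0_t / (1 - m0_t)) + h0 / (1 - e_hat))))
    q1 = 1.0 / (1.0 + np.exp(-(np.log(m1_t / (1 - m1_t)) + h1 / e_hat)))
    return (q1 - q0).mean() * (hi - lo)
```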
38 changes: 37 additions & 1 deletion docs/refs.bib
Original file line number Diff line number Diff line change
Expand Up @@ -308,7 +308,7 @@ @inproceedings{ijcai2019-248
author = {Li, Ang and Pearl, Judea},
booktitle = {Proceedings of the Twenty-Eighth International Joint Conference on
Artificial Intelligence, {IJCAI-19}},
publisher = {International Joint Conferences on Artificial Intelligence Organization},
pages = {1793--1799},
year = {2019},
month = {7},
@@ -414,3 +414,39 @@ @article{zhao2020feature
journal={arXiv preprint arXiv:2005.03447},
year={2020}
}

@misc{kennedy2020optimal,
title={Optimal doubly robust estimation of heterogeneous causal effects},
author={Edward H. Kennedy},
year={2020},
eprint={2004.14497},
archivePrefix={arXiv},
primaryClass={math.ST}
}

@article{10.1111/ectj.12097,
author = {Chernozhukov, Victor and Chetverikov, Denis and Demirer, Mert and Duflo, Esther and Hansen, Christian and Newey, Whitney and Robins, James},
title = "{Double/debiased machine learning for treatment and structural parameters}",
journal = {The Econometrics Journal},
volume = {21},
number = {1},
pages = {C1-C68},
year = {2018},
month = {01},
abstract = "{We revisit the classic semi‐parametric problem of inference on a low‐dimensional parameter θ0 in the presence of high‐dimensional nuisance parameters η0. We depart from the classical setting by allowing for η0 to be so high‐dimensional that the traditional assumptions (e.g. Donsker properties) that limit complexity of the parameter space for this object break down. To estimate η0, we consider the use of statistical or machine learning (ML) methods, which are particularly well suited to estimation in modern, very high‐dimensional cases. ML methods perform well by employing regularization to reduce variance and trading off regularization bias with overfitting in practice. However, both regularization bias and overfitting in estimating η0 cause a heavy bias in estimators of θ0 that are obtained by naively plugging ML estimators of η0 into estimating equations for θ0. This bias results in the naive estimator failing to be N−1/2 consistent, where N is the sample size. We show that the impact of regularization bias and overfitting on estimation of the parameter of interest θ0 can be removed by using two simple, yet critical, ingredients: (1) using Neyman‐orthogonal moments/scores that have reduced sensitivity with respect to nuisance parameters to estimate θ0; (2) making use of cross‐fitting, which provides an efficient form of data‐splitting. We call the resulting set of methods double or debiased ML (DML). We verify that DML delivers point estimators that concentrate in an N−1/2‐neighbourhood of the true parameter values and are approximately unbiased and normally distributed, which allows construction of valid confidence statements. 
The generic statistical theory of DML is elementary and simultaneously relies on only weak theoretical requirements, which will admit the use of a broad array of modern ML methods for estimating the nuisance parameters, such as random forests, lasso, ridge, deep neural nets, boosted trees, and various hybrids and ensembles of these methods. We illustrate the general theory by applying it to provide theoretical properties of the following: DML applied to learn the main regression parameter in a partially linear regression model; DML applied to learn the coefficient on an endogenous variable in a partially linear instrumental variables model; DML applied to learn the average treatment effect and the average treatment effect on the treated under unconfoundedness; DML applied to learn the local average treatment effect in an instrumental variables setting. In addition to these theoretical applications, we also illustrate the use of DML in three empirical examples.}",
issn = {1368-4221},
doi = {10.1111/ectj.12097},
url = {https://doi.org/10.1111/ectj.12097},
eprint = {https://academic.oup.com/ectj/article-pdf/21/1/C1/27684918/ectj00c1.pdf},
}

@book{tmle,
author = {Laan, Mark and Rose, Sherri},
year = {2011},
month = {01},
pages = {},
title = {Targeted Learning: Causal Inference for Observational and Experimental Data},
publisher={Springer-Verlag New York},
isbn = {978-1-4419-9781-4},
doi = {10.1007/978-1-4419-9782-1}
}
12 changes: 7 additions & 5 deletions docs/validation.rst
Original file line number Diff line number Diff line change
@@ -92,7 +92,7 @@ Mechanism 4
Validation with Uplift Curve (AUUC)
----------------------------------

We can validate the estimation by evaluating and comparing the uplift gains with the AUUC (Area Under Uplift Curve), which calculates cumulative gains. Please find more details in the `meta_learners_with_synthetic_data.ipynb example notebook <https://github.com/uber/causalml/blob/master/examples/meta_learners_with_synthetic_data.ipynb>`_.

.. code-block:: python
@@ -112,6 +112,8 @@ We can validate the estimation by evaluating and comparing the uplift gains with
.. image:: ./_static/img/auuc_vis.png
:width: 629

For data with skewed treatment, it is sometimes advantageous to use :ref:`Targeted maximum likelihood estimation (TMLE) for ATE` to generate the AUUC curve for validation, as TMLE provides a more accurate estimate of the ATE. Please see the `validation_with_tmle.ipynb example notebook <https://github.com/uber/causalml/blob/master/examples/validation_with_tmle.ipynb>`_ for details.

Validation with Sensitivity Analysis
------------------------------------
Sensitivity analysis aims to check the robustness of the unconfoundedness assumption. If there is hidden bias (unobserved confounders), it determines how severe the bias would have to be to change the conclusion, by examining the average treatment effect estimate.
@@ -142,9 +144,9 @@ Selection Bias
~~~~~~~~~~~~~~

| `Blackwell (2013) <https://www.mattblackwell.org/files/papers/sens.pdf>`_ introduced an approach to sensitivity analysis for causal effects that directly models confounding or selection bias.
|
| One Sided Confounding Function: here, as the name implies, this function can detect sensitivity to one-sided selection bias, but it would fail to detect other deviations from ignorability. That is, it can only determine the bias resulting from the treatment group being on average better off or the control group being on average better off.
|
| Alignment Confounding Function: this type of bias is likely to occur when units select into treatment and control based on their predicted treatment effects.
|
|
| The sensitivity analysis is rigid in this way because the confounding function is not identified from the data, so the causal model in the last section is only identified conditional on a specific choice of that function. The goal of the sensitivity analysis is not to choose the "correct" confounding function, since we have no way of evaluating this correctness. By its very nature, unmeasured confounding is unmeasured. Rather, the goal is to identify plausible deviations from ignorability and test sensitivity to those deviations. The main harm that results from an incorrect specification of the confounding function is that hidden biases remain hidden.
