From 6297942531ed76fb9715560772e25a5d6caa1e46 Mon Sep 17 00:00:00 2001 From: Huigang Chen Date: Tue, 28 Dec 2021 15:23:31 -0800 Subject: [PATCH 01/15] Add documentations for DR learner and DRIV learner --- docs/methodology.rst | 56 ++++++++++++++++++++++++++++++++++++++++++++ docs/refs.bib | 27 ++++++++++++++++++++- 2 files changed, 82 insertions(+), 1 deletion(-) diff --git a/docs/methodology.rst b/docs/methodology.rst index a688dfe2..709631b8 100755 --- a/docs/methodology.rst +++ b/docs/methodology.rst @@ -110,6 +110,27 @@ Estimate treatment effects by minimising the R-loss, :math:`\hat{L}_n(\tau(x))`: where :math:`\hat{e}^{(-i)}(X_i)`, etc. denote the out-of-fold held-out predictions made without using the :math:`i`-th training sample. +Doubly Robust (DR) learner +~~~~~~~~~~~~~~~~~~~~~~~~~~ + +DR-learner :cite:`kennedy2020optimal` estiamtes the CATE via cross-fitting a doubly-robust score function in two stages as follows. We start by randomly split the data :math:`\{Y, X, W\}` into 3 partitions :math:`\{Y^i, X^i, W^i\}, i=\{1,2,3\}`. + +**Stage 1** + +Fit a propensity score model :math:`\hat{e}(x)` with machine learning using :math:`\{X^1, W^1\}`, and fit outcome regression models :math:`\hat{m}_0(x)` and :math:`\hat{m}_1(x)` for treated and untreated users with machine learning using :math:`\{Y^2, X^2, W^2\}`. + +**Stage 2** + +Use machine learning to fit the CATE model, :math:`\hat{\tau}(X)` from the pseudo-outcome + +..math:: + \phi = \frac{W-\hat{e}(X)}{\hat{e}(X)(1-\hat{e}(X))}\left(Y-\hat{m}_W(X))+\hat{m}_1(X)-\hat{m}_0(X) + +with :math:`\{Y^3, X^3, W^3\}` + +**Stage 3** + +Repeat Stage 1 and Stage 2 again twice. First use :math:`\{Y^2, X^2, W^2\}`, :math:`\{Y^3, X^3, W^3\}`, and :math:`\{Y^1, X^1, W^1\}` for the propensity score model, the outcome models, and the CATE model. Then use :math:`\{Y^3, X^3, W^3\}`, :math:`\{Y^2, X^2, W^2\}`, and :math:`\{Y^1, X^1, W^1\}` for the propensity score model, the outcome models, and the CATE model. The final CATE model is the average of the 3 CATE models. Tree-Based Algorithms --------------------- @@ -260,3 +281,38 @@ The instrumental variables approach attempts to estimate the effect of :math:`W` \frac{Cov(Y_i, Z_i)}{Cov(W_i, Z_i)} The most common method for instrumental variables estimation is the two-stage least squares (2SLS). In this approach, the cause variable :math:`W` is first regressed on the instrument :math:`Z`. Then, in the second stage, the outcome of interest :math:`Y` is regressed on the predicted value from the first-stage model. Intuitively, the effect of :math:`W` on :math:`Y` is estimated by using only the proportion of variation in :math:`W` due to variation in :math:`Z`. See :cite:`10.1257/jep.15.4.69` for a detailed discussion of the method. + +In many situations the treatment, :math:`W`, cannot be administered directly in an experimental setting. However one can randomly assign subjects into treatment/control groups so that subjects in the treatment group can be nudged to take the treatment. This is the case of noncompliance, where subjects may fail to comply with their assignment status, :math:`Z`, as to whether to take treatment or not. Similar to the section of Value optimization methods, in general there are 3 types of subjects in this situation, +* **Compliers** Those who will take the treatment if and only if they are assigned to the treatment group. +* **Always-Taker** Those who will take the treatment regardless which group they are assigned to. 
+* **Never-Taker** Those who wil not take the treatment regardless which group they are assigned to. +However one assumes that there is no Defier for identification purposes, i.e. those who will only take the treatment if they are assigned to the control group. + +In this case one can measure the treatment effect of Compliers, + +.. math:: + \hat{\tau_{Complier}}=\frac{E[Y|Z=1]-E[Y|Z=0]}{E[W|Z=1]-E[W|Z=0]} + +This is Local Average Treatment Effect (LATE). The estimator is also equivalent to 2SLS if we take the assignment status, :math:`Z`, as an instrument. + +Doubly Robust Instrumental Variable (DRIV) learner +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +We combine the idea from DR-learner :cite:`kennedy2020optimal` with the doubly robust score function for LATE described in :cite:`10.1111/ectj.12097` to estimate the conditional LATE. Towards that end, we start by randomly split the data :math:`\{Y, X, W, Z\}` into 3 partitions :math:`\{Y^i, X^i, W^i, Z^i\}, i=\{1,2,3\}`. + +**Stage 1** + +Fit propensity score models :math:`\hat{e}_0(x)` and :math:`\hat{e}_1(x)` for assigned and unassigned users using :math:`\{X^1, W^1, Z^1\}`, and fit outcome regression models :math:`\hat{m}_0(x)` and :math:`\hat{m}_1(x)` for assigned and unassigned users with machine learning using :math:`\{Y^2, X^2, Z^2\}`. Assignment probabiliy, :math:`p_Z`, can either be user provided or come from a simple model, since in most use cases assignment is random by design. + +**Stage 2** + +Use machine learning to fit the conditional LATE model, :math:`\hat{\tau}(X)` by minimizing the following loss function + +..math:: + L(\hat{\tau}(X)) = \hat{E} \big[\big(\hat{m}_1(X)-\hat{m}_0(X)+\frac{Z(Y-\hat{m}_1(X))}{p_Z}-\frac{(1-Z)(Y-\hat{m}_0(X))}{1-p_Z}-\big(\hat{e}_1(X)-\hat{e}_0(X)+\frac{Z(W-\hat{e}_1(X))}{p_Z}-\frac{(1-Z)(W-\hat{e}_0(X))}{1-p_Z}\big) \hat{tau}(X) \big)^2\big] + +with :math:`\{Y^3, X^3, W^3\}` + +**Stage 3** + +Similar to the DR-Leaner Repeat Stage 1 and Stage 2 again twice with different permutations of partitions for estimation. The final conditional LATE model is the average of the 3 conditional LATE models. diff --git a/docs/refs.bib b/docs/refs.bib index f8a038da..04137cd9 100755 --- a/docs/refs.bib +++ b/docs/refs.bib @@ -308,7 +308,7 @@ @inproceedings{ijcai2019-248 author = {Li, Ang and Pearl, Judea}, booktitle = {Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, {IJCAI-19}}, - publisher = {International Joint Conferences on Artificial Intelligence Organization}, + publisher = {International Joint Conferences on Artificial Intelligence Organization}, pages = {1793--1799}, year = {2019}, month = {7}, @@ -414,3 +414,28 @@ @article{zhao2020feature journal={arXiv preprint arXiv:2005.03447}, year={2020} } + +@misc{kennedy2020optimal, + title={Optimal doubly robust estimation of heterogeneous causal effects}, + author={Edward H. 
Kennedy}, + year={2020}, + eprint={2004.14497}, + archivePrefix={arXiv}, + primaryClass={math.ST} +} + +@article{10.1111/ectj.12097, + author = {Chernozhukov, Victor and Chetverikov, Denis and Demirer, Mert and Duflo, Esther and Hansen, Christian and Newey, Whitney and Robins, James}, + title = "{Double/debiased machine learning for treatment and structural parameters}", + journal = {The Econometrics Journal}, + volume = {21}, + number = {1}, + pages = {C1-C68}, + year = {2018}, + month = {01}, + abstract = "{We revisit the classic semi‐parametric problem of inference on a low‐dimensional parameter θ0 in the presence of high‐dimensional nuisance parameters η0. We depart from the classical setting by allowing for η0 to be so high‐dimensional that the traditional assumptions (e.g. Donsker properties) that limit complexity of the parameter space for this object break down. To estimate η0, we consider the use of statistical or machine learning (ML) methods, which are particularly well suited to estimation in modern, very high‐dimensional cases. ML methods perform well by employing regularization to reduce variance and trading off regularization bias with overfitting in practice. However, both regularization bias and overfitting in estimating η0 cause a heavy bias in estimators of θ0 that are obtained by naively plugging ML estimators of η0 into estimating equations for θ0. This bias results in the naive estimator failing to be N−1/2 consistent, where N is the sample size. We show that the impact of regularization bias and overfitting on estimation of the parameter of interest θ0 can be removed by using two simple, yet critical, ingredients: (1) using Neyman‐orthogonal moments/scores that have reduced sensitivity with respect to nuisance parameters to estimate θ0; (2) making use of cross‐fitting, which provides an efficient form of data‐splitting. We call the resulting set of methods double or debiased ML (DML). We verify that DML delivers point estimators that concentrate in an N−1/2‐neighbourhood of the true parameter values and are approximately unbiased and normally distributed, which allows construction of valid confidence statements. The generic statistical theory of DML is elementary and simultaneously relies on only weak theoretical requirements, which will admit the use of a broad array of modern ML methods for estimating the nuisance parameters, such as random forests, lasso, ridge, deep neural nets, boosted trees, and various hybrids and ensembles of these methods. We illustrate the general theory by applying it to provide theoretical properties of the following: DML applied to learn the main regression parameter in a partially linear regression model; DML applied to learn the coefficient on an endogenous variable in a partially linear instrumental variables model; DML applied to learn the average treatment effect and the average treatment effect on the treated under unconfoundedness; DML applied to learn the local average treatment effect in an instrumental variables setting. 
In addition to these theoretical applications, we also illustrate the use of DML in three empirical examples.}", + issn = {1368-4221}, + doi = {10.1111/ectj.12097}, + url = {https://doi.org/10.1111/ectj.12097}, + eprint = {https://academic.oup.com/ectj/article-pdf/21/1/C1/27684918/ectj00c1.pdf}, +} From a3477efbf426d7173df87371e8e4ea1dee6ef9c1 Mon Sep 17 00:00:00 2001 From: Huigang Chen Date: Tue, 28 Dec 2021 15:45:10 -0800 Subject: [PATCH 02/15] Format corrections --- docs/methodology.rst | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/docs/methodology.rst b/docs/methodology.rst index 709631b8..43da0290 100755 --- a/docs/methodology.rst +++ b/docs/methodology.rst @@ -123,7 +123,7 @@ Fit a propensity score model :math:`\hat{e}(x)` with machine learning using :mat Use machine learning to fit the CATE model, :math:`\hat{\tau}(X)` from the pseudo-outcome -..math:: +.. math:: \phi = \frac{W-\hat{e}(X)}{\hat{e}(X)(1-\hat{e}(X))}\left(Y-\hat{m}_W(X))+\hat{m}_1(X)-\hat{m}_0(X) with :math:`\{Y^3, X^3, W^3\}` @@ -283,9 +283,11 @@ The instrumental variables approach attempts to estimate the effect of :math:`W` The most common method for instrumental variables estimation is the two-stage least squares (2SLS). In this approach, the cause variable :math:`W` is first regressed on the instrument :math:`Z`. Then, in the second stage, the outcome of interest :math:`Y` is regressed on the predicted value from the first-stage model. Intuitively, the effect of :math:`W` on :math:`Y` is estimated by using only the proportion of variation in :math:`W` due to variation in :math:`Z`. See :cite:`10.1257/jep.15.4.69` for a detailed discussion of the method. In many situations the treatment, :math:`W`, cannot be administered directly in an experimental setting. However one can randomly assign subjects into treatment/control groups so that subjects in the treatment group can be nudged to take the treatment. This is the case of noncompliance, where subjects may fail to comply with their assignment status, :math:`Z`, as to whether to take treatment or not. Similar to the section of Value optimization methods, in general there are 3 types of subjects in this situation, + * **Compliers** Those who will take the treatment if and only if they are assigned to the treatment group. * **Always-Taker** Those who will take the treatment regardless which group they are assigned to. * **Never-Taker** Those who wil not take the treatment regardless which group they are assigned to. + However one assumes that there is no Defier for identification purposes, i.e. those who will only take the treatment if they are assigned to the control group. In this case one can measure the treatment effect of Compliers, @@ -308,7 +310,7 @@ Fit propensity score models :math:`\hat{e}_0(x)` and :math:`\hat{e}_1(x)` for as Use machine learning to fit the conditional LATE model, :math:`\hat{\tau}(X)` by minimizing the following loss function -..math:: +.. 
math:: L(\hat{\tau}(X)) = \hat{E} \big[\big(\hat{m}_1(X)-\hat{m}_0(X)+\frac{Z(Y-\hat{m}_1(X))}{p_Z}-\frac{(1-Z)(Y-\hat{m}_0(X))}{1-p_Z}-\big(\hat{e}_1(X)-\hat{e}_0(X)+\frac{Z(W-\hat{e}_1(X))}{p_Z}-\frac{(1-Z)(W-\hat{e}_0(X))}{1-p_Z}\big) \hat{tau}(X) \big)^2\big] with :math:`\{Y^3, X^3, W^3\}` From 0fd37632b4bb6199a3047a4f1d76ef22f3f9d1df Mon Sep 17 00:00:00 2001 From: Huigang Chen Date: Tue, 28 Dec 2021 16:35:39 -0800 Subject: [PATCH 03/15] Format corrections --- docs/methodology.rst | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/docs/methodology.rst b/docs/methodology.rst index 43da0290..12176969 100755 --- a/docs/methodology.rst +++ b/docs/methodology.rst @@ -124,7 +124,7 @@ Fit a propensity score model :math:`\hat{e}(x)` with machine learning using :mat Use machine learning to fit the CATE model, :math:`\hat{\tau}(X)` from the pseudo-outcome .. math:: - \phi = \frac{W-\hat{e}(X)}{\hat{e}(X)(1-\hat{e}(X))}\left(Y-\hat{m}_W(X))+\hat{m}_1(X)-\hat{m}_0(X) + \phi = \frac{W-\hat{e}(X)}{\hat{e}(X)(1-\hat{e}(X))}\left(Y-\hat{m}_W(X)\right)+\hat{m}_1(X)-\hat{m}_0(X) with :math:`\{Y^3, X^3, W^3\}` @@ -311,7 +311,8 @@ Fit propensity score models :math:`\hat{e}_0(x)` and :math:`\hat{e}_1(x)` for as Use machine learning to fit the conditional LATE model, :math:`\hat{\tau}(X)` by minimizing the following loss function .. math:: - L(\hat{\tau}(X)) = \hat{E} \big[\big(\hat{m}_1(X)-\hat{m}_0(X)+\frac{Z(Y-\hat{m}_1(X))}{p_Z}-\frac{(1-Z)(Y-\hat{m}_0(X))}{1-p_Z}-\big(\hat{e}_1(X)-\hat{e}_0(X)+\frac{Z(W-\hat{e}_1(X))}{p_Z}-\frac{(1-Z)(W-\hat{e}_0(X))}{1-p_Z}\big) \hat{tau}(X) \big)^2\big] + L(\hat{\tau}(X)) = \hat{E} \big[\big(\hat{m}_1(X)-\hat{m}_0(X)+\frac{Z(Y-\hat{m}_1(X))}{p_Z}-\frac{(1-Z)(Y-\hat{m}_0(X))}{1-p_Z} \\ + -\big(\hat{e}_1(X)-\hat{e}_0(X)+\frac{Z(W-\hat{e}_1(X))}{p_Z}-\frac{(1-Z)(W-\hat{e}_0(X))}{1-p_Z}\big) \hat{\tau}(X) \big)^2\big] with :math:`\{Y^3, X^3, W^3\}` From b817ffc75f5d85e41e7ac9ffd6dc3ca7294107f4 Mon Sep 17 00:00:00 2001 From: Huigang Chen Date: Tue, 28 Dec 2021 16:40:10 -0800 Subject: [PATCH 04/15] Format corrections --- docs/about.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/about.rst b/docs/about.rst index da74434e..041ca936 100644 --- a/docs/about.rst +++ b/docs/about.rst @@ -21,7 +21,7 @@ The package currently supports the following methods: - :ref:`T-learner` - :ref:`X-learner` - :ref:`R-learner` - - Doubly Robust (DR) learner + - :ref:`Doubly Robust (DR) learner` - TMLE learner - Instrumental variables algorithms - 2-Stage Least Squares (2SLS) From 7b2f78160565b5af0b2180a3c1660b2027acd50d Mon Sep 17 00:00:00 2001 From: Huigang Chen Date: Tue, 28 Dec 2021 16:42:34 -0800 Subject: [PATCH 05/15] Format corrections --- docs/methodology.rst | 48 ++++++++++++++++++++++---------------------- 1 file changed, 24 insertions(+), 24 deletions(-) diff --git a/docs/methodology.rst b/docs/methodology.rst index 12176969..a880339e 100755 --- a/docs/methodology.rst +++ b/docs/methodology.rst @@ -132,6 +132,29 @@ with :math:`\{Y^3, X^3, W^3\}` Repeat Stage 1 and Stage 2 again twice. First use :math:`\{Y^2, X^2, W^2\}`, :math:`\{Y^3, X^3, W^3\}`, and :math:`\{Y^1, X^1, W^1\}` for the propensity score model, the outcome models, and the CATE model. Then use :math:`\{Y^3, X^3, W^3\}`, :math:`\{Y^2, X^2, W^2\}`, and :math:`\{Y^1, X^1, W^1\}` for the propensity score model, the outcome models, and the CATE model. The final CATE model is the average of the 3 CATE models. 
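As an illustration of the three stages above, a minimal sketch of the cross-fitted DR-learner is given below. The fold rotation, the helper name ``dr_learner_fit``, and the scikit-learn base models are illustrative choices, not the package's implementation.

.. code-block:: python

    import numpy as np
    from sklearn.ensemble import GradientBoostingRegressor
    from sklearn.linear_model import LogisticRegression

    def dr_learner_fit(X, W, Y, seed=42):
        folds = np.random.default_rng(seed).integers(0, 3, size=len(Y))  # random 3-way partition
        cate_models = []
        for a, b, c in [(0, 1, 2), (1, 2, 0), (2, 0, 1)]:
            # Stage 1: propensity score model on fold a, outcome models on fold b
            e = LogisticRegression().fit(X[folds == a], W[folds == a])
            mask0, mask1 = (folds == b) & (W == 0), (folds == b) & (W == 1)
            m0 = GradientBoostingRegressor().fit(X[mask0], Y[mask0])
            m1 = GradientBoostingRegressor().fit(X[mask1], Y[mask1])
            # Stage 2: pseudo-outcome phi on fold c, then regress phi on X
            Xc, Wc, Yc = X[folds == c], W[folds == c], Y[folds == c]
            e_hat = np.clip(e.predict_proba(Xc)[:, 1], 0.01, 0.99)
            m0_hat, m1_hat = m0.predict(Xc), m1.predict(Xc)
            m_w = np.where(Wc == 1, m1_hat, m0_hat)
            phi = (Wc - e_hat) / (e_hat * (1 - e_hat)) * (Yc - m_w) + m1_hat - m0_hat
            cate_models.append(GradientBoostingRegressor().fit(Xc, phi))
        # Stage 3: the final CATE estimate averages the three fitted models
        return lambda X_new: np.mean([m.predict(X_new) for m in cate_models], axis=0)

Clipping the estimated propensity scores away from 0 and 1 guards against the highly variable weights that extreme propensity scores can produce.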
+Doubly Robust Instrumental Variable (DRIV) learner +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +We combine the idea from DR-learner :cite:`kennedy2020optimal` with the doubly robust score function for LATE described in :cite:`10.1111/ectj.12097` to estimate the conditional LATE. Towards that end, we start by randomly split the data :math:`\{Y, X, W, Z\}` into 3 partitions :math:`\{Y^i, X^i, W^i, Z^i\}, i=\{1,2,3\}`. + +**Stage 1** + +Fit propensity score models :math:`\hat{e}_0(x)` and :math:`\hat{e}_1(x)` for assigned and unassigned users using :math:`\{X^1, W^1, Z^1\}`, and fit outcome regression models :math:`\hat{m}_0(x)` and :math:`\hat{m}_1(x)` for assigned and unassigned users with machine learning using :math:`\{Y^2, X^2, Z^2\}`. Assignment probabiliy, :math:`p_Z`, can either be user provided or come from a simple model, since in most use cases assignment is random by design. + +**Stage 2** + +Use machine learning to fit the conditional LATE model, :math:`\hat{\tau}(X)` by minimizing the following loss function + +.. math:: + L(\hat{\tau}(X)) = \hat{E} \big[\big(\hat{m}_1(X)-\hat{m}_0(X)+\frac{Z(Y-\hat{m}_1(X))}{p_Z}-\frac{(1-Z)(Y-\hat{m}_0(X))}{1-p_Z} \\ + -\big(\hat{e}_1(X)-\hat{e}_0(X)+\frac{Z(W-\hat{e}_1(X))}{p_Z}-\frac{(1-Z)(W-\hat{e}_0(X))}{1-p_Z}\big) \hat{\tau}(X) \big)^2\big] + +with :math:`\{Y^3, X^3, W^3\}` + +**Stage 3** + +Similar to the DR-Leaner Repeat Stage 1 and Stage 2 again twice with different permutations of partitions for estimation. The final conditional LATE model is the average of the 3 conditional LATE models. + Tree-Based Algorithms --------------------- @@ -293,29 +316,6 @@ However one assumes that there is no Defier for identification purposes, i.e. th In this case one can measure the treatment effect of Compliers, .. math:: - \hat{\tau_{Complier}}=\frac{E[Y|Z=1]-E[Y|Z=0]}{E[W|Z=1]-E[W|Z=0]} + \hat{\tau}_{Complier}=\frac{E[Y|Z=1]-E[Y|Z=0]}{E[W|Z=1]-E[W|Z=0]} This is Local Average Treatment Effect (LATE). The estimator is also equivalent to 2SLS if we take the assignment status, :math:`Z`, as an instrument. - -Doubly Robust Instrumental Variable (DRIV) learner -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -We combine the idea from DR-learner :cite:`kennedy2020optimal` with the doubly robust score function for LATE described in :cite:`10.1111/ectj.12097` to estimate the conditional LATE. Towards that end, we start by randomly split the data :math:`\{Y, X, W, Z\}` into 3 partitions :math:`\{Y^i, X^i, W^i, Z^i\}, i=\{1,2,3\}`. - -**Stage 1** - -Fit propensity score models :math:`\hat{e}_0(x)` and :math:`\hat{e}_1(x)` for assigned and unassigned users using :math:`\{X^1, W^1, Z^1\}`, and fit outcome regression models :math:`\hat{m}_0(x)` and :math:`\hat{m}_1(x)` for assigned and unassigned users with machine learning using :math:`\{Y^2, X^2, Z^2\}`. Assignment probabiliy, :math:`p_Z`, can either be user provided or come from a simple model, since in most use cases assignment is random by design. - -**Stage 2** - -Use machine learning to fit the conditional LATE model, :math:`\hat{\tau}(X)` by minimizing the following loss function - -.. 
math:: - L(\hat{\tau}(X)) = \hat{E} \big[\big(\hat{m}_1(X)-\hat{m}_0(X)+\frac{Z(Y-\hat{m}_1(X))}{p_Z}-\frac{(1-Z)(Y-\hat{m}_0(X))}{1-p_Z} \\ - -\big(\hat{e}_1(X)-\hat{e}_0(X)+\frac{Z(W-\hat{e}_1(X))}{p_Z}-\frac{(1-Z)(W-\hat{e}_0(X))}{1-p_Z}\big) \hat{\tau}(X) \big)^2\big] - -with :math:`\{Y^3, X^3, W^3\}` - -**Stage 3** - -Similar to the DR-Leaner Repeat Stage 1 and Stage 2 again twice with different permutations of partitions for estimation. The final conditional LATE model is the average of the 3 conditional LATE models. From b166905cb288cd7b29614d6629b4f2f8f339fa6c Mon Sep 17 00:00:00 2001 From: Huigang Chen Date: Wed, 29 Dec 2021 09:22:41 -0800 Subject: [PATCH 06/15] Format corrections --- docs/methodology.rst | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/docs/methodology.rst b/docs/methodology.rst index a880339e..fa7671b1 100755 --- a/docs/methodology.rst +++ b/docs/methodology.rst @@ -146,8 +146,10 @@ Fit propensity score models :math:`\hat{e}_0(x)` and :math:`\hat{e}_1(x)` for as Use machine learning to fit the conditional LATE model, :math:`\hat{\tau}(X)` by minimizing the following loss function .. math:: - L(\hat{\tau}(X)) = \hat{E} \big[\big(\hat{m}_1(X)-\hat{m}_0(X)+\frac{Z(Y-\hat{m}_1(X))}{p_Z}-\frac{(1-Z)(Y-\hat{m}_0(X))}{1-p_Z} \\ - -\big(\hat{e}_1(X)-\hat{e}_0(X)+\frac{Z(W-\hat{e}_1(X))}{p_Z}-\frac{(1-Z)(W-\hat{e}_0(X))}{1-p_Z}\big) \hat{\tau}(X) \big)^2\big] +\begin{align*} + L(\hat{\tau}(X)) = \hat{E} &\left[\left(\hat{m}_1(X)-\hat{m}_0(X)+\frac{Z(Y-\hat{m}_1(X))}{p_Z}-\frac{(1-Z)(Y-\hat{m}_0(X))}{1-p_Z} \right.\right.\\ + &\left.\left.\quad -\Big(\hat{e}_1(X)-\hat{e}_0(X)+\frac{Z(W-\hat{e}_1(X))}{p_Z}-\frac{(1-Z)(W-\hat{e}_0(X))}{1-p_Z}\Big) \hat{\tau}(X) \right)^2\right] +\end{align*} with :math:`\{Y^3, X^3, W^3\}` From 2c3b111cba9372547284560ade0be8838605c517 Mon Sep 17 00:00:00 2001 From: Huigang Chen Date: Wed, 29 Dec 2021 09:33:39 -0800 Subject: [PATCH 07/15] Format corrections --- docs/methodology.rst | 2 -- 1 file changed, 2 deletions(-) diff --git a/docs/methodology.rst b/docs/methodology.rst index fa7671b1..04da58d5 100755 --- a/docs/methodology.rst +++ b/docs/methodology.rst @@ -146,10 +146,8 @@ Fit propensity score models :math:`\hat{e}_0(x)` and :math:`\hat{e}_1(x)` for as Use machine learning to fit the conditional LATE model, :math:`\hat{\tau}(X)` by minimizing the following loss function .. math:: -\begin{align*} L(\hat{\tau}(X)) = \hat{E} &\left[\left(\hat{m}_1(X)-\hat{m}_0(X)+\frac{Z(Y-\hat{m}_1(X))}{p_Z}-\frac{(1-Z)(Y-\hat{m}_0(X))}{1-p_Z} \right.\right.\\ &\left.\left.\quad -\Big(\hat{e}_1(X)-\hat{e}_0(X)+\frac{Z(W-\hat{e}_1(X))}{p_Z}-\frac{(1-Z)(W-\hat{e}_0(X))}{1-p_Z}\Big) \hat{\tau}(X) \right)^2\right] -\end{align*} with :math:`\{Y^3, X^3, W^3\}` From f5d73c90d6b7e4692cb710a2e6dc218e6baddb17 Mon Sep 17 00:00:00 2001 From: Huigang Chen Date: Wed, 29 Dec 2021 10:09:19 -0800 Subject: [PATCH 08/15] Format corrections --- docs/methodology.rst | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/docs/methodology.rst b/docs/methodology.rst index 04da58d5..c8758c1e 100755 --- a/docs/methodology.rst +++ b/docs/methodology.rst @@ -143,7 +143,7 @@ Fit propensity score models :math:`\hat{e}_0(x)` and :math:`\hat{e}_1(x)` for as **Stage 2** -Use machine learning to fit the conditional LATE model, :math:`\hat{\tau}(X)` by minimizing the following loss function +Use machine learning to fit the conditional :ref:`LATE` model, :math:`\hat{\tau}(X)` by minimizing the following loss function .. 
math:: L(\hat{\tau}(X)) = \hat{E} &\left[\left(\hat{m}_1(X)-\hat{m}_0(X)+\frac{Z(Y-\hat{m}_1(X))}{p_Z}-\frac{(1-Z)(Y-\hat{m}_0(X))}{1-p_Z} \right.\right.\\ @@ -305,6 +305,9 @@ The instrumental variables approach attempts to estimate the effect of :math:`W` The most common method for instrumental variables estimation is the two-stage least squares (2SLS). In this approach, the cause variable :math:`W` is first regressed on the instrument :math:`Z`. Then, in the second stage, the outcome of interest :math:`Y` is regressed on the predicted value from the first-stage model. Intuitively, the effect of :math:`W` on :math:`Y` is estimated by using only the proportion of variation in :math:`W` due to variation in :math:`Z`. See :cite:`10.1257/jep.15.4.69` for a detailed discussion of the method. +LATE +~~~~ + In many situations the treatment, :math:`W`, cannot be administered directly in an experimental setting. However one can randomly assign subjects into treatment/control groups so that subjects in the treatment group can be nudged to take the treatment. This is the case of noncompliance, where subjects may fail to comply with their assignment status, :math:`Z`, as to whether to take treatment or not. Similar to the section of Value optimization methods, in general there are 3 types of subjects in this situation, * **Compliers** Those who will take the treatment if and only if they are assigned to the treatment group. From 7f9920b404a9731e4e11b9f7b45d8baa22e366cf Mon Sep 17 00:00:00 2001 From: Huigang Chen Date: Wed, 29 Dec 2021 13:53:47 -0800 Subject: [PATCH 09/15] Add TMLE --- docs/about.rst | 5 ++--- docs/methodology.rst | 53 +++++++++++++++++++++++++++++++++++++++----- docs/refs.bib | 10 +++++++++ 3 files changed, 60 insertions(+), 8 deletions(-) diff --git a/docs/about.rst b/docs/about.rst index 041ca936..4bb58789 100644 --- a/docs/about.rst +++ b/docs/about.rst @@ -22,10 +22,9 @@ The package currently supports the following methods: - :ref:`X-learner` - :ref:`R-learner` - :ref:`Doubly Robust (DR) learner` - - TMLE learner - Instrumental variables algorithms - - 2-Stage Least Squares (2SLS) - - Doubly Robust (DR) IV + - :ref:`2-Stage Least Squares (2SLS)` + - :ref:`Doubly Robust Instrumental Variable (DRIV) learner` - Neural network based algorithms - CEVAE - DragonNet diff --git a/docs/methodology.rst b/docs/methodology.rst index c8758c1e..5e829e92 100755 --- a/docs/methodology.rst +++ b/docs/methodology.rst @@ -295,20 +295,30 @@ In this way, the IPTW approach can be seen as creating an artificial population One of the possible benefits of IPTW compared to matching is that less data may be discarded due to lack of overlap between treated and non-treated units. A known problem with the approach is that extreme propensity scores can generate highly variable estimators. Different methods have been proposed for trimming and normalizing the IPT weights (:cite:`https://doi.org/10.1111/1468-0262.00442`). An overview of the IPTW approach can be found in :cite:`https://doi.org/10.1002/sim.6607`. -Instrumental variables -~~~~~~~~~~~~~~~~~~~~~~ +2-Stage Least Squares (2SLS) +~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -The instrumental variables approach attempts to estimate the effect of :math:`W` on :math:`Y` with the help of a third variable :math:`Z` that is correlated with :math:`W` but is uncorrelated with the error term for :math:`Y`. In other words, the instrument :math:`Z` is only related with :math:`Y` through the directed path that goes through :math:`W`. 
If these conditions are satisfied, the effect of :math:`W` on :math:`Y` can be estimated using the sample analog of: +One of the basic requirements for identifying the treatment effect of :math:`W` on :math:`Y` is that :math:`W` is orthogonal to the potential outcome of :math:`Y`, conditional on the covariates :math:`X`. This may be violated if both :math:`W` and :math:`Y` are affected by an unobserved variable, the error term after removing the true effect of :math:`W` from :math:`Y`, that is not in :math:`X`. In this case, the instrumental variables approach attempts to estimate the effect of :math:`W` on :math:`Y` with the help of a third variable :math:`Z` that is correlated with :math:`W` but is uncorrelated with the error term. In other words, the instrument :math:`Z` is only related with :math:`Y` through the directed path that goes through :math:`W`. If these conditions are satisfied, in the case without covariates, the effect of :math:`W` on :math:`Y` can be estimated using the sample analog of: .. math:: \frac{Cov(Y_i, Z_i)}{Cov(W_i, Z_i)} -The most common method for instrumental variables estimation is the two-stage least squares (2SLS). In this approach, the cause variable :math:`W` is first regressed on the instrument :math:`Z`. Then, in the second stage, the outcome of interest :math:`Y` is regressed on the predicted value from the first-stage model. Intuitively, the effect of :math:`W` on :math:`Y` is estimated by using only the proportion of variation in :math:`W` due to variation in :math:`Z`. See :cite:`10.1257/jep.15.4.69` for a detailed discussion of the method. +The most common method for instrumental variables estimation is the two-stage least squares (2SLS). In this approach, the cause variable :math:`W` is first regressed on the instrument :math:`Z`. Then, in the second stage, the outcome of interest :math:`Y` is regressed on the predicted value from the first-stage model. Intuitively, the effect of :math:`W` on :math:`Y` is estimated by using only the proportion of variation in :math:`W` due to variation in :math:`Z`. Specifically, assume that we have the linear model + +.. math:: + Y = W \alpha + X \beta + u = \Xi \gamma + u + +Here for convenience we let :math:`\Xi=[W, X]` and :math:`\gamma=[\alpha', \beta']'`. Assume that we have instrumental variables :math:`Z` whose number of columns is at least the number of columns of :math:`W`, let :math:`Omega=[Z, X]`, 2SLS estimator is as follows + +.. math:: + \hat{\gamma}_{2SLS} = \left[\Xi'\Omega (\Omega'\Omega)^{-1} \Omega' \Xi\right]^{-1}\left[\Xi'\Omega'(\Omega'\Omega)^{-1}\Omega'Y\right]. + +See :cite:`10.1257/jep.15.4.69` for a detailed discussion of the method. LATE ~~~~ -In many situations the treatment, :math:`W`, cannot be administered directly in an experimental setting. However one can randomly assign subjects into treatment/control groups so that subjects in the treatment group can be nudged to take the treatment. This is the case of noncompliance, where subjects may fail to comply with their assignment status, :math:`Z`, as to whether to take treatment or not. Similar to the section of Value optimization methods, in general there are 3 types of subjects in this situation, +In many situations the treatment :math:`W` may depend on user's own choice and cannot be administered directly in an experimental setting. However one can randomly assign users into treatment/control groups so that users in the treatment group can be nudged to take the treatment. 
This is the case of noncompliance, where users may fail to comply with their assignment status, :math:`Z`, as to whether to take treatment or not. Similar to the section of Value optimization methods, in general there are 3 types of users in this situation, * **Compliers** Those who will take the treatment if and only if they are assigned to the treatment group. * **Always-Taker** Those who will take the treatment regardless which group they are assigned to. @@ -322,3 +332,36 @@ In this case one can measure the treatment effect of Compliers, \hat{\tau}_{Complier}=\frac{E[Y|Z=1]-E[Y|Z=0]}{E[W|Z=1]-E[W|Z=0]} This is Local Average Treatment Effect (LATE). The estimator is also equivalent to 2SLS if we take the assignment status, :math:`Z`, as an instrument. + + +Targeted maximum likelihood estimation (TMLE) for ATE +----------------------------------------------------- + +Targeted maximum likelihood estimation (TMLE) :cite:`tmle` provides a doubly robust semiparametric method that "targets" directly on the average treatment effect with the aid from machine learning algorithms. Compared to other methods including outcome regression and inverse probability of treatment weighting, TMLE usually gives better performance especially when dealing with skewed treatment and outliers. + +Given binary treatment :math:`W`, covariates :math:`X`, and outcome :math:`Y`, the TMLE for ATE is performed in the following steps + +**Step 1** + +Use cross fit to estimate the propensity score :math:`\hat{e}(x)`, the predicted outcome for treated :math:`\hat{m}_1(x)`, and predicted outcome for control :math:`\hat{m}_0(x)` with machine learning. + +**Step 2** + +Scale :math:`Y` into :math:`\tilde{Y}=\frac{Y-\min Y}{\max Y - \min Y}` so that :math:`\tilde{Y} \in [0,1]`. Use the same scale function to transform :math:`\hat{m}_i(x)` into :math:`\tilde{m}_i(x)`, :math:`i=0,1`. Clip the scaled functions so that their values stay in the unit interval. + +**Step 3** + +Let :math:`Q=\log(\tilde{m}_W(X)/(1-\tilde{m}_W(X)))`. Maximize the following pseudo log-likelihood function + +.. math:: + \max_{h_0, h_1} -\frac{1}{N} \sum_i & \left[ \tilde{Y}_i \log \left(1+\exp(-Q_i-h_0 \frac{1-W}{1-\hat{e}(X_i)}-h_1 \frac{W}{\hat{e}(X_i)} \right) \right. \\ + &\quad\left.+(1-tilde{Y}_i)\log\left(1+\exp(Q_i+h_0\frac{1-W}{1-\hat{e}(X_i)}+h_1\frac{W}{\hat{e}(X_i)}\right)\right] + +**Step 4** + +Let +.. math:: + \tilde{Q}_0^* = \frac{1}{1+\exp\left(-Q-h_0 \frac{1}{1-\hat{e}(X)\right)},\\ + \tilde{Q}_1^* = \frac{1}{1+\exp\left(-Q-h_1 \frac{1}{\hat{e}(X)}\right)}. + +The ATE estimate is the sample average of the differences of :math:`\tilde{Q}_1^*` and :math:`\tilde{Q}_0^*` after rescale to the original range. 
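To make the four steps concrete, a minimal sketch of the targeting step is given below. It assumes the cross-fitted nuisance predictions :math:`\hat{e}(x)`, :math:`\hat{m}_0(x)`, and :math:`\hat{m}_1(x)` from Step 1 are already available as arrays; the helper name ``tmle_ate`` and the optimizer choice are illustrative, not the package's implementation.

.. code-block:: python

    import numpy as np
    from scipy.optimize import minimize
    from scipy.special import expit, logit

    def tmle_ate(Y, W, e_hat, m0_hat, m1_hat, eps=1e-3):
        # Step 2: scale the outcome and the outcome predictions to the unit interval
        y_min, y_max = Y.min(), Y.max()
        Y_s = (Y - y_min) / (y_max - y_min)
        clip = lambda v: np.clip((v - y_min) / (y_max - y_min), eps, 1 - eps)
        m0_s, m1_s = clip(m0_hat), clip(m1_hat)
        e_hat = np.clip(e_hat, eps, 1 - eps)
        Q = logit(np.where(W == 1, m1_s, m0_s))
        H0, H1 = (1 - W) / (1 - e_hat), W / e_hat  # "clever covariates"

        # Step 3: maximize the pseudo log-likelihood over the fluctuation (h0, h1)
        def neg_loglik(h):
            p = np.clip(expit(Q + h[0] * H0 + h[1] * H1), eps, 1 - eps)
            return -np.mean(Y_s * np.log(p) + (1 - Y_s) * np.log(1 - p))

        h0, h1 = minimize(neg_loglik, x0=np.zeros(2), method="BFGS").x

        # Step 4: targeted predictions under control and treatment
        Q0 = expit(logit(m0_s) + h0 / (1 - e_hat))
        Q1 = expit(logit(m1_s) + h1 / e_hat)
        return (Q1 - Q0).mean() * (y_max - y_min)  # rescale back to the original range

Note that the Step 4 update in this sketch applies the fitted fluctuation to the arm-specific predictions :math:`\hat{m}_0(x)` and :math:`\hat{m}_1(x)` rather than to the observed-arm prediction.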
diff --git a/docs/refs.bib b/docs/refs.bib index 04137cd9..38fd64a8 100755 --- a/docs/refs.bib +++ b/docs/refs.bib @@ -439,3 +439,13 @@ @article{10.1111/ectj.12097 url = {https://doi.org/10.1111/ectj.12097}, eprint = {https://academic.oup.com/ectj/article-pdf/21/1/C1/27684918/ectj00c1.pdf}, } + +@book{tmle, +author = {Laan, Mark and Rose, Sherri}, +year = {2011}, +month = {01}, +pages = {}, +title = {Targeted Learning: Causal Inference for Observational and Experimental Data}, +isbn = {978-1-4419-9781-4}, +doi = {10.1007/978-1-4419-9782-1} +} From 408fc320c62a95b0dbed4bf42ab9b15111d06059 Mon Sep 17 00:00:00 2001 From: Huigang Chen Date: Wed, 29 Dec 2021 14:08:34 -0800 Subject: [PATCH 10/15] Add TMLE Corrected Citation --- docs/refs.bib | 1 + 1 file changed, 1 insertion(+) diff --git a/docs/refs.bib b/docs/refs.bib index 38fd64a8..0bd302c6 100755 --- a/docs/refs.bib +++ b/docs/refs.bib @@ -446,6 +446,7 @@ @book{tmle month = {01}, pages = {}, title = {Targeted Learning: Causal Inference for Observational and Experimental Data}, +publisher={Springer-Verlag New York}, isbn = {978-1-4419-9781-4}, doi = {10.1007/978-1-4419-9782-1} } From d608d6122120d10c189d66ca77cf5dd46b1da622 Mon Sep 17 00:00:00 2001 From: Huigang Chen Date: Wed, 29 Dec 2021 14:36:04 -0800 Subject: [PATCH 11/15] Format correction --- docs/validation.rst | 12 +++++++----- 1 file changed, 7 insertions(+), 5 deletions(-) diff --git a/docs/validation.rst b/docs/validation.rst index 9bb99b4c..7f742ea3 100644 --- a/docs/validation.rst +++ b/docs/validation.rst @@ -92,7 +92,7 @@ Mechanism 4 Validation with Uplift Curve (AUUC) ---------------------------------- -We can validate the estimation by evaluating and comparing the uplift gains with AUUC (Area Under Uplift Curve), it calculates cumulative gains and please find more details in `meta_learners_with_synthetic_data.ipynb example notebook `_. +We can validate the estimation by evaluating and comparing the uplift gains with AUUC (Area Under Uplift Curve), it calculates cumulative gains. Please find more details in `meta_learners_with_synthetic_data.ipynb example notebook `. .. code-block:: python @@ -112,6 +112,8 @@ We can validate the estimation by evaluating and comparing the uplift gains with .. image:: ./_static/img/auuc_vis.png :width: 629 +For data with skewed treatment, it is sometimes advantageous to use :ref:`Targeted maximum likelihood estimation (TMLE) for ATE` to generate the AUUC curve for validation, as TMLE provides a more accurate estimation of ATE. Please find `validation_with_tmle.ipynb example notebook ` for details. + Validation with Sensitivity Analysis ---------------------------------- Sensitivity analysis aim to check the robustness of the unconfoundeness assumption. If there is hidden bias (unobserved confounders), it detemineds how severe whould have to be to change conclusion by examine the average treatment effect estimation. @@ -142,9 +144,9 @@ Selection Bias ~~~~~~~~~~~~~~ | `Blackwell(2013) ` introduced an approach to sensitivity analysis for causal effects that directly models confounding or selection bias. -| -| One Sided Confounding Function: here as the name implies, this function can detect sensitivity to one-sided selection bias, but it would fail to detect other deviations from ignobility. That is, it can only determine the bias resulting from the treatment group being on average better off or the control group being on average better off. 
-| +| +| One Sided Confounding Function: here as the name implies, this function can detect sensitivity to one-sided selection bias, but it would fail to detect other deviations from ignobility. That is, it can only determine the bias resulting from the treatment group being on average better off or the control group being on average better off. +| | Alignment Confounding Function: this type of bias is likely to occur when units select into treatment and control based on their predicted treatment effects -| +| | The sensitivity analysis is rigid in this way because the confounding function is not identified from the data, so that the causal model in the last section is only identified conditional on a specific choice of that function. The goal of the sensitivity analysis is not to choose the “correct” confounding function, since we have no way of evaluating this correctness. By its very nature, unmeasured confounding is unmeasured. Rather, the goal is to identify plausible deviations from ignobility and test sensitivity to those deviations. The main harm that results from the incorrect specification of the confounding function is that hidden biases remain hidden. From e8b25b56b7ac18e1fbaf2ff3db7815a5f7dd17b3 Mon Sep 17 00:00:00 2001 From: Huigang Chen Date: Wed, 29 Dec 2021 14:36:19 -0800 Subject: [PATCH 12/15] Format correction --- docs/methodology.rst | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/docs/methodology.rst b/docs/methodology.rst index 5e829e92..99a8d66a 100755 --- a/docs/methodology.rst +++ b/docs/methodology.rst @@ -308,7 +308,7 @@ The most common method for instrumental variables estimation is the two-stage le .. math:: Y = W \alpha + X \beta + u = \Xi \gamma + u -Here for convenience we let :math:`\Xi=[W, X]` and :math:`\gamma=[\alpha', \beta']'`. Assume that we have instrumental variables :math:`Z` whose number of columns is at least the number of columns of :math:`W`, let :math:`Omega=[Z, X]`, 2SLS estimator is as follows +Here for convenience we let :math:`\Xi=[W, X]` and :math:`\gamma=[\alpha', \beta']'`. Assume that we have instrumental variables :math:`Z` whose number of columns is at least the number of columns of :math:`W`, let :math:`\Omega=[Z, X]`, 2SLS estimator is as follows .. math:: \hat{\gamma}_{2SLS} = \left[\Xi'\Omega (\Omega'\Omega)^{-1} \Omega' \Xi\right]^{-1}\left[\Xi'\Omega'(\Omega'\Omega)^{-1}\Omega'Y\right]. @@ -355,11 +355,12 @@ Let :math:`Q=\log(\tilde{m}_W(X)/(1-\tilde{m}_W(X)))`. Maximize the following ps .. math:: \max_{h_0, h_1} -\frac{1}{N} \sum_i & \left[ \tilde{Y}_i \log \left(1+\exp(-Q_i-h_0 \frac{1-W}{1-\hat{e}(X_i)}-h_1 \frac{W}{\hat{e}(X_i)} \right) \right. \\ - &\quad\left.+(1-tilde{Y}_i)\log\left(1+\exp(Q_i+h_0\frac{1-W}{1-\hat{e}(X_i)}+h_1\frac{W}{\hat{e}(X_i)}\right)\right] + &\quad\left.+(1-\tilde{Y}_i)\log\left(1+\exp(Q_i+h_0\frac{1-W}{1-\hat{e}(X_i)}+h_1\frac{W}{\hat{e}(X_i)}\right)\right] **Step 4** Let + .. math:: \tilde{Q}_0^* = \frac{1}{1+\exp\left(-Q-h_0 \frac{1}{1-\hat{e}(X)\right)},\\ \tilde{Q}_1^* = \frac{1}{1+\exp\left(-Q-h_1 \frac{1}{\hat{e}(X)}\right)}. 
From 2ca16e05d9da27d641123b822b193c79d3b9431a Mon Sep 17 00:00:00 2001 From: Huigang Chen Date: Wed, 29 Dec 2021 14:54:58 -0800 Subject: [PATCH 13/15] Format correction --- docs/methodology.rst | 6 +++--- docs/validation.rst | 4 ++-- 2 files changed, 5 insertions(+), 5 deletions(-) diff --git a/docs/methodology.rst b/docs/methodology.rst index 99a8d66a..2120387a 100755 --- a/docs/methodology.rst +++ b/docs/methodology.rst @@ -362,7 +362,7 @@ Let :math:`Q=\log(\tilde{m}_W(X)/(1-\tilde{m}_W(X)))`. Maximize the following ps Let .. math:: - \tilde{Q}_0^* = \frac{1}{1+\exp\left(-Q-h_0 \frac{1}{1-\hat{e}(X)\right)},\\ - \tilde{Q}_1^* = \frac{1}{1+\exp\left(-Q-h_1 \frac{1}{\hat{e}(X)}\right)}. + \tilde{Q}_0 &= \frac{1}{1+\exp\left(-Q-h_0 \frac{1}{1-\hat{e}(X)\right)},\\ + \tilde{Q}_1 &= \frac{1}{1+\exp\left(-Q-h_1 \frac{1}{\hat{e}(X)}\right)}. -The ATE estimate is the sample average of the differences of :math:`\tilde{Q}_1^*` and :math:`\tilde{Q}_0^*` after rescale to the original range. +The ATE estimate is the sample average of the differences of :math:`\tilde{Q}_1` and :math:`\tilde{Q}_0` after rescale to the original range. diff --git a/docs/validation.rst b/docs/validation.rst index 7f742ea3..b330b17e 100644 --- a/docs/validation.rst +++ b/docs/validation.rst @@ -92,7 +92,7 @@ Mechanism 4 Validation with Uplift Curve (AUUC) ---------------------------------- -We can validate the estimation by evaluating and comparing the uplift gains with AUUC (Area Under Uplift Curve), it calculates cumulative gains. Please find more details in `meta_learners_with_synthetic_data.ipynb example notebook `. +We can validate the estimation by evaluating and comparing the uplift gains with AUUC (Area Under Uplift Curve), it calculates cumulative gains. Please find more details in `meta_learners_with_synthetic_data.ipynb example notebook `_. .. code-block:: python @@ -112,7 +112,7 @@ We can validate the estimation by evaluating and comparing the uplift gains with .. image:: ./_static/img/auuc_vis.png :width: 629 -For data with skewed treatment, it is sometimes advantageous to use :ref:`Targeted maximum likelihood estimation (TMLE) for ATE` to generate the AUUC curve for validation, as TMLE provides a more accurate estimation of ATE. Please find `validation_with_tmle.ipynb example notebook ` for details. +For data with skewed treatment, it is sometimes advantageous to use :ref:`Targeted maximum likelihood estimation (TMLE) for ATE` to generate the AUUC curve for validation, as TMLE provides a more accurate estimation of ATE. Please find `validation_with_tmle.ipynb example notebook `_ for details. Validation with Sensitivity Analysis ---------------------------------- From 45d472f5de49ebe5d3ca6035b2937808025898b4 Mon Sep 17 00:00:00 2001 From: Huigang Chen Date: Wed, 29 Dec 2021 15:06:22 -0800 Subject: [PATCH 14/15] Format correction --- docs/methodology.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/methodology.rst b/docs/methodology.rst index 2120387a..7886f6e4 100755 --- a/docs/methodology.rst +++ b/docs/methodology.rst @@ -362,7 +362,7 @@ Let :math:`Q=\log(\tilde{m}_W(X)/(1-\tilde{m}_W(X)))`. Maximize the following ps Let .. math:: - \tilde{Q}_0 &= \frac{1}{1+\exp\left(-Q-h_0 \frac{1}{1-\hat{e}(X)\right)},\\ + \tilde{Q}_0 &= \frac{1}{1+\exp\left(-Q-h_0 \frac{1}{1-\hat{e}(X)}\right)},\\ \tilde{Q}_1 &= \frac{1}{1+\exp\left(-Q-h_1 \frac{1}{\hat{e}(X)}\right)}. 
The ATE estimate is the sample average of the differences of :math:`\tilde{Q}_1` and :math:`\tilde{Q}_0` after rescale to the original range.

From 4677a520ece10931a9e844b1beacc8157b8d13c8 Mon Sep 17 00:00:00 2001
From: Huigang Chen
Date: Wed, 29 Dec 2021 16:10:35 -0800
Subject: [PATCH 15/15] Typo correction

---
 docs/methodology.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/methodology.rst b/docs/methodology.rst
index 7886f6e4..7801f30c 100755
--- a/docs/methodology.rst
+++ b/docs/methodology.rst
@@ -153,7 +153,7 @@ with :math:`\{Y^3, X^3, W^3\}`
 
 **Stage 3**
 
-Similar to the DR-Leaner Repeat Stage 1 and Stage 2 again twice with different permutations of partitions for estimation. The final conditional LATE model is the average of the 3 conditional LATE models.
+Similar to the DR-learner, repeat Stage 1 and Stage 2 twice more with different permutations of partitions for estimation. The final conditional LATE model is the average of the 3 conditional LATE models.
 
 Tree-Based Algorithms
 ---------------------
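As a closing illustration of Stage 2 of the DRIV learner: the loss above is a squared difference between a doubly robust outcome contrast and a doubly robust treatment-uptake contrast scaled by :math:`\hat{\tau}(X)`, so it can be minimized with an off-the-shelf regressor as a weighted regression. The sketch below assumes the Stage 1 nuisance predictions are already available as arrays; the helper name ``driv_stage2`` and the model choice are illustrative, not the package's implementation.

.. code-block:: python

    import numpy as np
    from sklearn.ensemble import GradientBoostingRegressor

    def driv_stage2(X, Y, W, Z, e0_hat, e1_hat, m0_hat, m1_hat, p_z):
        # Doubly robust contrasts for the outcome (A) and for treatment uptake (B)
        A = m1_hat - m0_hat + Z * (Y - m1_hat) / p_z - (1 - Z) * (Y - m0_hat) / (1 - p_z)
        B = e1_hat - e0_hat + Z * (W - e1_hat) / p_z - (1 - Z) * (W - e0_hat) / (1 - p_z)
        # (A - B * tau)^2 = B^2 * (A / B - tau)^2, so minimizing the loss over tau(X)
        # amounts to regressing A / B on X with sample weights B^2.
        B = np.where(B >= 0, np.maximum(B, 1e-3), np.minimum(B, -1e-3))  # guard tiny denominators
        return GradientBoostingRegressor().fit(X, A / B, sample_weight=B ** 2)

Averaging the regressors fitted on the three partition rotations, as in Stage 3, gives the final conditional LATE model.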