#' @param ... Additional, optional arguments.
#'
#' @details Let \mjseqn{y} be the response variable for which we want to estimate the population mean
#' given by \mjsdeqn{\mu_{y} = \frac{1}{N} \sum_{i=1}^N y_{i}.} For this purpose we consider data integration
#' with the following structure. Let \mjseqn{S_A} be the non-probability sample with the design matrix of covariates as
#' \mjsdeqn{
#' \boldsymbol{X}_A =
#' \begin{bmatrix}
#' x_{11} & x_{12} & \cdots & x_{1p} \cr
#' x_{21} & x_{22} & \cdots & x_{2p} \cr
#' \vdots & \vdots & \ddots & \vdots \cr
#' x_{n_{A}1} & x_{n_{A}2} & \cdots & x_{n_{A}p} \cr
#' \end{bmatrix}
#' }
#' and the vector of the outcome variable
#' \mjsdeqn{
#' \boldsymbol{y} =
#' \begin{bmatrix}
#' y_{1} \cr
#' y_{2} \cr
#' \vdots \cr
#' y_{n_{A}}.
#' \end{bmatrix}
#' }
#' On the other hand, let \mjseqn{S_B} be the probability sample with the design matrix of covariates as
#' \mjsdeqn{
#' \boldsymbol{X}_B =
#' \begin{bmatrix}
#' x_{11} & x_{12} & \cdots & x_{1p} \cr
#' x_{21} & x_{22} & \cdots & x_{2p} \cr
#' \vdots & \vdots & \ddots & \vdots \cr
#' x_{n_{B}1} & x_{n_{B}2} & \cdots & x_{n_{B}p}. \cr
#' \end{bmatrix}
#' }
#' Instead of a sample of units we can consider a vector of population totals in the form of \mjseqn{\tau_x = (\sum_{i \in \mathcal{U}} x_{i1}, \sum_{i \in \mathcal{U}} x_{i2}, ..., \sum_{i \in \mathcal{U}} x_{ip})} or means
#' \mjseqn{\frac{\tau_x}{N}}, where \mjseqn{\mathcal{U}} refers to some finite population. Notice that we do not assume access to the response variable for \mjseqn{S_B}.
#' Generally, we make the following assumptions:
#' 1. The selection indicator of belonging to the non-probability sample \mjseqn{R_{i}} and the response variable \mjseqn{y_i} are independent given the set of covariates \mjseqn{\boldsymbol{x}_i}.
#' 2. All units have a non-zero propensity score, that is, \mjseqn{\pi_{i}^{A} > 0} for all \mjseqn{i}.
#' 3. The indicator variables \mjseqn{R_{i}^{A}} and \mjseqn{R_{j}^{A}} are independent given \mjseqn{\boldsymbol{x}_i} and \mjseqn{\boldsymbol{x}_j} for \mjseqn{i \neq j}.
#'
#' There are three possible approaches to the problem of population mean estimation using non-probability samples:
#'
#' 1. Inverse probability weighting -- The inverse probability approach is based on the assumption that a reference probability sample
#' is available and therefore we can estimate the propensity score of the selection mechanism.
#' The estimator has the following form:
#' \mjsdeqn{\mu_{IPW} = \frac{1}{N^{A}}\sum_{i \in S_{A}} \frac{y_{i}}{\hat{\pi}_{i}^{A}}.}
#' For this purpose we consider multiple ways of estimation. The first approach is Maximum Likelihood Estimation with a corrected
#' log-likelihood function, which is given by the following formula
#' \mjsdeqn{
#' \ell^{*}(\boldsymbol{\theta}) = \sum_{i \in S_{A}}\log \left\lbrace \frac{\pi(\boldsymbol{x}_{i}, \boldsymbol{\theta})}{1 - \pi(\boldsymbol{x}_{i},\boldsymbol{\theta})}\right\rbrace + \sum_{i \in S_{B}}d_{i}^{B}\log \left\lbrace 1 - \pi(\boldsymbol{x}_{i},\boldsymbol{\theta})\right\rbrace.}
#' In the literature the main approach to modelling the propensity scores \mjseqn{\pi_i^A} is based on the `logit` link function;
#' however, we extend the propensity score model with additional link functions, such as `cloglog` and `probit`.
#' The pseudo-score equations derived from ML methods can be replaced by the idea of generalized estimating equations with calibration constraints defined by the equations
#' \mjsdeqn{
#' \mathbf{U}(\boldsymbol{\theta})=\sum_{i \in S_A} \mathbf{h}\left(\mathbf{x}_i, \boldsymbol{\theta}\right)-\sum_{i \in S_B} d_i^B \pi\left(\mathbf{x}_i, \boldsymbol{\theta}\right) \mathbf{h}\left(\mathbf{x}_i, \boldsymbol{\theta}\right).}
#' Notice that for \mjseqn{\mathbf{h}\left(\mathbf{x}_i, \boldsymbol{\theta}\right) = \frac{\mathbf{x}_i}{\pi(\boldsymbol{x}_i, \boldsymbol{\theta})}} we do not require a probability
#' sample and can use a vector of population totals/means.
#'
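#'
#' A minimal numeric sketch of the IPW estimator above (Python is used purely for illustration; the outcomes and propensity scores below are hypothetical toy values, and in practice \mjseqn{\hat{\pi}_{i}^{A}} comes from the estimation methods described here):

```python
# Illustrative sketch of the IPW estimator (hypothetical toy data;
# in practice the propensity scores pi_i^A are estimated, e.g. via MLE or GEE).
y = [2.0, 3.5, 1.0, 4.0]        # outcomes observed in the non-probability sample S_A
pi_hat = [0.8, 0.5, 0.4, 0.5]   # assumed (estimated) propensity scores pi_i^A

# Hajek-type choice: N^A is estimated by the sum of inverse propensities.
N_hat = sum(1.0 / p for p in pi_hat)
mu_ipw = sum(yi / p for yi, p in zip(y, pi_hat)) / N_hat
```

#' Units with small propensity scores receive large weights, which is why assumption 2 (\mjseqn{\pi_{i}^{A} > 0}) is essential.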
#' 2. Mass imputation -- This method relies on a framework
#' where imputed values of the outcome variable are created for the whole probability sample.
#' In this case we treat the big-data sample as a training dataset, which is used to build an imputation model. Using imputed values
#' for the probability sample and the (known) design weights, we can build a population mean estimator of the form:
#' \mjsdeqn{\mu_{MI} = \frac{1}{N^B}\sum_{i \in S_{B}} d_{i}^{B} \hat{y}_i.}
#' It opens the door to very flexible methods for the imputation model. The package uses generalized linear models from [stats::glm()],
#' the nearest neighbour algorithm using [RANN::nn2()] and predictive mean matching.
#'
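#'
#' A toy sketch of the MI estimator (Python for illustration only; the data are hypothetical, and a simple linear regression stands in for the imputation models listed above):

```python
# Illustrative sketch of the mass imputation (MI) estimator with a simple
# linear imputation model fitted on the non-probability sample S_A.
x_a = [1.0, 2.0, 3.0, 4.0]                 # covariates in S_A (training data)
y_a = [2.1, 3.9, 6.1, 8.0]                 # outcomes in S_A
x_b = [1.5, 2.5, 3.5]                      # covariates in the probability sample S_B
d_b = [10.0, 20.0, 10.0]                   # design weights d_i^B

# Fit y = a + b*x on S_A by ordinary least squares.
n = len(x_a)
xm = sum(x_a) / n
ym = sum(y_a) / n
b = sum((x - xm) * (y - ym) for x, y in zip(x_a, y_a)) / sum((x - xm) ** 2 for x in x_a)
a = ym - b * xm

# Impute outcomes for S_B and take the design-weighted mean.
y_hat = [a + b * x for x in x_b]
N_hat = sum(d_b)                           # N^B estimated by the sum of design weights
mu_mi = sum(d * yh for d, yh in zip(d_b, y_hat)) / N_hat
```

#' Only the covariates (never the response) are needed on \mjseqn{S_B}, matching the data structure described above.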
#' 3. Doubly robust estimation -- The IPW and MI estimators are sensitive to misspecified models for the propensity score and the outcome variable, respectively.
#' For this reason, so-called doubly robust methods, which take these problems into account, are presented.
#' The idea is a simple combination of the propensity score and imputation models during inference, which leads to the following estimator
#' \mjsdeqn{\mu_{DR} = \frac{1}{N^A}\sum_{i \in S_A} \hat{d}_i^A (y_i - \hat{y}_i) + \frac{1}{N^B}\sum_{i \in S_B} d_i^B \hat{y}_i.}
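#'
#' A toy numeric sketch of the DR estimator (Python, hypothetical numbers; \mjseqn{\hat{\pi}_{i}^{A}} and \mjseqn{\hat{y}_{i}} would normally come from the fitted propensity and imputation models):

```python
# Illustrative sketch of the doubly robust (DR) estimator: an IPW-weighted
# correction on S_A plus the design-weighted imputation mean on S_B.
y_a = [2.0, 3.5, 1.0, 4.0]            # outcomes in S_A
yhat_a = [2.2, 3.0, 1.5, 3.8]         # imputation-model predictions on S_A
pi_hat = [0.8, 0.5, 0.4, 0.5]         # estimated propensity scores on S_A
d_a = [1.0 / p for p in pi_hat]       # hat{d}_i^A = 1 / hat{pi}_i^A

yhat_b = [2.5, 3.1, 2.8]              # predictions on the probability sample S_B
d_b = [10.0, 20.0, 10.0]              # design weights d_i^B

N_a = sum(d_a)                        # N^A estimated from inverse propensities
N_b = sum(d_b)                        # N^B estimated from design weights
correction = sum(d * (y - yh) for d, y, yh in zip(d_a, y_a, yhat_a)) / N_a
mu_dr = correction + sum(d * yh for d, yh in zip(d_b, yhat_b)) / N_b
```

#' If the imputation model is correct, the correction term has mean zero; if the propensity model is correct, the correction removes the imputation bias, which is the source of the double robustness.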
#' In addition, an approach based directly on bias minimisation has been implemented. The following formula
#' \mjsdeqn{
#' \begin{aligned}
#' bias(\hat{\mu}_{DR}) = & \ \mathbb{E}(\hat{\mu}_{DR} - \mu) \cr
#' = & \ \mathbb{E} \left\lbrace \frac{1}{N} \sum_{i=1}^N \left( \frac{R_i^{A}}{\pi_i^{A}(\boldsymbol{x}_i, \boldsymbol{\theta})} - 1 \right) \left( y_i - m(\boldsymbol{x}_i, \boldsymbol{\beta}) \right) \right\rbrace \cr
#' + & \ \mathbb{E} \left\lbrace \frac{1}{N} \sum_{i=1}^N \left( R_i^{B} d_i^{B} - 1 \right) m(\boldsymbol{x}_i, \boldsymbol{\beta}) \right\rbrace
#' \end{aligned}
#' }
#' leads us to a system of equations
#' \mjsdeqn{
#' \begin{aligned}
#' J(\theta, \beta) =
#' \left\lbrace
#' \begin{array}{c}
#' \sum_{i=1}^{N} R_i^{\mathrm{A}} \left\lbrace \frac{1}{\pi(\boldsymbol{x}_i, \boldsymbol{\theta})} - 1 \right\rbrace \boldsymbol{x}_i \cr
#' \sum_{i \in \mathcal{S}_{\mathrm{A}}} \frac{1}{\pi(\boldsymbol{x}_i, \boldsymbol{\theta})} \frac{\partial m(\boldsymbol{x}_i, \boldsymbol{\beta})}{\partial \boldsymbol{\beta}}
#' - \sum_{i \in \mathcal{S}_{\mathrm{B}}} d_i^{\mathrm{B}} \frac{\partial m(\boldsymbol{x}_i, \boldsymbol{\beta})}{\partial \boldsymbol{\beta}}
#' \end{array} \right\rbrace,
#' \end{aligned}
#' }
#' where \mjseqn{m\left(\boldsymbol{x}_{i}, \boldsymbol{\beta}\right)} is the mass imputation (regression) model for the outcome variable and
#' the propensity scores \mjseqn{\pi_i^A} are estimated using the `logit` link for the model. As in the `MLE` and `GEE` approaches, we have extended
#' this method to the `cloglog` and `probit` links.
#'
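#'
#' The three link functions mentioned above are standard; as a sketch, their inverse links map a linear predictor \mjseqn{\eta = \boldsymbol{x}^{\top}\boldsymbol{\theta}} to a propensity in (0, 1):

```python
import math

def inv_logit(eta):
    # logit link: pi = 1 / (1 + exp(-eta))
    return 1.0 / (1.0 + math.exp(-eta))

def inv_cloglog(eta):
    # complementary log-log link: pi = 1 - exp(-exp(eta))
    return 1.0 - math.exp(-math.exp(eta))

def inv_probit(eta):
    # probit link: pi = Phi(eta), the standard normal CDF
    return 0.5 * (1.0 + math.erf(eta / math.sqrt(2.0)))
```

#' Unlike `logit` and `probit`, the `cloglog` inverse link is asymmetric: eta = 0 maps to 1 - exp(-1), not 0.5.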