From 1a404097012f9bfff264d658e8b403e08a11ffa9 Mon Sep 17 00:00:00 2001 From: BERENZ Date: Thu, 30 Jan 2025 22:36:07 +0100 Subject: [PATCH] citation file added, description updated, nonprob function documentation updated --- DESCRIPTION | 2 +- NEWS.md | 124 ++++++++++++++++++++-------- R/nonprob_documentation.R | 138 +++++++++++++++---------------- inst/CITATION | 4 +- man/cloglog_model_nonprobsvy.Rd | 2 +- man/logit_model_nonprobsvy.Rd | 2 +- man/nonprob.Rd | 141 ++++++++++++++++---------------- man/probit_model_nonprobsvy.Rd | 2 +- 8 files changed, 235 insertions(+), 180 deletions(-) diff --git a/DESCRIPTION b/DESCRIPTION index f99809b..23088fe 100644 --- a/DESCRIPTION +++ b/DESCRIPTION @@ -17,7 +17,7 @@ Authors@R: role = c("aut", "ctb"), email = "piotrek.pecet@gmail.com", comment = c(ORCID = "0009-0006-4867-7434"))) -Description: Statistical inference with non-probability samples when auxiliary information from external sources such as probability samples or population totals or means is available. Details can be found in: Wu et al. (2020) , Kim et al. (2021) , Wu et al. (2023) , Kim et al. (2021) , Kim et al. (2020) . +Description: Statistical inference with non-probability samples when auxiliary information from external sources such as probability samples or population totals or means is available. The package implements various methods such as inverse probability (propensity score) weighting, mass imputation and doubly robust approach. Details can be found in: Wu et al. (2020) , Kim et al. (2021) , Wu et al. (2023) , Kim et al. (2021) , Kim et al. (2020) . 
License: MIT + file LICENSE Encoding: UTF-8 LazyData: true diff --git a/NEWS.md b/NEWS.md index 8574c2b..001f06d 100644 --- a/NEWS.md +++ b/NEWS.md @@ -1,75 +1,129 @@ +# nonprobsvy News and Updates + # nonprobsvy 0.2 ------------------------------------------------------------------------ ### Breaking changes -- functions `pop.size`, `controlSel`, `controlOut` and `controlInf` were renamed to `pop_size`, `control_sel`, `control_out` and `control_inf` respectively. -- function `genSimData` removed completely as it is not used anywhere in the package. -- argument `maxLik_method` renamed to `maxlik_method` in the `control_sel` function. +- functions `pop.size`, `controlSel`, `controlOut` and `controlInf` + were renamed to `pop_size`, `control_sel`, `control_out` and + `control_inf` respectively. +- function `genSimData` removed completely as it is not used anywhere + in the package. +- argument `maxLik_method` renamed to `maxlik_method` in the + `control_sel` function. ### Features -- two additional datasets have been included: `jvs` (Job Vacancy Survey; a probability sample survey) and `admin` (Central Job Offers Database; a non-probability sample survey). The units and auxiliary variables have been aligned in a way that allows the data to be integrated using the methods implemented in this package. -- a `nonprobsvycheck` function was added to check the balance in the totals of the variables based on the weighted weights between the non-probability and probability samples. -- citation file added. +- two additional datasets have been included: `jvs` (Job Vacancy + Survey; a probability sample survey) and `admin` (Central Job Offers + Database; a non-probability sample survey). The units and auxiliary + variables have been aligned in a way that allows the data to be + integrated using the methods implemented in this package. 
+- a `nonprobsvycheck` function was added to check the balance in the + totals of the variables based on the weighted weights between the + non-probability and probability samples. +- citation file added. ### Bugfixes -- basic methods and functions related to variance estimation, weights and probability linking methods have been rewritten in a more optimal and readable way. + +- basic methods and functions related to variance estimation, weights + and probability linking methods have been rewritten in a more + efficient and readable way. ### Other -- more informative error messages added. + +- more informative error messages added. ### Documentation -- annotation has been added that arguments such as `strata`, `subset` and `na_action` are not supported for the time being. +- an annotation has been added that arguments such as `strata`, `subset` + and `na_action` are not supported for the time being. # nonprobsvy 0.1.1 ------------------------------------------------------------------------ ### Bugfixes -- bug Fix occurring when estimation was based on auxiliary variable, which led to compression of the data from the frame to the vector. -- bug Fix related to not passing `maxit` argument from `controlSel` function to internally used `nleqslv` function -- bug Fix related to storing `vector` in `model_frame` when predicting `y_hat` in mass imputation `glm` model when X is based in one auxiliary variable only - fix provided converting it to `data.frame` object. - + +- fixed a bug occurring when estimation was based on an auxiliary variable, + which led to compression of the data from the frame to the vector. +- fixed a bug related to not passing the `maxit` argument from the `controlSel` + function to the internally used `nleqslv` function. +- fixed a bug related to storing a `vector` in `model_frame` when predicting + `y_hat` in the mass imputation `glm` model when X is based on one + auxiliary variable only; fixed by converting it to a `data.frame` + object.
+ ### Features -- added information to `summary` about quality of estimation basing on difference between estimated and known total values of auxiliary variables -- added estimation of exact standard error for k-nearest neighbor estimator. -- added breaking change to `controlOut` function by switching values for `predictive_match` argument. From now on, the `predictive_match = 1` means $\hat{y}-\hat{y}$ in predictive mean matching imputation and `predictive_match = 2` corresponds to $\hat{y}-y$ matching. -- implemented `div` option when variable selection (more in documentation) for doubly robust estimation. -- added more insights to `nonprob` output such as gradient, hessian and jacobian derived from IPW estimation for `mle` and `gee` methods when `IPW` or `DR` model executed. -- added estimated inclusion probabilities and its derivatives for probability and non-probability samples to `nonprob` output when `IPW` or `DR` model executed. -- added `model_frame` matrix data from probability sample used for mass imputation to `nonprob` when `MI` or `DR` model executed. + +- added information to `summary` about the quality of estimation based on the + difference between the estimated and known total values of auxiliary + variables +- added estimation of the exact standard error for the k-nearest neighbour + estimator. +- added a breaking change to the `controlOut` function by switching values + for the `predictive_match` argument. From now on, + `predictive_match = 1` means $\hat{y}-\hat{y}$ in predictive mean + matching imputation and `predictive_match = 2` corresponds to + $\hat{y}-y$ matching. +- implemented the `div` option for variable selection (more in the + documentation) for doubly robust estimation. +- added more insights to the `nonprob` output, such as the gradient, Hessian + and Jacobian derived from IPW estimation for the `mle` and `gee` methods + when the `IPW` or `DR` model is executed.
+- added estimated inclusion probabilities and their derivatives for + probability and non-probability samples to the `nonprob` output when + the `IPW` or `DR` model is executed. +- added the `model_frame` matrix of data from the probability sample used for + mass imputation to `nonprob` when the `MI` or `DR` model is executed. ### Unit tests -- added unit tests for variable selection models and mi estimation with vector of population totals available - + +- added unit tests for variable selection models and MI estimation + with a vector of population totals available + # nonprobsvy 0.1.0 ------------------------------------------------------------------------ ### Features -- implemented population mean estimation using doubly robust, inverse probability weighting and mass imputation methods -- implemented inverse probability weighting models with Maximum Likelihood Estimation and Generalized Estimating Equations methods with `logit`, `complementary log-log` and `probit` link functions. -- implemented `generalized linear models`, `nearest neighbours` and `predictive mean matching` methods for Mass Imputation +- implemented population mean estimation using doubly robust, inverse + probability weighting and mass imputation methods +- implemented inverse probability weighting models with Maximum + Likelihood Estimation and Generalized Estimating Equations methods + with `logit`, `complementary log-log` and `probit` link functions.
+- implemented `generalized linear models`, `nearest neighbours` and + `predictive mean matching` methods for Mass Imputation - implemented bias correction estimators for doubly-robust approach -- implemented estimation methods when vector of population means/totals is available -- implemented variables selection with `SCAD`, `LASSO` and `MCP` penalization equations -- implemented `analytic` and `bootstrap` (with parallel computation - `doSNOW` package) variance for described estimators +- implemented estimation methods when a vector of population + means/totals is available +- implemented variable selection with `SCAD`, `LASSO` and `MCP` + penalties +- implemented `analytic` and `bootstrap` (with parallel computation via the + `doSNOW` package) variance estimation for the described estimators - added control parameters for models - added S3 methods for objects of the `nonprob` class such as - `nobs` for sample size - `pop.size` for population size estimation - - `residuals` for residuals of the inverse probability weighting model - - `cooks.distance` for identifying influential observations that have a significant impact on the parameter estimates - - `hatvalues` for measuring the leverage of individual observations + - `residuals` for residuals of the inverse probability weighting + model + - `cooks.distance` for identifying influential observations that + have a significant impact on the parameter estimates + - `hatvalues` for measuring the leverage of individual + observations - `logLik` for computing the log-likelihood of the model, - - `AIC` (Akaike Information Criterion) for evaluating the model based on the trade-off between goodness of fit and complexity, helping in model selection - - `BIC` (Bayesian Information Criterion) for a similar purpose as AIC but with a stronger penalty for model complexity - - `confint` for calculating confidence intervals around parameter estimates - - `vcov` for obtaining the variance-covariance matrix of the parameter estimates + - `AIC`
(Akaike Information Criterion) for evaluating the model + based on the trade-off between goodness of fit and complexity, + helping in model selection + - `BIC` (Bayesian Information Criterion) for a similar purpose as + AIC but with a stronger penalty for model complexity + - `confint` for calculating confidence intervals around parameter + estimates + - `vcov` for obtaining the variance-covariance matrix of the + parameter estimates - `deviance` for assessing the goodness of fit of the model ### Unit tests diff --git a/R/nonprob_documentation.R b/R/nonprob_documentation.R index d4d63d5..4b03756 100644 --- a/R/nonprob_documentation.R +++ b/R/nonprob_documentation.R @@ -1,45 +1,46 @@ #' @import mathjaxr NULL #' @title Inference with non-probability survey samples -#' @author Łukasz Chrostowski, Maciej Beręsewicz +#' @author Łukasz Chrostowski, Maciej Beręsewicz, Piotr Chlebicki #' #' \loadmathjax -#' @description \code{nonprob} fits a model for inference based on non-probability surveys (including big data) using various methods. -#' The function allows you to estimate the population mean with access to a reference probability sample, as well as sums and means of covariates. +#' @description The \code{nonprob} function provides access to various methods for inference based on non-probability surveys (including big data). The function allows the user to estimate the population mean with access to a reference probability sample (via the `survey` package), as well as totals or means of covariates. #' #' The package implements state-of-the-art approaches recently proposed in the literature: Chen et al. (2020), -#' Yang et al. (2020), Wu (2022) and uses the [Lumley 2004](https://CRAN.R-project.org/package=survey) `survey` package for inference. +#' Yang et al. (2020), Wu (2022) and uses the [Lumley 2004](https://CRAN.R-project.org/package=survey) `survey` package for inference (if a reference probability sample is provided).
+#' +#' It provides various propensity score weighting (e.g. with calibration constraints), mass imputation (e.g. nearest neighbour, predictive mean matching) and doubly robust estimators (e.g. that take into account minimisation of the asymptotic bias of the population mean estimators). #' -#' It provides propensity score weighting (e.g. with calibration constraints), mass imputation (e.g. nearest neighbour) and -#' doubly robust estimators that take into account minimisation of the asymptotic bias of the population mean estimators or -#' variable selection. #' The package uses the `survey` package functionality when a probability sample is available. #' +#' All optional parameters are set to `NULL`. The obligatory ones include `data` as well as one of the following three: +#' \code{selection}, \code{outcome}, or \code{target} -- depending on which method has been selected. +#' In the case of \code{outcome} and \code{target}, multiple \mjseqn{y} variables can be specified. #' -#' @param data a `data.frame` with data from the non-probability sample. -#' @param selection a `formula`, the selection (propensity) equation. -#' @param outcome a `formula`, the outcome equation. -#' @param target a `formula` with target variables. -#' @param svydesign an optional `svydesign` object (from the survey package) containing a probability sample and design weights. +#' @param data a `data.frame` with the dataset containing the non-probability sample. +#' @param selection a `formula` (default `NULL`) for the selection (propensity) score model. +#' @param outcome a `formula` (default `NULL`) for the outcome (target) model. +#' @param target a `formula` (default `NULL`) with target variable(s). We allow multiple target variables (e.g. `~y1 + y2 + y3`). +#' @param svydesign an optional `svydesign2` class object containing a probability sample and design weights. #' @param pop_totals an optional `named vector` with population totals of the covariates.
#' @param pop_means an optional `named vector` with population means of the covariates. #' @param pop_size an optional `double` value with population size. -#' @param method_selection a `character` indicating the method for propensity scores estimation. -#' @param method_outcome a `character` indicating the method for response variable estimation. -#' @param family_outcome a `character` string describing the error distribution and the link function to be used in the model, set to `gaussian` by default. Currently supports: gaussian with identity link, poisson and binomial. +#' @param method_selection a `character` (default `logit`) indicating the method for the propensity score link function. +#' @param method_outcome a `character` (default `glm`) indicating the method for the outcome model. +#' @param family_outcome a `character` (default `gaussian`) describing the error distribution and the link function to be used in the model. Currently supports: `gaussian` with the identity link, `poisson` and `binomial`. #' @param subset an optional `vector` specifying a subset of observations to be used in the fitting process - not yet supported. -#' @param strata an optional `vector` specifying strata - not yet supported. -#' @param weights an optional `vector` of prior weights to be used in the fitting process. Should be NULL or a numeric vector. It is assumed that this vector contains frequency or analytic weights. -#' @param na_action a function which indicates what should happen when the data contain `NAs` - not yet supported. -#' @param control_selection a `list` indicating parameters to be used when fitting the selection model for propensity scores. -#' @param control_outcome a `list` indicating parameters to be used when fitting the model for the outcome variable. 
-#' @param control_inference a `list` indicating parameters to be used for inference based on probability and non-probability samples, contains parameters such as the estimation method or the variance method. +#' @param strata an optional `vector` specifying strata (not yet supported, for further development). +#' @param weights an optional `vector` of prior weights to be used in the fitting process. It is assumed that this vector contains frequency or analytic weights (i.e. rows of the `data` argument are repeated according to the values of the `weights` argument), not probability/design weights. +#' @param na_action a function which indicates what should happen when the data contain `NAs` (not yet supported, for further development). +#' @param control_selection a `list` (default `control_sel()` result) indicating parameters to be used when fitting the selection model for propensity scores. To change the parameters one should use the `control_sel()` function. +#' @param control_outcome a `list` (default `control_out()` result) indicating parameters to be used when fitting the model for the outcome variable. To change the parameters one should use the `control_out()` function. +#' @param control_inference a `list` (default `control_inf()` result) indicating parameters to be used for inference based on probability and non-probability samples. To change the parameters one should use the `control_inf()` function. #' @param start_selection an optional `vector` with starting values for the parameters of the selection equation. #' @param start_outcome an optional `vector` with starting values for the parameters of the outcome equation. -#' @param verbose verbose, numeric. -#' @param x a logical value indicating whether to return model matrix of covariates as a part of the output. -#' @param y a logical value indicating whether to return vector of the outcome variable as a part of the output. 
-#' @param se Logical value indicating whether to calculate and return standard error of estimated mean. +#' @param verbose a numeric value (default `TRUE`) indicating whether detailed information on the fitting process should be presented. +#' @param x a logical value (default `TRUE`) indicating whether to return the model matrix of covariates as a part of the output. +#' @param y a logical value (default `TRUE`) indicating whether to return the vector of the outcome variable as a part of the output. +#' @param se a logical value (default `TRUE`) indicating whether to calculate and return the standard error of the estimated mean. #' @param ... Additional, optional arguments. #' #' @details Let \mjseqn{y} be the response variable for which we want to estimate the population mean, @@ -83,7 +84,7 @@ NULL #' #' There are three possible approaches to the problem of estimating population mean using non-probability samples: #' -#' 1. Inverse probability weighting - The main drawback of non-probability sampling is the unknown selection mechanism for a unit to be included in the sample. +#' 1. Inverse probability weighting -- The main drawback of non-probability sampling is the unknown selection mechanism for a unit to be included in the sample. #' This is why we talk about the so-called "biased sample" problem. The inverse probability approach is based on the assumption that a reference probability sample #' is available and therefore we can estimate the propensity score of the selection mechanism. #' The estimator has the following form: @@ -139,7 +140,7 @@ NULL #' #' As it is not straightforward to calculate the variances of these estimators, asymptotic equivalents of the variances derived using the Taylor approximation have been proposed in the literature. #' Details can be found [here](https://ncn-foreigners.github.io/nonprobsvy-book/intro.html). -#' In addition, a bootstrap approach can be used for variance estimation. +#' In addition, the bootstrap approach can be used for variance estimation.
#' #' The function also allows variables selection using known methods that have been implemented to handle the integration of probability and non-probability sampling. #' In the presence of high-dimensional data, variable selection is important, because it can reduce the variability in the estimate that results from using irrelevant variables to build the model. @@ -175,58 +176,57 @@ NULL #' probability surveys and big found data for finite population inference using mass imputation. #' Survey Methodology, June 2021 29 Vol. 47, No. 1, pp. 29-58 #' -#' @return Returns an object of class \code{c("nonprobsvy", "nonprobsvy_dr")} in case of doubly robust estimator, -#' \code{c("nonprobsvy", "nonprobsvy_mi")} in case of mass imputation estimator and -#' \code{c("nonprobsvy", "nonprobsvy_ipw")} in case of inverse probability weighting estimator -#' with type \code{list} containing:\cr +#' @return Returns an object of class \code{c("nonprobsvy", "nonprobsvy_ipw")} in case of inverse probability weighting estimator, \code{c("nonprobsvy", "nonprobsvy_mi")} in case of mass imputation estimator, or \code{c("nonprobsvy", "nonprobsvy_dr")} in case of doubly robust estimator, +#' +#' of type \code{list} containing:\cr #' \itemize{ -#' \item{\code{X} -- model matrix containing data from probability and non-probability samples if specified at a function call.} -#' \item{\code{y}} -- list of vector of outcome variables if specified at a function call. -#' \item{\code{R}} -- vector indicating the probablistic (0) or non-probablistic (1) units in the matrix X. 
-#' \item{\code{prob} -- vector of estimated propensity scores for non-probability sample.} -#' \item{\code{weights} -- vector of estimated weights for non-probability sample.} -#' \item{\code{control} -- list of control functions.} -#' \item{\code{output} -- output of the model with information on the estimated population mean and standard errors.} -#' \item{\code{SE} -- standard error of the estimator of the population mean, divided into errors from probability and non-probability samples.} -#' \item{\code{confidence_interval} -- confidence interval of population mean estimator.} -#' \item{\code{nonprob_size} -- size of non-probability sample.} -#' \item{\code{prob_size} -- size of probability sample.} -#' \item{\code{pop_size} -- estimated population size derived from estimated weights (non-probability sample) or known design weights (probability sample).} -#' \item{\code{pop_totals} -- the total values of the auxiliary variables derived from a probability sample or vector of total/mean values.} -#' \item{\code{outcome} -- list containing information about the fitting of the mass imputation model, in the case of regression model the object containing the list returned by +#' \item{\code{X} -- a `model.matrix` containing data from probability and non-probability samples if specified at a function call.} +#' \item{\code{y} -- a `list` of vectors of outcome variables if specified at a function call.} +#' \item{\code{R} -- a `numeric vector` indicating whether a unit belongs to the probability (0) or non-probability (1) sample in the matrix X.} +#' \item{\code{prob} -- a `numeric vector` of estimated propensity scores for the non-probability sample.} +#' \item{\code{weights} -- a `vector` of estimated weights for the non-probability sample.} +#' \item{\code{control} -- a `list` of control functions.} +#' \item{\code{output} -- the output of the model with information on the estimated population mean and standard errors.} +#' \item{\code{SE} -- a `data.frame` with standard
error of the estimator of the population mean, divided into errors from probability and non-probability samples.} +#' \item{\code{confidence_interval} -- a `data.frame` with the confidence interval of the population mean estimator.} +#' \item{\code{nonprob_size} -- a scalar `numeric vector` denoting the size of the non-probability sample.} +#' \item{\code{prob_size} -- a scalar `numeric vector` denoting the size of the probability sample.} +#' \item{\code{pop_size} -- a scalar `numeric vector` with the estimated population size derived from the estimated weights (non-probability sample) or known design weights (probability sample).} +#' \item{\code{pop_totals} -- a `numeric vector` with the total values of the auxiliary variables derived from a probability sample or a `numeric vector` of the total/mean values.} +#' \item{\code{outcome} -- a `list` containing information about the fitting of the mass imputation model; in the case of a regression model the object containing the list returned by #' [stats::glm()], in the case of the nearest neighbour imputation the object containing list returned by [RANN::nn2()].
If `bias_correction` in [control_inf()] is set to `TRUE`, the estimation is based on #' the joint estimating equations for the `selection` and `outcome` model and therefore, the list is different from the one returned by the [stats::glm()] function and contains elements such as #' \itemize{ -#' \item{\code{coefficients} -- estimated coefficients of the regression model.} -#' \item{\code{std_err} -- standard errors of the estimated coefficients.} -#' \item{\code{residuals} -- The response residuals.} -#' \item{\code{variance_covariance} -- The variance-covariance matrix of the coefficient estimates.} -#' \item{\code{df_residual} -- The degrees of freedom for residuals.} -#' \item{\code{family} -- specifies the error distribution and link function to be used in the model.} -#' \item{\code{fitted.values} -- The predicted values of the response variable based on the fitted model.} -#' \item{\code{linear.predictors} -- The linear fit on link scale.} -#' \item{\code{X} -- The design matrix.} +#' \item{\code{coefficients} -- a `numeric vector` with estimated coefficients of the regression model.} +#' \item{\code{std_err} -- a `numeric vector` with standard errors of the estimated coefficients.} +#' \item{\code{residuals} -- a `numeric vector` with the response residuals.} +#' \item{\code{variance_covariance} -- a `matrix` with the variance-covariance matrix of the coefficient estimates.} +#' \item{\code{df_residual} -- a scalar `vector` with the degrees of freedom for residuals.} +#' \item{\code{family} -- a `character` that specifies the error distribution and link function to be used in the model.} +#' \item{\code{fitted.values} -- a `numeric vector` with the predicted values of the response variable based on the fitted model.} +#' \item{\code{linear.predictors} -- a `numeric vector` with the linear fit on link scale.} +#' \item{\code{X} -- a `matrix` with the design matrix (`model.matrix`)} #' \item{\code{method} -- set on `glm`, since the regression method.} -#' 
\item{\code{model_frame} -- Matrix of data from probability sample used for mass imputation.} +#' \item{\code{model_frame} -- a `model.matrix` of data from the probability sample used for mass imputation.} #' } #' } #' In addition, if the variable selection model for the outcome variable is fitted, the list includes the #' \itemize{ #' \item{\code{cve} -- the error for each value of `lambda`, averaged across the cross-validation folds.} #' } -#' \item{\code{selection} -- list containing information about fitting of propensity score model, such as +#' \item{\code{selection} -- a `list` containing information about the fitting of the propensity score model, such as #' \itemize{ -#' \item{\code{coefficients} -- a named vector of coefficients.} -#' \item{\code{std_err} -- standard errors of the estimated model coefficients.} -#' \item{\code{residuals} -- the response residuals.} -#' \item{\code{variance} -- the root mean square error.} -#' \item{\code{fitted_values} -- the fitted mean values, obtained by transforming the linear predictors by the inverse of the link function.} +#' \item{\code{coefficients} -- a `numeric vector` of coefficients.} +#' \item{\code{std_err} -- a `numeric vector` with standard errors of the estimated model coefficients.} +#' \item{\code{residuals} -- a `numeric vector` with the response residuals.} +#' \item{\code{variance} -- a scalar `numeric vector` with the root mean square error.} +#' \item{\code{fitted_values} -- a `numeric vector` with the fitted mean values, obtained by transforming the linear predictors by the inverse of the link function.} #' \item{\code{link} -- the `link` object used.} -#' \item{\code{linear_predictors} -- the linear fit on link scale.} +#' \item{\code{linear_predictors} -- a `numeric vector` with the linear fit on link scale.} #' \item{\code{aic} -- A version of Akaike's An Information Criterion, minus twice the maximized log-likelihood plus twice the number of parameters.}
weights for non-probability sample.} -#' \item{\code{prior.weights} -- the weights initially supplied, a vector of 1s if none were.} -#' \item{\code{est_totals} -- the estimated total values of auxiliary variables derived from a non-probability sample}. +#' \item{\code{weights} -- a `numeric vector` with estimated weights for non-probability sample.} +#' \item{\code{prior.weights} -- a `numeric vector` with the frequency weights initially supplied, a vector of 1s if none were.} +#' \item{\code{est_totals} -- a `numeric vector` with the estimated total values of auxiliary variables derived from a non-probability sample}. #' \item{\code{formula} -- the formula supplied.} #' \item{\code{df_residual} -- the residual degrees of freedom.} #' \item{\code{log_likelihood} -- value of log-likelihood function if `mle` method, in the other case `NA`.} @@ -237,12 +237,12 @@ NULL #' \item{\code{gradient} -- Gradient of the log-likelihood function from `mle` method.} #' \item{\code{method} -- An estimation method for selection model, e.g. 
`mle` or `gee`.} #' \item{\code{prob_der} -- Derivative of the inclusion probability function for units in a non--probability sample.} -#' \item{\code{prob_rand} -- Inclusion probabilities for unit from a probabiliy sample from `svydesign` object.} -#' \item{\code{prob_rand_est} -- Inclusion probabilites to a non--probabiliy sample for unit from probability sample.} -#' \item{\code{prob_rand_est_der} -- Derivative of the inclusion probabilites to a non--probabiliy sample for unit from probability sample.} +#' \item{\code{prob_rand} -- Inclusion probabilities for units from the probability sample from the `svydesign` object.} +#' \item{\code{prob_rand_est} -- Inclusion probabilities to a non-probability sample for units from the probability sample.} +#' \item{\code{prob_rand_est_der} -- Derivative of the inclusion probabilities to a non-probability sample for units from the probability sample.} #' } #' } -#' \item{\code{stat} -- matrix of the estimated population means in each bootstrap iteration. +#' \item{\code{stat} -- a `matrix` of the estimated population means in each bootstrap iteration.
#' Returned only if a bootstrap method is used to estimate the variance and \code{keep_boot} in #' [control_inf()] is set on `TRUE`.} #' } diff --git a/inst/CITATION b/inst/CITATION index c8d43dc..b04eac4 100644 --- a/inst/CITATION +++ b/inst/CITATION @@ -5,8 +5,8 @@ bibentry( bibtype = "Manual", title = "Inference Based on Non-Probability Samples", author = c(person("Łukasz", "Chrostowski"), - person("Maciej", "Beręsewicz", - person("Piotr", "Chlebicki"))), + person("Maciej", "Beręsewicz"), + person("Piotr", "Chlebicki")), note = "R package version 0.2", year = "2025", url = "https://github.com/ncn-foreigners/nonprobsvy" diff --git a/man/cloglog_model_nonprobsvy.Rd b/man/cloglog_model_nonprobsvy.Rd index cbde79d..d60ed4b 100644 --- a/man/cloglog_model_nonprobsvy.Rd +++ b/man/cloglog_model_nonprobsvy.Rd @@ -1,5 +1,5 @@ % Generated by roxygen2: do not edit by hand -% Please edit documentation in R/cloglogModel.R +% Please edit documentation in R/model_cloglog.R \name{cloglog_model_nonprobsvy} \alias{cloglog_model_nonprobsvy} \title{Complementary log-log model for weights adjustment} diff --git a/man/logit_model_nonprobsvy.Rd b/man/logit_model_nonprobsvy.Rd index 0a3248c..9257116 100644 --- a/man/logit_model_nonprobsvy.Rd +++ b/man/logit_model_nonprobsvy.Rd @@ -1,5 +1,5 @@ % Generated by roxygen2: do not edit by hand -% Please edit documentation in R/logitModel.R +% Please edit documentation in R/model_logit.R \name{logit_model_nonprobsvy} \alias{logit_model_nonprobsvy} \title{Logit model for weights adjustment} diff --git a/man/nonprob.Rd b/man/nonprob.Rd index 8d75d51..2bec316 100644 --- a/man/nonprob.Rd +++ b/man/nonprob.Rd @@ -1,5 +1,5 @@ % Generated by roxygen2: do not edit by hand -% Please edit documentation in R/main_function_documentation.R, R/nonprob.R +% Please edit documentation in R/nonprob.R, R/nonprob_documentation.R \name{nonprob} \alias{nonprob} \title{Inference with non-probability survey samples} @@ -33,15 +33,15 @@ nonprob( ) } \arguments{ 
-\item{data}{a \code{data.frame} with data from the non-probability sample.} +\item{data}{a \code{data.frame} with the dataset containing the non-probability sample.} -\item{selection}{a \code{formula}, the selection (propensity) equation.} +\item{selection}{a \code{formula} (default \code{NULL}) for the selection (propensity) score model.} -\item{outcome}{a \code{formula}, the outcome equation.} +\item{outcome}{a \code{formula} (default \code{NULL}) for the outcome (target) model.} -\item{target}{a \code{formula} with target variables.} +\item{target}{a \code{formula} (default \code{NULL}) with target variable(s). We allow multiple target variables (e.g. \code{~y1 + y2 + y3}).} -\item{svydesign}{an optional \code{svydesign} object (from the survey package) containing a probability sample and design weights.} +\item{svydesign}{an optional \code{svydesign2} class object containing a probability sample and design weights.} \item{pop_totals}{an optional \verb{named vector} with population totals of the covariates.} @@ -49,93 +49,92 @@ nonprob( \item{pop_size}{an optional \code{double} value with population size.} -\item{method_selection}{a \code{character} indicating the method for propensity scores estimation.} +\item{method_selection}{a \code{character} (default \code{logit}) indicating the link function for the propensity score model.} -\item{method_outcome}{a \code{character} indicating the method for response variable estimation.} +\item{method_outcome}{a \code{character} (default \code{glm}) indicating the method for the outcome model.} -\item{family_outcome}{a \code{character} string describing the error distribution and the link function to be used in the model, set to \code{gaussian} by default. Currently supports: gaussian with identity link, poisson and binomial.} +\item{family_outcome}{a \code{character} (default \code{gaussian}) describing the error distribution and the link function to be used in the model. 
Currently supports: \code{gaussian} with the identity link, \code{poisson} and \code{binomial}.} \item{subset}{an optional \code{vector} specifying a subset of observations to be used in the fitting process - not yet supported.} -\item{strata}{an optional \code{vector} specifying strata - not yet supported.} +\item{strata}{an optional \code{vector} specifying strata (not yet supported, for further development).} -\item{weights}{an optional \code{vector} of prior weights to be used in the fitting process. Should be NULL or a numeric vector. It is assumed that this vector contains frequency or analytic weights.} +\item{weights}{an optional \code{vector} of prior weights to be used in the fitting process. It is assumed that this vector contains frequency or analytic weights (i.e. rows of the \code{data} argument are repeated according to the values of the \code{weights} argument), not probability/design weights.} -\item{na_action}{a function which indicates what should happen when the data contain \code{NAs} - not yet supported.} +\item{na_action}{a function which indicates what should happen when the data contain \code{NAs} (not yet supported, for further development).} -\item{control_selection}{a \code{list} indicating parameters to be used when fitting the selection model for propensity scores.} +\item{control_selection}{a \code{list} (default \code{control_sel()} result) indicating parameters to be used when fitting the selection model for propensity scores. To change the parameters one should use the \code{control_sel()} function.} -\item{control_outcome}{a \code{list} indicating parameters to be used when fitting the model for the outcome variable.} +\item{control_outcome}{a \code{list} (default \code{control_out()} result) indicating parameters to be used when fitting the model for the outcome variable. 
To change the parameters one should use the \code{control_out()} function.} -\item{control_inference}{a \code{list} indicating parameters to be used for inference based on probability and non-probability samples, contains parameters such as the estimation method or the variance method.} +\item{control_inference}{a \code{list} (default \code{control_inf()} result) indicating parameters to be used for inference based on probability and non-probability samples. To change the parameters one should use the \code{control_inf()} function.} \item{start_selection}{an optional \code{vector} with starting values for the parameters of the selection equation.} \item{start_outcome}{an optional \code{vector} with starting values for the parameters of the outcome equation.} -\item{verbose}{verbose, numeric.} +\item{verbose}{a numeric value (default \code{TRUE}) indicating whether detailed information on the fitting should be presented.} -\item{x}{a logical value indicating whether to return model matrix of covariates as a part of the output.} +\item{x}{a logical value (default \code{TRUE}) indicating whether to return the model matrix of covariates as a part of the output.} -\item{y}{a logical value indicating whether to return vector of the outcome variable as a part of the output.} +\item{y}{a logical value (default \code{TRUE}) indicating whether to return the vector of the outcome variable as a part of the output.} -\item{se}{Logical value indicating whether to calculate and return standard error of estimated mean.} +\item{se}{a logical value (default \code{TRUE}) indicating whether to calculate and return the standard error of the estimated mean.} \item{...}{Additional, optional arguments.} } \value{ -Returns an object of class \code{c("nonprobsvy", "nonprobsvy_dr")} in case of doubly robust estimator, -\code{c("nonprobsvy", "nonprobsvy_mi")} in case of mass imputation estimator and -\code{c("nonprobsvy", "nonprobsvy_ipw")} in case of inverse probability weighting estimator -with type \code{list} 
containing:\cr +Returns an object of class \code{c("nonprobsvy", "nonprobsvy_ipw")} in case of inverse probability weighting estimator, \code{c("nonprobsvy", "nonprobsvy_mi")} in case of mass imputation estimator, or \code{c("nonprobsvy", "nonprobsvy_dr")} in case of doubly robust estimator, + +of type \code{list} containing:\cr \itemize{ -\item{\code{X} -- model matrix containing data from probability and non-probability samples if specified at a function call.} -\item{\code{y}} -- list of vector of outcome variables if specified at a function call. -\item{\code{R}} -- vector indicating the probablistic (0) or non-probablistic (1) units in the matrix X. -\item{\code{prob} -- vector of estimated propensity scores for non-probability sample.} -\item{\code{weights} -- vector of estimated weights for non-probability sample.} -\item{\code{control} -- list of control functions.} -\item{\code{output} -- output of the model with information on the estimated population mean and standard errors.} -\item{\code{SE} -- standard error of the estimator of the population mean, divided into errors from probability and non-probability samples.} -\item{\code{confidence_interval} -- confidence interval of population mean estimator.} -\item{\code{nonprob_size} -- size of non-probability sample.} -\item{\code{prob_size} -- size of probability sample.} -\item{\code{pop_size} -- estimated population size derived from estimated weights (non-probability sample) or known design weights (probability sample).} -\item{\code{pop_totals} -- the total values of the auxiliary variables derived from a probability sample or vector of total/mean values.} -\item{\code{outcome} -- list containing information about the fitting of the mass imputation model, in the case of regression model the object containing the list returned by +\item{\code{X} -- a \code{model.matrix} containing data from probability and non-probability samples if specified at a function call.} +\item{\code{y} -- a \code{list} of 
vector of outcome variables if specified at a function call.} +\item{\code{R} -- a \verb{numeric vector} indicating whether a unit belongs to the probability (0) or non-probability (1) sample in the matrix X.} +\item{\code{prob} -- a \verb{numeric vector} of estimated propensity scores for non-probability sample.} +\item{\code{weights} -- a \code{vector} of estimated weights for non-probability sample.} +\item{\code{control} -- a \code{list} of control functions.} +\item{\code{output} -- an output of the model with information on the estimated population mean and standard errors.} +\item{\code{SE} -- a \code{data.frame} with standard error of the estimator of the population mean, divided into errors from probability and non-probability samples.} +\item{\code{confidence_interval} -- a \code{data.frame} with confidence interval of population mean estimator.} +\item{\code{nonprob_size} -- a scalar \verb{numeric vector} denoting the size of non-probability sample.} +\item{\code{prob_size} -- a scalar \verb{numeric vector} denoting the size of probability sample.} +\item{\code{pop_size} -- a scalar \verb{numeric vector} with the estimated population size derived from estimated weights (non-probability sample) or known design weights (probability sample).} +\item{\code{pop_totals} -- a \verb{numeric vector} with the total values of the auxiliary variables derived from a probability sample or a \verb{numeric vector} of the total/mean values.} +\item{\code{outcome} -- a \code{list} containing information about the fitting of the mass imputation model, in the case of a regression model, the object containing the list returned by \code{\link[stats:glm]{stats::glm()}}, in the case of the nearest neighbour imputation the object containing list returned by \code{\link[RANN:nn2]{RANN::nn2()}}. 
If \code{bias_correction} in \code{\link[=control_inf]{control_inf()}} is set to \code{TRUE}, the estimation is based on the joint estimating equations for the \code{selection} and \code{outcome} model and therefore, the list is different from the one returned by the \code{\link[stats:glm]{stats::glm()}} function and contains elements such as \itemize{ -\item{\code{coefficients} -- estimated coefficients of the regression model.} -\item{\code{std_err} -- standard errors of the estimated coefficients.} -\item{\code{residuals} -- The response residuals.} -\item{\code{variance_covariance} -- The variance-covariance matrix of the coefficient estimates.} -\item{\code{df_residual} -- The degrees of freedom for residuals.} -\item{\code{family} -- specifies the error distribution and link function to be used in the model.} -\item{\code{fitted.values} -- The predicted values of the response variable based on the fitted model.} -\item{\code{linear.predictors} -- The linear fit on link scale.} -\item{\code{X} -- The design matrix.} +\item{\code{coefficients} -- a \verb{numeric vector} with estimated coefficients of the regression model.} +\item{\code{std_err} -- a \verb{numeric vector} with standard errors of the estimated coefficients.} +\item{\code{residuals} -- a \verb{numeric vector} with the response residuals.} +\item{\code{variance_covariance} -- a \code{matrix} with the variance-covariance matrix of the coefficient estimates.} +\item{\code{df_residual} -- a scalar \code{vector} with the degrees of freedom for residuals.} +\item{\code{family} -- a \code{character} that specifies the error distribution and link function to be used in the model.} +\item{\code{fitted.values} -- a \verb{numeric vector} with the predicted values of the response variable based on the fitted model.} +\item{\code{linear.predictors} -- a \verb{numeric vector} with the linear fit on link scale.} +\item{\code{X} -- a \code{matrix} with the design matrix (\code{model.matrix})} \item{\code{method} 
-- set on \code{glm}, since the regression method.} -\item{\code{model_frame} -- Matrix of data from probability sample used for mass imputation.} +\item{\code{model_frame} -- a \code{model.matrix} of data from probability sample used for mass imputation.} } } In addition, if the variable selection model for the outcome variable is fitting, the list includes the \itemize{ \item{\code{cve} -- the error for each value of \code{lambda}, averaged across the cross-validation folds.} } -\item{\code{selection} -- list containing information about fitting of propensity score model, such as +\item{\code{selection} -- a \code{list} containing information about the fitting of the propensity score model, such as \itemize{ -\item{\code{coefficients} -- a named vector of coefficients.} -\item{\code{std_err} -- standard errors of the estimated model coefficients.} -\item{\code{residuals} -- the response residuals.} -\item{\code{variance} -- the root mean square error.} -\item{\code{fitted_values} -- the fitted mean values, obtained by transforming the linear predictors by the inverse of the link function.} +\item{\code{coefficients} -- a \verb{numeric vector} of coefficients.} +\item{\code{std_err} -- a \verb{numeric vector} with standard errors of the estimated model coefficients.} +\item{\code{residuals} -- a \verb{numeric vector} with the response residuals.} +\item{\code{variance} -- a scalar \verb{numeric vector} with the root mean square error.} +\item{\code{fitted_values} -- a \verb{numeric vector} with the fitted mean values, obtained by transforming the linear predictors by the inverse of the link function.} \item{\code{link} -- the \code{link} object used.} -\item{\code{linear_predictors} -- the linear fit on link scale.} +\item{\code{linear_predictors} -- a \verb{numeric vector} with the linear fit on link scale.} \item{\code{aic} -- A version of Akaike's An Information Criterion, minus twice the maximized log-likelihood plus twice the number of parameters.} -\item{\code{weights} 
-- vector of estimated weights for non-probability sample.} -\item{\code{prior.weights} -- the weights initially supplied, a vector of 1s if none were.} -\item{\code{est_totals} -- the estimated total values of auxiliary variables derived from a non-probability sample}. +\item{\code{weights} -- a \verb{numeric vector} with estimated weights for non-probability sample.} +\item{\code{prior.weights} -- a \verb{numeric vector} with the frequency weights initially supplied, a vector of 1s if none were.} +\item{\code{est_totals} -- a \verb{numeric vector} with the estimated total values of auxiliary variables derived from a non-probability sample.} \item{\code{formula} -- the formula supplied.} \item{\code{df_residual} -- the residual degrees of freedom.} \item{\code{log_likelihood} -- value of log-likelihood function if \code{mle} method, in the other case \code{NA}.} @@ -146,27 +145,29 @@ when the propensity score model is fitting. Returned only if selection of variab \item{\code{gradient} -- Gradient of the log-likelihood function from \code{mle} method.} \item{\code{method} -- An estimation method for selection model, e.g. 
\code{mle} or \code{gee}.} \item{\code{prob_der} -- Derivative of the inclusion probability function for units in a non--probability sample.} -\item{\code{prob_rand} -- Inclusion probabilities for unit from a probabiliy sample from \code{svydesign} object.} -\item{\code{prob_rand_est} -- Inclusion probabilites to a non--probabiliy sample for unit from probability sample.} -\item{\code{prob_rand_est_der} -- Derivative of the inclusion probabilites to a non--probabiliy sample for unit from probability sample.} +\item{\code{prob_rand} -- Inclusion probabilities for units from the probability sample, taken from the \code{svydesign} object.} +\item{\code{prob_rand_est} -- Inclusion probabilities in the non-probability sample for units from the probability sample.} +\item{\code{prob_rand_est_der} -- Derivative of the inclusion probabilities in the non-probability sample for units from the probability sample.} } } -\item{\code{stat} -- matrix of the estimated population means in each bootstrap iteration. +\item{\code{stat} -- a \code{matrix} of the estimated population means in each bootstrap iteration. Returned only if a bootstrap method is used to estimate the variance and \code{keep_boot} in \code{\link[=control_inf]{control_inf()}} is set on \code{TRUE}.} } } \description{ -\code{nonprob} fits a model for inference based on non-probability surveys (including big data) using various methods. -The function allows you to estimate the population mean with access to a reference probability sample, as well as sums and means of covariates. +The \code{nonprob} function provides access to various methods for inference based on non-probability surveys (including big data). The function allows the user to estimate the population mean with access to a reference probability sample (via the \code{survey} package), as well as totals or means of covariates. The package implements state-of-the-art approaches recently proposed in the literature: Chen et al. (2020), -Yang et al. 
(2020), Wu (2022) and uses the \href{https://CRAN.R-project.org/package=survey}{Lumley 2004} \code{survey} package for inference. +Yang et al. (2020), Wu (2022) and uses the \href{https://CRAN.R-project.org/package=survey}{Lumley 2004} \code{survey} package for inference (if a reference probability sample is provided). + +It provides various propensity score weighting (e.g. with calibration constraints), mass imputation (e.g. nearest neighbour, predictive mean matching) and doubly robust estimators (e.g. estimators minimising the asymptotic bias of the population mean estimator). -It provides propensity score weighting (e.g. with calibration constraints), mass imputation (e.g. nearest neighbour) and -doubly robust estimators that take into account minimisation of the asymptotic bias of the population mean estimators or -variable selection. The package uses the \code{survey} package functionality when a probability sample is available. + +All optional parameters are set to \code{NULL}. The obligatory ones include \code{data} as well as one of the following three: +\code{selection}, \code{outcome}, or \code{target} -- depending on which method has been selected. +In the case of \code{outcome} and \code{target}, multiple \mjseqn{y} variables can be specified. } \details{ Let \mjseqn{y} be the response variable for which we want to estimate the population mean, @@ -212,7 +213,7 @@ In general we make the following assumptions: There are three possible approaches to the problem of estimating population mean using non-probability samples: \enumerate{ -\item Inverse probability weighting - The main drawback of non-probability sampling is the unknown selection mechanism for a unit to be included in the sample. +\item Inverse probability weighting -- The main drawback of non-probability sampling is the unknown selection mechanism for a unit to be included in the sample. This is why we talk about the so-called "biased sample" problem. 
The inverse probability approach is based on the assumption that a reference probability sample is available and therefore we can estimate the propensity score of the selection mechanism. The estimator has the following form: @@ -267,7 +268,7 @@ this method to \code{cloglog} and \code{probit} links. As it is not straightforward to calculate the variances of these estimators, asymptotic equivalents of the variances derived using the Taylor approximation have been proposed in the literature. Details can be found \href{https://ncn-foreigners.github.io/nonprobsvy-book/intro.html}{here}. -In addition, a bootstrap approach can be used for variance estimation. +In addition, the bootstrap approach can be used for variance estimation. The function also allows variables selection using known methods that have been implemented to handle the integration of probability and non-probability sampling. In the presence of high-dimensional data, variable selection is important, because it can reduce the variability in the estimate that results from using irrelevant variables to build the model. @@ -400,7 +401,7 @@ estimation process of the bias minimization approach. \code{\link[=control_inf]{control_inf()}} -- For the control parameters related to statistical inference. } \author{ -Łukasz Chrostowski, Maciej Beręsewicz +Łukasz Chrostowski, Maciej Beręsewicz, Piotr Chlebicki \loadmathjax } diff --git a/man/probit_model_nonprobsvy.Rd b/man/probit_model_nonprobsvy.Rd index 89f176a..e64b776 100644 --- a/man/probit_model_nonprobsvy.Rd +++ b/man/probit_model_nonprobsvy.Rd @@ -1,5 +1,5 @@ % Generated by roxygen2: do not edit by hand -% Please edit documentation in R/probitModel.R +% Please edit documentation in R/model_probit.R \name{probit_model_nonprobsvy} \alias{probit_model_nonprobsvy} \title{Probit model for weights adjustment}
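
---

Note for reviewers (not part of the patch): since this change mostly reworks the `nonprob()` documentation, a minimal usage sketch of the documented interface may help when reviewing. This is a sketch only; the `jvs`/`admin` datasets are the ones added in 0.2, but the column names used below (`weight`, `region`, `private`, `nace`, `size`, `single_shift`) are assumptions based on the package examples and may differ.

```r
# Sketch, not a definitive example: dataset column names are assumed.
library(survey)
library(nonprobsvy)

# Reference probability sample (Job Vacancy Survey) wrapped as a svydesign
# object, as required by the `svydesign` argument of nonprob().
jvs_svy <- svydesign(ids = ~1, weights = ~weight, data = jvs)

# Inverse probability weighting estimator: a selection model is fitted for
# the non-probability sample (admin) and the target variable(s) are given
# via the `target` formula, per the documented interface.
est_ipw <- nonprob(
  selection = ~ region + private + nace + size,
  target    = ~ single_shift,
  data      = admin,
  svydesign = jvs_svy,
  method_selection = "logit"
)

summary(est_ipw)
```

The same call pattern applies to the mass imputation (`outcome = ...`) and doubly robust (`selection = ...` together with `outcome = ...`) estimators described in the documentation.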