From 915c5f38a12bc1540bf5b66fd569b9123685bb47 Mon Sep 17 00:00:00 2001 From: LukaszChrostowski Date: Tue, 12 Nov 2024 19:27:34 +0100 Subject: [PATCH] small changes in documentation before CRAN submission --- NEWS.md | 3 ++ R/main_function_documentation.R | 69 +++++++++++++++++++-------------- R/summary.R | 2 +- man/nonprob.Rd | 69 +++++++++++++++++++-------------- man/summary.nonprobsvy.Rd | 2 +- 5 files changed, 83 insertions(+), 62 deletions(-) diff --git a/NEWS.md b/NEWS.md index c11f086..868c337 100644 --- a/NEWS.md +++ b/NEWS.md @@ -9,6 +9,9 @@ - add estimation of exact standard error for k-nearest neighbor estimator. - add breaking change to `controlOut` function by switching values for `predictive_match` argument. From now on, the `predictive_match = 1` means $\hat{y}-\hat{y}$ in predictive mean matching imputation and `predictive_match = 2` corresponds to $\hat{y}-y$ matching. - implement `div` option when variable selection (more in documentation) for doubly robust estimation. + - add more insights to `nonprob` output such as the gradient, Hessian and Jacobian derived from IPW estimation for the `mle` and `gee` methods when the `IPW` or `DR` model is executed. + - add estimated inclusion probabilities and their derivatives for probability and non-probability samples to `nonprob` output when the `IPW` or `DR` model is executed. + - add `model_frame` matrix data from the probability sample used for mass imputation to `nonprob` output when the `MI` or `DR` model is executed. ## nonprobsvy 0.1.0 diff --git a/R/main_function_documentation.R b/R/main_function_documentation.R index 3fcb96d..c16d6ce 100644 --- a/R/main_function_documentation.R +++ b/R/main_function_documentation.R @@ -8,9 +8,9 @@ NULL #' The function allows you to estimate the population mean with access to a reference probability sample, as well as sums and means of covariates. #' #' The package implements state-of-the-art approaches recently proposed in the literature: Chen et al. (2020), -#' Yang et al.
(2020), Wu (2022) and use the [Lumley 2004](https://CRAN.R-project.org/package=survey) `survey` package for inference. +#' Yang et al. (2020), Wu (2022) and uses the [Lumley 2004](https://CRAN.R-project.org/package=survey) `survey` package for inference. #' -#' It provides propensity score weighting (e.g. with calibration constraints), mass imputation (e.g. nearest neighbor) and +#' It provides propensity score weighting (e.g. with calibration constraints), mass imputation (e.g. nearest neighbour) and #' doubly robust estimators that take into account minimisation of the asymptotic bias of the population mean estimators or #' variable selection. #' The package uses `survey` package functionality when a probability sample is available. @@ -24,19 +24,19 @@ NULL #' @param pop_totals an optional `named vector` with population totals of the covariates. #' @param pop_means an optional `named vector` with population means of the covariates. #' @param pop_size an optional `double` with population size. -#' @param method_selection a `character` with method for propensity scores estimation -#' @param method_outcome a `character` with method for response variable estimation +#' @param method_selection a `character` with method for propensity scores estimation. +#' @param method_outcome a `character` with method for response variable estimation. #' @param family_outcome a `character` string describing the error distribution and link function to be used in the model. Default is "gaussian". Currently supports: gaussian with identity link, poisson and binomial. #' @param subset an optional `vector` specifying a subset of observations to be used in the fitting process. #' @param strata an optional `vector` specifying strata. -#' @param weights an optional `vector` of prior weights to be used in the fitting process. Should be NULL or a numeric vector. 
It is assumed that this vector contains frequency or analytic weights +#' @param weights an optional `vector` of prior weights to be used in the fitting process. Should be NULL or a numeric vector. It is assumed that this vector contains frequency or analytic weights. #' @param na_action a function which indicates what should happen when the data contain `NAs`. -#' @param control_selection a `list` indicating parameters to use in fitting selection model for propensity scores -#' @param control_outcome a `list` indicating parameters to use in fitting model for outcome variable -#' @param control_inference a `list` indicating parameters to use in inference based on probability and non-probability samples, contains parameters such as estimation method or variance method -#' @param start_selection an optional `vector` with starting values for the parameters of the selection equation -#' @param start_outcome an optional `vector` with starting values for the parameters of the outcome equation -#' @param verbose verbose, numeric +#' @param control_selection a `list` indicating parameters to use in fitting selection model for propensity scores. +#' @param control_outcome a `list` indicating parameters to use in fitting model for outcome variable. +#' @param control_inference a `list` indicating parameters to use in inference based on probability and non-probability samples, contains parameters such as estimation method or variance method. +#' @param start_selection an optional `vector` with starting values for the parameters of the selection equation. +#' @param start_outcome an optional `vector` with starting values for the parameters of the outcome equation. +#' @param verbose verbose, numeric. #' @param x Logical value indicating whether to return model matrix of covariates as a part of output. #' @param y Logical value indicating whether to return vector of outcome variable as a part of output. 
#' @param se Logical value indicating whether to calculate and return standard error of estimated mean. @@ -188,25 +188,26 @@ NULL #' \item{\code{control} -- list of control functions.} #' \item{\code{output} -- output of the model with information on the estimated population mean and standard errors.} #' \item{\code{SE} -- standard error of the estimator of the population mean, divided into errors from probability and non-probability samples.} -#' \item{\code{confidence_interval} -- confidence interval of population mean estimator} -#' \item{\code{nonprob_size} -- size of non-probability sample} -#' \item{\code{prob_size} -- size of probability sample} -#' \item{\code{pop_size} -- estimated population size derived from estimated weights (non-probability sample) or known design weights (probability sample)} +#' \item{\code{confidence_interval} -- confidence interval of population mean estimator.} +#' \item{\code{nonprob_size} -- size of non-probability sample.} +#' \item{\code{prob_size} -- size of probability sample.} +#' \item{\code{pop_size} -- estimated population size derived from estimated weights (non-probability sample) or known design weights (probability sample).} #' \item{\code{pop_totals} -- the total values of the auxiliary variables derived from a probability sample or vector of total/mean values.} #' \item{\code{outcome} -- list containing information about the fitting of the mass imputation model, in the case of regression model the object containing the list returned by #' [stats::glm()], in the case of the nearest neighbour imputation the object containing list returned by [RANN::nn2()]. 
If `bias_correction` in [controlInf()] is set to `TRUE`, the estimation is based on #' the joint estimating equations for the `selection` and `outcome` model and therefore, the list is different from the one returned by the [stats::glm()] function and contains elements such as #' \itemize{ -#' \item{\code{coefficients} -- estimated coefficients of the regression model} -#' \item{\code{std_err} -- standard errors of the estimated coefficients} -#' \item{\code{residuals} -- The response residuals} -#' \item{\code{variance_covariance} -- The variance-covariance matrix of the coefficient estimates} -#' \item{\code{df_residual} -- The degrees of freedom for residuals} -#' \item{\code{family} -- specifies the error distribution and link function to be used in the model} -#' \item{\code{fitted.values} -- The predicted values of the response variable based on the fitted model} -#' \item{\code{linear.predictors} -- The linear fit on link scale} -#' \item{\code{X} -- The design matrix} -#' \item{\code{method} -- set on `glm`, since the regression method} +#' \item{\code{coefficients} -- estimated coefficients of the regression model.} +#' \item{\code{std_err} -- standard errors of the estimated coefficients.} +#' \item{\code{residuals} -- The response residuals.} +#' \item{\code{variance_covariance} -- The variance-covariance matrix of the coefficient estimates.} +#' \item{\code{df_residual} -- The degrees of freedom for residuals.} +#' \item{\code{family} -- specifies the error distribution and link function to be used in the model.} +#' \item{\code{fitted.values} -- The predicted values of the response variable based on the fitted model.} +#' \item{\code{linear.predictors} -- The linear fit on link scale.} +#' \item{\code{X} -- The design matrix.} +#' \item{\code{method} -- set on `glm`, since the regression method.} +#' \item{\code{model_frame} -- Matrix of data from probability sample used for mass imputation.} #' } #' } #' In addition, if the variable selection model 
for the outcome variable is fitting, the list includes the @@ -215,22 +216,30 @@ NULL #' } #' \item{\code{selection} -- list containing information about fitting of propensity score model, such as #' \itemize{ -#' \item{\code{coefficients} -- a named vector of coefficients} -#' \item{\code{std_err} -- standard errors of the estimated model coefficients} -#' \item{\code{residuals} -- the response residuals} -#' \item{\code{variance} -- the root mean square error} +#' \item{\code{coefficients} -- a named vector of coefficients.} +#' \item{\code{std_err} -- standard errors of the estimated model coefficients.} +#' \item{\code{residuals} -- the response residuals.} +#' \item{\code{variance} -- the root mean square error.} #' \item{\code{fitted_values} -- the fitted mean values, obtained by transforming the linear predictors by the inverse of the link function.} #' \item{\code{link} -- the `link` object used.} #' \item{\code{linear_predictors} -- the linear fit on link scale.} #' \item{\code{aic} -- A version of Akaike's An Information Criterion, minus twice the maximized log-likelihood plus twice the number of parameters.} #' \item{\code{weights} -- vector of estimated weights for non-probability sample.} #' \item{\code{prior.weights} -- the weights initially supplied, a vector of 1s if none were.} -#' \item{\code{est_totals} -- the estimated total values of auxiliary variables derived from a non-probability sample.} +#' \item{\code{est_totals} -- the estimated total values of auxiliary variables derived from a non-probability sample}. #' \item{\code{formula} -- the formula supplied.} #' \item{\code{df_residual} -- the residual degrees of freedom.} #' \item{\code{log_likelihood} -- value of log-likelihood function if `mle` method, in the other case `NA`.} #' \item{\code{cve} -- the error for each value of the `lambda`, averaged across the cross-validation folds for the variable selection model #' when the propensity score model is fitting. 
Returned only if selection of variables for the model is used.} +#' \item{\code{method_selection} -- Link function, e.g. `logit`, `cloglog` or `probit`.} +#' \item{\code{hessian} -- Hessian of the log-likelihood function from `mle` method.} +#' \item{\code{gradient} -- Gradient of the log-likelihood function from `mle` method.} +#' \item{\code{method} -- An estimation method for selection model, e.g. `mle` or `gee`.} +#' \item{\code{prob_der} -- Derivative of the inclusion probability function for units in a non-probability sample.} +#' \item{\code{prob_rand} -- Inclusion probabilities for units from a probability sample from `svydesign` object.} +#' \item{\code{prob_rand_est} -- Inclusion probabilities to a non-probability sample for units from the probability sample.} +#' \item{\code{prob_rand_est_der} -- Derivative of the inclusion probabilities to a non-probability sample for units from the probability sample.} #' } #' } #' \item{\code{stat} -- matrix of the estimated population means in each bootstrap iteration. diff --git a/R/summary.R b/R/summary.R index f80ea35..1d3b9ec 100644 --- a/R/summary.R +++ b/R/summary.R @@ -15,7 +15,7 @@ #' \item \code{call} -- A call which created \code{object}. #' \item \code{pop_total} -- A list containing information about the estimated population mean, its standard error and confidence interval. #' \item \code{sample_size} -- The size of the samples used in the model. -#' \item \code{population_size} -- The estimated size of the population from which the nonoprobability sample was drawn. +#' \item \code{population_size} -- The estimated size of the population from which the non-probability sample was drawn. #' \item \code{test} -- Type of statistical test performed. #' \item \code{control} -- A List of control parameters used in fitting the model. #' \item \code{model} -- A descriptive name of the model used, e.g., "Doubly-Robust", "Inverse probability weighted", or "Mass Imputation".
diff --git a/man/nonprob.Rd b/man/nonprob.Rd index b9cb25c..358e7d0 100644 --- a/man/nonprob.Rd +++ b/man/nonprob.Rd @@ -49,9 +49,9 @@ nonprob( \item{pop_size}{an optional \code{double} with population size.} -\item{method_selection}{a \code{character} with method for propensity scores estimation} +\item{method_selection}{a \code{character} with method for propensity scores estimation.} -\item{method_outcome}{a \code{character} with method for response variable estimation} +\item{method_outcome}{a \code{character} with method for response variable estimation.} \item{family_outcome}{a \code{character} string describing the error distribution and link function to be used in the model. Default is "gaussian". Currently supports: gaussian with identity link, poisson and binomial.} @@ -59,21 +59,21 @@ nonprob( \item{strata}{an optional \code{vector} specifying strata.} -\item{weights}{an optional \code{vector} of prior weights to be used in the fitting process. Should be NULL or a numeric vector. It is assumed that this vector contains frequency or analytic weights} +\item{weights}{an optional \code{vector} of prior weights to be used in the fitting process. Should be NULL or a numeric vector. 
It is assumed that this vector contains frequency or analytic weights.} \item{na_action}{a function which indicates what should happen when the data contain \code{NAs}.} -\item{control_selection}{a \code{list} indicating parameters to use in fitting selection model for propensity scores} +\item{control_selection}{a \code{list} indicating parameters to use in fitting selection model for propensity scores.} -\item{control_outcome}{a \code{list} indicating parameters to use in fitting model for outcome variable} +\item{control_outcome}{a \code{list} indicating parameters to use in fitting model for outcome variable.} -\item{control_inference}{a \code{list} indicating parameters to use in inference based on probability and non-probability samples, contains parameters such as estimation method or variance method} +\item{control_inference}{a \code{list} indicating parameters to use in inference based on probability and non-probability samples, contains parameters such as estimation method or variance method.} -\item{start_selection}{an optional \code{vector} with starting values for the parameters of the selection equation} +\item{start_selection}{an optional \code{vector} with starting values for the parameters of the selection equation.} -\item{start_outcome}{an optional \code{vector} with starting values for the parameters of the outcome equation} +\item{start_outcome}{an optional \code{vector} with starting values for the parameters of the outcome equation.} -\item{verbose}{verbose, numeric} +\item{verbose}{verbose, numeric.} \item{x}{Logical value indicating whether to return model matrix of covariates as a part of output.} @@ -97,25 +97,26 @@ with type \code{list} containing:\cr \item{\code{control} -- list of control functions.} \item{\code{output} -- output of the model with information on the estimated population mean and standard errors.} \item{\code{SE} -- standard error of the estimator of the population mean, divided into errors from probability and 
non-probability samples.} -\item{\code{confidence_interval} -- confidence interval of population mean estimator} -\item{\code{nonprob_size} -- size of non-probability sample} -\item{\code{prob_size} -- size of probability sample} -\item{\code{pop_size} -- estimated population size derived from estimated weights (non-probability sample) or known design weights (probability sample)} +\item{\code{confidence_interval} -- confidence interval of population mean estimator.} +\item{\code{nonprob_size} -- size of non-probability sample.} +\item{\code{prob_size} -- size of probability sample.} +\item{\code{pop_size} -- estimated population size derived from estimated weights (non-probability sample) or known design weights (probability sample).} \item{\code{pop_totals} -- the total values of the auxiliary variables derived from a probability sample or vector of total/mean values.} \item{\code{outcome} -- list containing information about the fitting of the mass imputation model, in the case of regression model the object containing the list returned by \code{\link[stats:glm]{stats::glm()}}, in the case of the nearest neighbour imputation the object containing list returned by \code{\link[RANN:nn2]{RANN::nn2()}}. 
If \code{bias_correction} in \code{\link[=controlInf]{controlInf()}} is set to \code{TRUE}, the estimation is based on the joint estimating equations for the \code{selection} and \code{outcome} model and therefore, the list is different from the one returned by the \code{\link[stats:glm]{stats::glm()}} function and contains elements such as \itemize{ -\item{\code{coefficients} -- estimated coefficients of the regression model} -\item{\code{std_err} -- standard errors of the estimated coefficients} -\item{\code{residuals} -- The response residuals} -\item{\code{variance_covariance} -- The variance-covariance matrix of the coefficient estimates} -\item{\code{df_residual} -- The degrees of freedom for residuals} -\item{\code{family} -- specifies the error distribution and link function to be used in the model} -\item{\code{fitted.values} -- The predicted values of the response variable based on the fitted model} -\item{\code{linear.predictors} -- The linear fit on link scale} -\item{\code{X} -- The design matrix} -\item{\code{method} -- set on \code{glm}, since the regression method} +\item{\code{coefficients} -- estimated coefficients of the regression model.} +\item{\code{std_err} -- standard errors of the estimated coefficients.} +\item{\code{residuals} -- The response residuals.} +\item{\code{variance_covariance} -- The variance-covariance matrix of the coefficient estimates.} +\item{\code{df_residual} -- The degrees of freedom for residuals.} +\item{\code{family} -- specifies the error distribution and link function to be used in the model.} +\item{\code{fitted.values} -- The predicted values of the response variable based on the fitted model.} +\item{\code{linear.predictors} -- The linear fit on link scale.} +\item{\code{X} -- The design matrix.} +\item{\code{method} -- set on \code{glm}, since the regression method.} +\item{\code{model_frame} -- Matrix of data from probability sample used for mass imputation.} } } In addition, if the variable selection model 
for the outcome variable is fitting, the list includes the @@ -124,22 +125,30 @@ In addition, if the variable selection model for the outcome variable is fitting } \item{\code{selection} -- list containing information about fitting of propensity score model, such as \itemize{ -\item{\code{coefficients} -- a named vector of coefficients} -\item{\code{std_err} -- standard errors of the estimated model coefficients} -\item{\code{residuals} -- the response residuals} -\item{\code{variance} -- the root mean square error} +\item{\code{coefficients} -- a named vector of coefficients.} +\item{\code{std_err} -- standard errors of the estimated model coefficients.} +\item{\code{residuals} -- the response residuals.} +\item{\code{variance} -- the root mean square error.} \item{\code{fitted_values} -- the fitted mean values, obtained by transforming the linear predictors by the inverse of the link function.} \item{\code{link} -- the \code{link} object used.} \item{\code{linear_predictors} -- the linear fit on link scale.} \item{\code{aic} -- A version of Akaike's An Information Criterion, minus twice the maximized log-likelihood plus twice the number of parameters.} \item{\code{weights} -- vector of estimated weights for non-probability sample.} \item{\code{prior.weights} -- the weights initially supplied, a vector of 1s if none were.} -\item{\code{est_totals} -- the estimated total values of auxiliary variables derived from a non-probability sample.} +\item{\code{est_totals} -- the estimated total values of auxiliary variables derived from a non-probability sample}. \item{\code{formula} -- the formula supplied.} \item{\code{df_residual} -- the residual degrees of freedom.} \item{\code{log_likelihood} -- value of log-likelihood function if \code{mle} method, in the other case \code{NA}.} \item{\code{cve} -- the error for each value of the \code{lambda}, averaged across the cross-validation folds for the variable selection model when the propensity score model is fitting. 
Returned only if selection of variables for the model is used.} +\item{\code{method_selection} -- Link function, e.g. \code{logit}, \code{cloglog} or \code{probit}.} +\item{\code{hessian} -- Hessian of the log-likelihood function from \code{mle} method.} +\item{\code{gradient} -- Gradient of the log-likelihood function from \code{mle} method.} +\item{\code{method} -- An estimation method for selection model, e.g. \code{mle} or \code{gee}.} +\item{\code{prob_der} -- Derivative of the inclusion probability function for units in a non-probability sample.} +\item{\code{prob_rand} -- Inclusion probabilities for units from a probability sample from \code{svydesign} object.} +\item{\code{prob_rand_est} -- Inclusion probabilities to a non-probability sample for units from the probability sample.} +\item{\code{prob_rand_est_der} -- Derivative of the inclusion probabilities to a non-probability sample for units from the probability sample.} } } \item{\code{stat} -- matrix of the estimated population means in each bootstrap iteration. @@ -152,9 +161,9 @@ Returned only if a bootstrap method is used to estimate the variance and \code{k The function allows you to estimate the population mean with access to a reference probability sample, as well as sums and means of covariates. The package implements state-of-the-art approaches recently proposed in the literature: Chen et al. (2020), -Yang et al. (2020), Wu (2022) and use the \href{https://CRAN.R-project.org/package=survey}{Lumley 2004} \code{survey} package for inference. +Yang et al. (2020), Wu (2022) and uses the \href{https://CRAN.R-project.org/package=survey}{Lumley 2004} \code{survey} package for inference. -It provides propensity score weighting (e.g. with calibration constraints), mass imputation (e.g. nearest neighbor) and +It provides propensity score weighting (e.g. with calibration constraints), mass imputation (e.g.
nearest neighbour) and doubly robust estimators that take into account minimisation of the asymptotic bias of the population mean estimators or variable selection. The package uses \code{survey} package functionality when a probability sample is available. diff --git a/man/summary.nonprobsvy.Rd b/man/summary.nonprobsvy.Rd index c138af7..b47af02 100644 --- a/man/summary.nonprobsvy.Rd +++ b/man/summary.nonprobsvy.Rd @@ -27,7 +27,7 @@ An object of \code{summary_nonprobsvy} class containing: \item \code{call} -- A call which created \code{object}. \item \code{pop_total} -- A list containing information about the estimated population mean, its standard error and confidence interval. \item \code{sample_size} -- The size of the samples used in the model. -\item \code{population_size} -- The estimated size of the population from which the nonoprobability sample was drawn. +\item \code{population_size} -- The estimated size of the population from which the non-probability sample was drawn. \item \code{test} -- Type of statistical test performed. \item \code{control} -- A List of control parameters used in fitting the model. \item \code{model} -- A descriptive name of the model used, e.g., "Doubly-Robust", "Inverse probability weighted", or "Mass Imputation".