New fn to compare data to dictionary of allowed values and return tidy data frame of non-matching values #12

Open
patrickbarks opened this issue May 11, 2020 · 1 comment

@patrickbarks

(This is loosely related to reconhub/linelist#49 perhaps)

I'm realizing that my workflow for generating and updating dictionaries to use with matchmaker::match_df() is inefficient. I basically just rely on the warnings printed to the console to identify invalid value/variable pairs, then copy them into my .csv recoding dictionary.

A better approach might be a function that compares a data frame to a dictionary of allowed values and returns a tidy data frame of non-allowed values (if any) by variable, which could then be more easily appended to the recoding dictionary.

I've added an initial version to my branch (tentatively called check_df()). If you're interested I can add tests etc. and submit a PR. E.g.

library(matchmaker)
library(tibble)

dict <- tibble::tribble(
  ~value    , ~variable         ,
  "Yes"     , "readmission"     ,
  "No"      , "readmission"     ,
  "Unknown" , "readmission"     ,
  NA        , "readmission"     ,
  "Hosp. 1" , "facility"        ,
  "Hosp. 2" , "facility"        ,
  "Positive", ".regex ^lab_res_",
  "Negative", ".regex ^lab_res_",
  "Inc."    , ".regex ^lab_res_"
)

dat <- tibble::tribble(
  ~id , ~readmission, ~facility, ~lab_res_1, ~lab_res_2,
  "P1", "yes"       , "Hosp. 1" , "Positive", "Negative",
  "P2", "No"        , "Hosp. 2" , "Negative", "negative",
  "P3", NA          , "H2"      , "inc",      "Negative",
  "P4", "Yes"       , "H1"      , NA,         "Negative",
  "P5", "No"        , "Hosp. 1" , "pos",      "Positive"
)

# compare data to dictionary and return non-allowed values
check_df(dat, dict)
#>      value         variable
#> 1       H1         facility
#> 2       H2         facility
#> 3      yes      readmission
#> 4      inc .regex ^lab_res_
#> 5      pos .regex ^lab_res_
#> 6     <NA> .regex ^lab_res_
#> 7 negative .regex ^lab_res_

# also return corresponding allowed values, collapsed to string
check_df(dat, dict, return_allowed = TRUE)
#>      value         variable           values_allowed
#> 1       H1         facility         Hosp. 1; Hosp. 2
#> 2       H2         facility         Hosp. 1; Hosp. 2
#> 3      yes      readmission     Yes; No; Unknown; NA
#> 4      inc .regex ^lab_res_ Positive; Negative; Inc.
#> 5      pos .regex ^lab_res_ Positive; Negative; Inc.
#> 6     <NA> .regex ^lab_res_ Positive; Negative; Inc.
#> 7 negative .regex ^lab_res_ Positive; Negative; Inc.

Created on 2020-05-11 by the reprex package (v0.3.0)
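
For anyone reading along without the branch checked out, the comparison described above could be sketched in base R roughly as follows. This is only an illustration of the idea, not the code on the branch: the name check_df_sketch() is hypothetical, and it assumes the dictionary has columns named value and variable, with keys prefixed by ".regex " treated as patterns matched against column names. Row order may differ from the output shown above.

```r
# Hypothetical sketch of the idea; NOT the check_df() implementation on the branch.
# Assumes `dict` has character columns `value` and `variable`.
check_df_sketch <- function(x, dict, return_allowed = FALSE) {
  keys <- unique(dict$variable)
  out <- lapply(keys, function(key) {
    allowed <- dict$value[dict$variable == key]
    # Assumed convention: keys starting with ".regex " select columns by pattern
    if (startsWith(key, ".regex ")) {
      cols <- grep(sub("^\\.regex ", "", key), names(x), value = TRUE)
    } else {
      cols <- intersect(key, names(x))
    }
    # Pool the distinct values observed across all matching columns
    observed <- unique(as.character(unlist(lapply(cols, function(col) x[[col]]))))
    # %in% matches NA to NA, so NA is only flagged when the dictionary omits it
    bad <- observed[!observed %in% allowed]
    res <- data.frame(
      value = bad,
      variable = rep(key, length(bad)),
      stringsAsFactors = FALSE
    )
    if (return_allowed) {
      res$values_allowed <- rep(paste(allowed, collapse = "; "), length(bad))
    }
    res
  })
  do.call(rbind, out)
}
```

Calling check_df_sketch(dat, dict) on the example data above should flag the same value/variable pairs as the reprex, and the result could then be appended to the recoding dictionary .csv after filling in the replacement values.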

@zkamvar
Member

zkamvar commented May 12, 2020

Hi @patrickbarks,

Just checking in to say that I've seen this and I think it's a worthwhile idea, thank you! Please do submit a PR.

For future reference, it would be better if you created a separate branch on your fork for new features.
