New fn to compare data to dictionary of allowed values and return tidy data frame of non-matching values #12

patrickbarks · 2020-05-11T13:58:25Z

(This is loosely related to reconhub/linelist#49 perhaps)

I'm realizing that my workflow for generating/updating dictionaries to use with matchmaker::match_df() is inefficient. I basically just rely on printed warnings in the console to identify non-valid value/variable pairs, and copy them into my .csv recoding dictionary.

A better approach might be a function that compares a data frame to a dictionary of allowed values, and returns a tidy data frame of non-allowed values (if any) by variable, which could then be more easily appended to the recoding dictionary.

I've added an initial version to my branch (tentatively called check_df()). If you're interested I can add tests etc. and submit a PR. E.g.

library(matchmaker)
library(tibble)

dict <- tibble::tribble(
  ~value    , ~variable         ,
  "Yes"     , "readmission"     ,
  "No"      , "readmission"     ,
  "Unknown" , "readmission"     ,
  NA        , "readmission"     ,
  "Hosp. 1" , "facility"        ,
  "Hosp. 2" , "facility"        ,
  "Positive", ".regex ^lab_res_",
  "Negative", ".regex ^lab_res_",
  "Inc."    , ".regex ^lab_res_"
)

dat <- tibble::tribble(
  ~id , ~readmission, ~facility, ~lab_res_1, ~lab_res_2,
  "P1", "yes"       , "Hosp. 1" , "Positive", "Negative",
  "P2", "No"        , "Hosp. 2" , "Negative", "negative",
  "P3", NA          , "H2"      , "inc",      "Negative",
  "P4", "Yes"       , "H1"      , NA,         "Negative",
  "P5", "No"        , "Hosp. 1" , "pos",      "Positive"
)

# compare data to dictionary and return non-allowed values
check_df(dat, dict)
#>      value         variable
#> 1       H1         facility
#> 2       H2         facility
#> 3      yes      readmission
#> 4      inc .regex ^lab_res_
#> 5      pos .regex ^lab_res_
#> 6     <NA> .regex ^lab_res_
#> 7 negative .regex ^lab_res_

# also return corresponding allowed values, collapsed to string
check_df(dat, dict, return_allowed = TRUE)
#>      value         variable           values_allowed
#> 1       H1         facility         Hosp. 1; Hosp. 2
#> 2       H2         facility         Hosp. 1; Hosp. 2
#> 3      yes      readmission     Yes; No; Unknown; NA
#> 4      inc .regex ^lab_res_ Positive; Negative; Inc.
#> 5      pos .regex ^lab_res_ Positive; Negative; Inc.
#> 6     <NA> .regex ^lab_res_ Positive; Negative; Inc.
#> 7 negative .regex ^lab_res_ Positive; Negative; Inc.

Created on 2020-05-11 by the reprex package (v0.3.0)
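
For readers who want a feel for the comparison step, here is a minimal base R sketch of the idea. It is not the code on the branch mentioned above: the function name `check_values_sketch()` is hypothetical, and the handling of the `.regex ` prefix is an assumption based only on how the example dictionary uses it. Rows come back in dictionary order, so the ordering may differ from the output shown above.

```r
# Hypothetical sketch: compare each dictionary variable's allowed values
# against the matching column(s) of `x` and collect values not in the dictionary.
check_values_sketch <- function(x, dict) {
  out <- list()
  for (key in unique(dict$variable)) {
    allowed <- dict$value[dict$variable == key]
    if (startsWith(key, ".regex ")) {
      # Assumption: the text after ".regex " is a pattern matched against column names
      cols <- grep(sub("^\\.regex ", "", key), names(x), value = TRUE)
    } else {
      cols <- intersect(key, names(x))
    }
    # Pool the unique values across all matching columns
    vals <- unique(unlist(lapply(cols, function(col) as.character(x[[col]]))))
    # %in% treats NA as matching NA, so NA is allowed only if the dictionary lists it
    bad <- vals[!vals %in% allowed]
    if (length(bad) > 0) {
      out[[key]] <- data.frame(value = bad, variable = key, stringsAsFactors = FALSE)
    }
  }
  if (length(out) == 0) {
    return(data.frame(value = character(0), variable = character(0),
                      stringsAsFactors = FALSE))
  }
  res <- do.call(rbind, out)
  rownames(res) <- NULL
  res
}

# Small illustrative inputs (made up for this sketch)
dict_small <- data.frame(
  value    = c("Yes", "No", NA, "Positive", "Negative"),
  variable = c("readmission", "readmission", "readmission",
               ".regex ^lab_res_", ".regex ^lab_res_"),
  stringsAsFactors = FALSE
)
dat_small <- data.frame(
  readmission = c("yes", "No", NA),
  lab_res_1   = c("Positive", "pos", NA),
  lab_res_2   = c("Negative", "negative", "Negative"),
  stringsAsFactors = FALSE
)

check_values_sketch(dat_small, dict_small)
```

Because the result has the same `value` and `variable` columns as the dictionary, it could be reviewed by hand and then bound onto the recoding dictionary rather than copied from console warnings.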


zkamvar · 2020-05-12T17:30:28Z

Hi @patrickbarks,

Just checking in to say that I've seen this and I think it's a worthwhile idea, thank you! Please do submit a PR.

For future reference, it would be better if you created a separate branch on your fork for new features.
