affirm_dupe_free updates
#46
base: main
Changes from all commits
ff6a18c
28e85df
89e2432
3845838
daf0863
44ddfe9
ce77fc9
b348fdf
8d8248f
fdd42fd
ea4b9a3
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -4,3 +4,4 @@ | ||
| .Ruserdata | ||
| docs | ||
| inst/doc | ||
| +affirm.Rproj | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,6 +1,6 @@ | ||
| Package: affirm | ||
| Title: Secular affirmations against data | ||
| -Version: 0.2.1 | ||
| +Version: 0.3.0 | ||
| Authors@R: c( | ||
| person("Daniel D.", "Sjoberg", , "[email protected]", role = "aut", | ||
| comment = c(ORCID = "0000-0003-0862-2018")), | ||
| @@ -39,4 +39,4 @@ Config/testthat/edition: 3 | ||
| Encoding: UTF-8 | ||
| LazyData: true | ||
| Roxygen: list(markdown = TRUE) | ||
| -RoxygenNote: 7.3.2 | ||
| +RoxygenNote: 7.3.3 | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,118 @@ | ||
| #' Affirm No Duplicates | ||
| #' | ||
| #' A wrapper for `affirm_true()`. | ||
| #' The `columns` argument specifies which columns to check for duplicates. The function | ||
| #' creates a record ID, `record_id`, for each row, then identifies whether each row is | ||
| #' the first occurrence of a unique combination of values in the specified columns. | ||
| #' The resulting logical vector (`TRUE` for a first occurrence) is passed to | ||
| #' `affirm_true()` as the condition. | ||
| #' | ||
| #' @inheritParams affirm_true | ||
| #' @param columns columns to check for duplicates | ||
| #' @param id,priority,data_frames Optional additional information that will be passed to affirmation report. | ||
| #' - `id` must be an integer, e.g. `id = 1L` | ||
| #' - `priority` must be an integer, e.g. `priority = 1L` | ||
| #' - `data_frames` string of data frame names used in affirmation, e.g. `data_frames = "RAND, DM"` | ||
| #' | ||
| #' @return data frame | ||
| #' @export | ||
| #' @family Data Affirmations | ||
| #' | ||
| #' @section Using `affirm_dupe_free()` to detect duplicate values in specified columns: | ||
| #' `affirm_dupe_free()` adds three columns to the output data: | ||
| #' | ||
| #' \itemize{ | ||
| #' \item **`record_id`:** The original row number from the input data frame. | ||
| #' \item **`flag_duplicate`:** A Boolean (`TRUE`/`FALSE`) indicating whether this row | ||
| #' is a duplicate. The first occurrence of each unique combination is `FALSE`, | ||
| #' while subsequent duplicates are `TRUE`. | ||
| #' \item **`duplicate_of`:** For duplicate rows, the `record_id` of the first | ||
| #' occurrence of this combination. `NA` for non-duplicate rows. | ||
| #' } | ||
| #' | ||
| #' @examples | ||
| #' affirm_init(replace = TRUE) | ||
| #' | ||
| #' dplyr::as_tibble(mtcars) |> | ||
| #' dplyr::select(-c(am, vs)) |> | ||
| #' dplyr::arrange(cyl) |> | ||
| #' affirm_dupe_free( | ||
| #' label = "No duplicates in the number of cylinders", | ||
| #' columns = cyl | ||
| #' ) | ||
| #' | ||
| #' affirm_close() | ||
| #' | ||
| affirm_dupe_free <- function(data, | ||
| label, | ||
| columns, | ||
| id = NA_integer_, | ||
| priority = NA_integer_, | ||
| data_frames = NA_character_, | ||
| report_listing = NULL, | ||
| data_action = NULL, | ||
| error = getOption("affirm.error", default = FALSE)) { | ||
| # check and process inputs --------------------------------------------------- | ||
| if (missing(data) || missing(columns) || missing(label)) { | ||
| cli::cli_abort("Arguments {.code data}, {.code label}, and {.code columns} are required.") | ||
| } | ||
| columns <- dplyr::select(data, {{ columns }}) |> colnames() | ||
| if (rlang::is_empty(columns)) { | ||
| cli::cli_abort("The {.code columns} argument must select at least one column from {.code data}.") | ||
| } | ||
| data_action <- rlang::enquo(data_action) | ||
| report_listing <- rlang::enquo(report_listing) | ||
| if (.is_quo_null(report_listing)) | ||
| report_listing <- | ||
| rlang::quo( | ||
| dplyr::mutate(., record_id = dplyr::row_number()) |> | ||
| dplyr::mutate( | ||
| .by = c(all_of(!!columns)), | ||
| row_num = dplyr::row_number(), | ||
| flag_duplicate = .data$row_num != 1, | ||
| duplicate_of = ifelse(.data$flag_duplicate, min(.data$record_id), NA_integer_) | ||
| ) |> | ||
| dplyr::filter(!lgl_condition) |> | ||
| dplyr::select(-"row_num") |> | ||
| dplyr::relocate("flag_duplicate", "duplicate_of", "record_id", .after = last_col()) | ||
| ) |> | ||
| structure(.Environment = rlang::caller_env()) | ||
| | ||
| # construct `condition=` argument -------------------------------------------- | ||
| quo_condition <- | ||
| rlang::quo( | ||
| dplyr::mutate(., record_id = dplyr::row_number()) |> | ||
| dplyr::select(all_of(!!columns), "record_id") |> | ||
| dplyr::mutate( | ||
| .by = c(all_of(!!columns)), | ||
| row_num = dplyr::row_number(), | ||
| # TRUE for the first occurrence of each combination, i.e. the row passes | ||
| is_first_occurrence = .data$row_num == 1 | ||
| ) |> | ||
| dplyr::pull("is_first_occurrence") | ||
| ) |> | ||
| structure(.Environment = rlang::caller_env()) | ||
| | ||
| # Add dupe info to the actual data output ------------------------------------ | ||
| data_out <- | ||
| data |> | ||
| dplyr::mutate(record_id = dplyr::row_number()) |> | ||
| dplyr::mutate( | ||
| .by = c(all_of(!!columns)), | ||
| row_num = dplyr::row_number(), | ||
| flag_duplicate = .data$row_num != 1, | ||
| duplicate_of = ifelse(.data$flag_duplicate, min(.data$record_id), NA_integer_) | ||
| ) |> | ||
| dplyr::select(-"row_num") |> | ||
| dplyr::relocate("flag_duplicate", "duplicate_of", "record_id", .after = last_col()) | ||
| | ||
| # pass arguments to affirm_true() -------------------------------------------- | ||
| affirm_true(data = data_out, | ||
| label = label, | ||
| condition = !!quo_condition, | ||
| id = id, | ||
| priority = priority, | ||
| data_frames = data_frames, | ||
| columns = paste(columns, collapse = ", "), | ||
| report_listing = !!report_listing, | ||
| data_action = !!data_action, | ||
| error = error) | ||
| } |
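The flagging logic in the diff above can be tried outside the package. The following is a minimal sketch, not the package's implementation: it assumes dplyr 1.1.0 or later (for the `.by` argument), and the `toy` data frame and `key_cols` vector are hypothetical stand-ins for `data` and `columns`.

```r
library(dplyr)

# Hypothetical toy data: rows 1 and 3 share the same id/visit combination.
toy <- tibble(
  id    = c("A", "B", "A", "C", "B"),
  visit = c(1L, 1L, 1L, 2L, 2L)
)

# Hypothetical stand-in for the resolved `columns` argument.
key_cols <- c("id", "visit")

flagged <- toy |>
  # Record the original row position before grouping.
  mutate(record_id = row_number()) |>
  mutate(
    .by = all_of(key_cols),
    # Position of the row within its combination of key values.
    row_num = row_number(),
    # Every row after the first in a group is a duplicate.
    flag_duplicate = row_num != 1,
    # Duplicates point back to the first occurrence's record_id.
    duplicate_of = ifelse(flag_duplicate, min(record_id), NA_integer_)
  ) |>
  select(-row_num)

# The condition handed to an affirmation is the negation:
# TRUE when the row is the first occurrence of its combination.
passes <- !flagged$flag_duplicate

print(flagged)
```

Only the third row is flagged here, and its `duplicate_of` points back to row 1. Grouping with `.by` rather than `group_by()` keeps the original row order and leaves the result ungrouped.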
This file was deleted.
Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.
Why did this get added here? This is something we push.
I had updated RStudio around the time I was working on this. That update added a projectID field to the .Rproj file that our current .gitignore isn't catching:
affirm/.gitignore
Lines 1 to 6 in 7b328b2
Should we commit this projectID, or update the .gitignore to exclude .Rproj files entirely? I tried to find some information on this (rstudio/rstudio#15524),
but I'm still fuzzy on what the implications of projectID are.
Let's discuss.