`affirm_dupe_free` updates #46

Meghansaha · 2025-09-11T18:40:50Z

What changes are proposed in this pull request?

Updated affirm_no_dupes() to add record_id and flag_duplicate columns to the output, providing better visibility into duplicate detection logic.

If there is a GitHub issue associated with this pull request, please provide a link.

Closes #42

Reviewer Checklist (if an item does not apply, mark it as complete)

Ensure all package dependencies are installed by running renv::install()
PR branch has pulled the most recent updates from master branch. Ensure the pull request branch and your local version match and both have the latest updates from the master branch.
If a new function was added, function included in _pkgdown.yml
If a bug was fixed, a unit test was added for the bug check
Run pkgdown::build_site(). Check the R console for errors, and review the rendered website.
Code coverage is suitable for any new functions/features. Review coverage with withr::with_envvar(new = c("NOT_CRAN" = "true"), covr::report()). Begin in a fresh R session without any packages loaded.
R CMD Check runs without errors, warnings, and notes
usethis::use_spell_check() runs with no spelling errors in documentation

When the branch is ready to be merged into master:

Update NEWS.md with the changes from this pull request under the heading "# pcctc (development version)". If there is an issue associated with the pull request, reference it in parentheses at the end of the update (see NEWS.md for examples).
Increment the version number using usethis::use_version(which = "dev")
Run usethis::use_spell_check() again
Approve Pull Request
Merge the PR. Please use "Squash and merge".
Execute pkgdown::deploy_to_branch() to refresh website to latest version.

Meghansaha · 2025-09-12T19:10:53Z

Important

@shannonpileggi, please review this update to affirm_dupe_free() when time allows.

Per discussion in #42, I've modified affirm_dupe_free() (previously named affirm_no_dupes()) to add three new columns (flag_duplicate, duplicate_of, record_id) to both the data output and exported Excel report.

Key Files Changed:

affirm_dupe_free.R - Diff Here
tests/testthat/test-affirm_dupe_free.R - Diff Here
NEWS.md - Diff Here
DESCRIPTION (version bump), plus man (.Rd) and snapshot files for the function rename

Implementation Notes

Renamed affirm_no_dupes() to affirm_dupe_free() and inverted the original function logic in the condition quosure to ensure accurate console messaging, since all affirm functions wrap affirm_true().
Added duplicate_of variable showing the record_id of the first occurrence for each duplicate row. Since report listings only show flagged rows, this helps trace duplicates back to the original occurrence in the full dataset.
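To make the three new columns concrete, here is a minimal dplyr sketch of how they could be derived. It is an illustration only, not the code in affirm_dupe_free.R: the helper name `flag_dupes_sketch()` is hypothetical, and it assumes record_id is simply the row position in the input data.

```r
library(dplyr)

# Hypothetical helper for illustration only; not the affirm implementation.
flag_dupes_sketch <- function(data, columns) {
  data |>
    # Assumption: record_id is the row position in the full input data
    mutate(record_id = row_number()) |>
    # Group on the selected columns, never on record_id itself
    group_by(across(c({{ columns }}, -record_id))) |>
    mutate(
      # Every occurrence after the first within a group is a duplicate
      flag_duplicate = row_number() > 1L,
      # Point each duplicate back to the first occurrence in its group
      duplicate_of = if_else(flag_duplicate, min(record_id), NA_integer_)
    ) |>
    ungroup() |>
    # Put record_id last, matching the column order shown in this PR
    relocate(record_id, .after = last_col())
}

df_sketch <- tibble::tibble(
  subject = c("a", "b", "c", "d", "a", "c"),
  grade   = c(1, 1, 3, 4, 1, 3)
)

flag_dupes_sketch(df_sketch, everything())
```

In this sketch the first occurrence of each combination keeps flag_duplicate = FALSE and duplicate_of = NA, so only the later copies would appear in a listing of flagged rows.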

Validation

R CMD Check passes:

Test coverage maintained:

Examples

With duplicates:

library(devtools)
load_all()
library(dplyr)

# Initialize affirm session
affirm_init(replace = TRUE)

df_test <-
  tibble::tibble(
    subject = c("123-45-678", "234-56-789", "345-67-890", "375-13-532", "123-45-678", "345-67-890"),
    event = c("nausea", "pain", "covid", "cardiac arrest", "nausea", "covid"),
    grade  = c(1,1,3,4,1,3),
    date = c("2025-02-01", "2025-01-23", "2024-05-08", "2023-06-07", "2025-02-01", "2024-05-08")
  )

# When duplicates are present
df_test |>
  affirm_dupe_free(
    label = "No duplicates event and dates for subjects",
    columns = everything()
  )

# Export the report
affirm_report_excel(
  "test.xlsx", affirmation_name = "df_1"
)

Returns:

• No duplicates event and dates for subjects
  2 issues identified.
# A tibble: 6 × 7
  subject    event          grade date       flag_duplicate duplicate_of record_id
  <chr>      <chr>          <dbl> <chr>      <lgl>                 <int>     <int>
1 123-45-678 nausea             1 2025-02-01 FALSE                    NA         1
2 234-56-789 pain               1 2025-01-23 FALSE                    NA         2
3 345-67-890 covid              3 2024-05-08 FALSE                    NA         3
4 375-13-532 cardiac arrest     4 2023-06-07 FALSE                    NA         4
5 123-45-678 nausea             1 2025-02-01 TRUE                      1         5
6 345-67-890 covid              3 2024-05-08 TRUE                      3         6

Without duplicates:

# Initialize affirm session
affirm_init(replace = TRUE)

# When no duplicates are present
df_test |>
  distinct() |>
  affirm_dupe_free(
    label = "No duplicates event and dates for subjects",
    columns = everything()
  )

# Export the report
affirm_report_excel(
  "test2.xlsx", affirmation_name = "df_2"
)

Returns:

• No duplicates event and dates for subjects
  0 issues identified.
# A tibble: 4 × 7
  subject    event          grade date       flag_duplicate duplicate_of record_id
  <chr>      <chr>          <dbl> <chr>      <lgl>                 <int>     <int>
1 123-45-678 nausea             1 2025-02-01 FALSE                    NA         1
2 234-56-789 pain               1 2025-01-23 FALSE                    NA         2
3 345-67-890 covid              3 2024-05-08 FALSE                    NA         3
4 375-13-532 cardiac arrest     4 2023-06-07 FALSE                    NA         4

Excel report (with duplicates):

Excel report (without duplicates):

Returns no rows, consistent with other affirm functions.

shannonpileggi

hey @Meghansaha! this is looking better. however, it seems that the columns argument is used inconsistently across functions.

affirm_true, columns = represents "string of column names used in affirmation"

it isn't super clear to me what this means and we don't have a lot of documentation/testing on this, would be good to flesh out

affirm_dupe_free, columns = represents "columns to check duplicates among"

affirm_class, columns = represents "columns to check class"

it is probably best to discuss

what should be expected behavior across functions
what can/should go in this PR, vs go in another PR

Example 1

df_test <- mtcars |>
  tibble::rownames_to_column() |> 
  dplyr::arrange(cyl, hp) 


options(
  'affirm.id_cols' =  c(
    "rowname", "mpg"
  )
)

affirm_init(replace = TRUE)

df_test |>
  affirm_dupe_free(
    label = "No duplicates in the number of cylinders & hp",
    columns = c(cyl,hp),
    id = 1,
    data_frames = "mtcars"
  )


df_test |> 
  affirm_true(
    label = "mpg <= 30",
    condition = mpg <= 30,
    id = 2,
    data_frames = "mtcars",
    # these do not work
    #columns = c("rowname", "mpg"),
    #columns = everything()
  )

affirm_report_excel("test.xlsx")

Affirmation 1 (affirm_dupe_free) - checks cyl & hp for duplicates; ignores affirm.id_cols in output creation; outputs flag_duplicate which is inconsistent with affirmation 2 (no flag output)

Affirmation 2 (affirm_true) - columns argument doesn't work; affirm.id_cols respected; specification of affirm.id_cols differs from columns specification in affirm_dupe_free (bare vs quoted variable names)
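For discussion, one possible way to reconcile the bare vs quoted specifications is to resolve every columns input through tidyselect. The sketch below is an assumption about a possible direction, not existing affirm behavior; `resolve_columns_sketch()` is a hypothetical helper, and `id_cols` stands in for a character vector such as the one stored in the affirm.id_cols option.

```r
library(dplyr)

# Hypothetical helper, not part of affirm: turns any tidyselect
# specification into a character vector of column names.
resolve_columns_sketch <- function(data, columns) {
  names(select(data, {{ columns }}))
}

# Bare names, as currently passed to affirm_dupe_free()
resolve_columns_sketch(mtcars, c(cyl, hp))

# Quoted names, as currently stored in the affirm.id_cols option
resolve_columns_sketch(mtcars, c("cyl", "hp"))

# A character vector held in a variable needs all_of()
id_cols <- c("mpg", "cyl")
resolve_columns_sketch(mtcars, all_of(id_cols))

# Selection helpers resolve the same way
resolve_columns_sketch(mtcars, everything())
```

If something like this were adopted, both the columns argument and the id columns option could be normalized to the same character vector before the report is built.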

Example 2

affirm_init(replace = TRUE)

affirm_class(
  dplyr::as_tibble(iris),
  label = "all cols are numeric (but Species really isn't)",
  columns = everything(),
  data_frames = "iris",
  id = 1,
  class = "numeric"
)

affirm_report_excel("test_iris.xlsx")

affirm_class() behaves more like affirm_dupe_free(). all columns are tested but not output.

shannonpileggi · 2025-10-20T16:41:01Z

.gitignore

 .Ruserdata
 docs
 inst/doc
+affirm.Rproj


why did this get added here? this is something we push

I had updated RStudio around the time I was working on this. That update added a projectID field to the .Rproj file that our current .gitignore isn't catching:

affirm/.gitignore

Lines 1 to 6 in 7b328b2

.Rproj.user
.Rhistory
.RData
.Ruserdata
docs
inst/doc

Should we commit this projectID or update the .gitignore to exclude .Rproj files entirely? I tried to find some info on this: rstudio/rstudio#15524

But still fuzzy on what the implications of projectID are.

Let's discuss.

Meghansaha added 6 commits September 11, 2025 14:36

updated affirm_no_dupes data and report output

ff6a18c

Update affirm_report.png

28e85df

cleanup undefined global variables

89e2432

site/doc updates

3845838

report listing tweaks

daf0863

update doc title

44ddfe9

Meghansaha marked this pull request as ready for review September 12, 2025 19:10

Meghansaha added 4 commits September 15, 2025 14:00

add in duplicate_of

ce77fc9

rebuild NEWS.md

b348fdf

rename affirm_no_dupes to affirm_dupe_free

8d8248f

cleanup

fdd42fd

Meghansaha changed the title ~~affirm_no_dupes updates~~ affirm_dupe_free updates Oct 1, 2025

version bump and tweak news

ea4b9a3

shannonpileggi requested changes Oct 20, 2025


Meghansaha marked this pull request as draft October 30, 2025 19:33
