@Meghansaha
Contributor

@Meghansaha Meghansaha commented Sep 11, 2025

What changes are proposed in this pull request?

Updated affirm_no_dupes() to add record_id and flag_duplicate columns to the output, providing better visibility into duplicate detection logic.

If there is a GitHub issue associated with this pull request, please provide a link.

Closes #42


Reviewer Checklist (if an item does not apply, mark it as complete)

  • Ensure all package dependencies are installed by running renv::install()
  • PR branch has pulled the most recent updates from master branch. Ensure the pull request branch and your local version match and both have the latest updates from the master branch.
  • If a new function was added, function included in _pkgdown.yml
  • If a bug was fixed, a unit test was added for the bug check
  • Run pkgdown::build_site(). Check the R console for errors, and review the rendered website.
  • Code coverage is suitable for any new functions/features. Review coverage with withr::with_envvar(new = c("NOT_CRAN" = "true"), covr::report()). Begin in a fresh R session without any packages loaded.
  • R CMD Check runs without errors, warnings, and notes
  • usethis::use_spell_check() runs with no spelling errors in documentation

When the branch is ready to be merged into master:

  • Update NEWS.md with the changes from this pull request under the heading "# pcctc (development version)". If there is an issue associated with the pull request, reference it in parentheses at the end of the update (see NEWS.md for examples).
  • Increment the version number using usethis::use_version(which = "dev")
  • Run usethis::use_spell_check() again
  • Approve Pull Request
  • Merge the PR. Please use "Squash and merge".
  • Execute pkgdown::deploy_to_branch() to refresh website to latest version.

@Meghansaha
Contributor Author

Meghansaha commented Sep 12, 2025

Important

@shannonpileggi, please review this update to affirm_dupe_free() when time allows.

Per discussion in #42, I've modified affirm_dupe_free() (previously named affirm_no_dupes()) to add three new columns (flag_duplicate, duplicate_of, record_id) to both the data output and exported Excel report.



Key Files Changed:

  • affirm_dupe_free.R - Diff Here
  • tests/testthat/test-affirm_dupe_free.R - Diff Here
  • NEWS.md - Diff Here
  • DESCRIPTION (version bump), plus man (.Rd) and snapshot files for the function rename




Implementation Notes

  • Renamed affirm_no_dupes() to affirm_dupe_free() and inverted the original function logic in the condition quosure to ensure accurate console messaging, since all affirm functions wrap affirm_true().

  • Added duplicate_of variable showing the record_id of the first occurrence for each duplicate row. Since report listings only show flagged rows, this helps trace duplicates back to the original occurrence in the full dataset.
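
For readers who want to see the idea behind the three columns, here is a minimal dplyr sketch. It is not the code in affirm_dupe_free.R: the helper name add_dupe_columns() and its cols argument (a character vector of column names) are hypothetical, chosen only for illustration.

```r
library(dplyr)

# Hypothetical helper: a sketch of the column logic, not the package's code.
# `cols` is assumed to be a character vector of the columns to check.
add_dupe_columns <- function(data, cols = names(data)) {
  force(cols)  # resolve the default before any new columns are added
  data |>
    mutate(record_id = row_number()) |>        # position in the full dataset
    group_by(across(all_of(cols))) |>
    mutate(
      # every row after the first in its group counts as a duplicate
      flag_duplicate = row_number() > 1L,
      # duplicates point back to the record_id of the first occurrence
      duplicate_of = if_else(flag_duplicate, first(record_id), NA_integer_)
    ) |>
    ungroup() |>
    relocate(record_id, .after = last_col())
}

df_sketch <- tibble::tibble(
  subject = c("123-45-678", "234-56-789", "345-67-890",
              "375-13-532", "123-45-678", "345-67-890"),
  event = c("nausea", "pain", "covid", "cardiac arrest", "nausea", "covid")
)

out <- add_dupe_columns(df_sketch)
out
```

In this sketch a row passes when flag_duplicate is FALSE, so only rows 5 and 6 are flagged, and their duplicate_of values point back to rows 1 and 3.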




Validation

R CMD Check passes:

image

Test coverage maintained:

image




Examples

With duplicates:

library(devtools)
load_all()
library(dplyr)

# Initialize affirm session
affirm_init(replace = TRUE)

df_test <-
  tibble::tibble(
    subject = c("123-45-678", "234-56-789", "345-67-890", "375-13-532", "123-45-678", "345-67-890"),
    event = c("nausea", "pain", "covid", "cardiac arrest", "nausea", "covid"),
    grade  = c(1,1,3,4,1,3),
    date = c("2025-02-01", "2025-01-23", "2024-05-08", "2023-06-07", "2025-02-01", "2024-05-08")
  )

# When duplicates are present
df_test |>
  affirm_dupe_free(
    label = "No duplicates event and dates for subjects",
    columns = everything()
  )

# Export the report
affirm_report_excel(
  "test.xlsx", affirmation_name = "df_1"
)

Returns:

No duplicates event and dates for subjects
  2 issues identified.
# A tibble: 6 × 7
  subject    event          grade date       flag_duplicate duplicate_of record_id
  <chr>      <chr>          <dbl> <chr>      <lgl>                 <int>     <int>
1 123-45-678 nausea             1 2025-02-01 FALSE                    NA         1
2 234-56-789 pain               1 2025-01-23 FALSE                    NA         2
3 345-67-890 covid              3 2024-05-08 FALSE                    NA         3
4 375-13-532 cardiac arrest     4 2023-06-07 FALSE                    NA         4
5 123-45-678 nausea             1 2025-02-01 TRUE                      1         5
6 345-67-890 covid              3 2024-05-08 TRUE                      3         6

Without duplicates:

# Initialize affirm session
affirm_init(replace = TRUE)

# When no duplicates are present
df_test |>
  distinct() |>
  affirm_dupe_free(
    label = "No duplicates event and dates for subjects",
    columns = everything()
  )

# Export the report
affirm_report_excel(
  "test2.xlsx", affirmation_name = "df_2"
)

Returns:

No duplicates event and dates for subjects
  0 issues identified.
# A tibble: 4 × 7
  subject    event          grade date       flag_duplicate duplicate_of record_id
  <chr>      <chr>          <dbl> <chr>      <lgl>                 <int>     <int>
1 123-45-678 nausea             1 2025-02-01 FALSE                    NA         1
2 234-56-789 pain               1 2025-01-23 FALSE                    NA         2
3 345-67-890 covid              3 2024-05-08 FALSE                    NA         3
4 375-13-532 cardiac arrest     4 2023-06-07 FALSE                    NA         4

Excel report (with duplicates):

image

Excel report (without duplicates):

image

Returns no rows, consistent with other affirm functions.

@Meghansaha Meghansaha marked this pull request as ready for review September 12, 2025 19:10
@Meghansaha Meghansaha changed the title from "affirm_no_dupes updates" to "affirm_dupe_free updates" Oct 1, 2025
Contributor

@shannonpileggi shannonpileggi left a comment

hey @Meghansaha! this is looking better. however, it seems that the columns argument is used inconsistently across functions.

affirm_true, columns = represents "string of column names used in affirmation"

  • it isn't super clear to me what this means and we don't have a lot of documentation/testing on this, would be good to flesh out

affirm_dupe_free, columns = represents "columns to check duplicates among"

affirm_class, columns = represents "columns to check class"

it is probably best to discuss

  1. what should be expected behavior across functions
  2. what can/should go in this PR, vs go in another PR

Example 1

df_test <- mtcars |>
  tibble::rownames_to_column() |> 
  dplyr::arrange(cyl, hp) 


options(
  'affirm.id_cols' =  c(
    "rowname", "mpg"
  )
)

affirm_init(replace = TRUE)

df_test |>
  affirm_dupe_free(
    label = "No duplicates in the number of cylinders & hp",
    columns = c(cyl,hp),
    id = 1,
    data_frames = "mtcars"
  )


df_test |> 
  affirm_true(
    label = "mpg <= 30",
    condition = mpg <= 30,
    id = 2,
    data_frames = "mtcars",
    # these do not work
    #columns = c("rowname", "mpg"),
    #columns = everything()
  )

affirm_report_excel("test.xlsx")

Affirmation 1 (affirm_dupe_free) - checks cyl & hp for duplicates; ignores affirm.id_cols in output creation; outputs flag_duplicate which is inconsistent with affirmation 2 (no flag output)

Affirmation 2 (affirm_true) - columns argument doesn't work; affirm.id_cols respected; specification of affirm.id_cols differs from columns specification in affirm_dupe_free (bare vs quoted variable names)
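
As one possible starting point for that discussion, the sketch below shows how a single tidyselect-based helper could accept bare names, quoted names, and selection helpers alike. The helper resolve_columns() is hypothetical and is not part of affirm; the sketch only assumes the tidyselect and rlang packages, which dplyr already depends on.

```r
library(dplyr)

# Hypothetical helper (not in affirm): resolve a tidyselect expression to a
# character vector of column names, whatever form the caller used.
resolve_columns <- function(data, columns) {
  names(tidyselect::eval_select(rlang::enquo(columns), data))
}

bare   <- resolve_columns(mtcars, c(cyl, hp))              # bare names
quoted <- resolve_columns(mtcars, all_of(c("cyl", "hp")))  # quoted names
all    <- resolve_columns(mtcars, everything())            # selection helper

identical(bare, quoted)
```

If every affirm function resolved columns this way, a character vector such as the one stored in affirm.id_cols could be passed through all_of() and treated the same as bare names.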

Example 2

affirm_init(replace = TRUE)

affirm_class(
  dplyr::as_tibble(iris),
  label = "all cols are numeric (but Species really isn't)",
  columns = everything(),
  data_frames = "iris",
  id = 1,
  class = "numeric"
)

affirm_report_excel("test_iris.xlsx")

affirm_class() behaves more like affirm_dupe_free(). all columns are tested but not output.

.Ruserdata
docs
inst/doc
affirm.Rproj
Contributor

why did this get added here? this is something we push

Contributor Author

I had updated RStudio around the time I was working on this. That update added a projectID field to the .Rproj file that our current .gitignore isn't catching:

affirm/.gitignore

Lines 1 to 6 in 7b328b2

.Rproj.user
.Rhistory
.RData
.Ruserdata
docs
inst/doc

Should we commit this projectID or update the .gitignore to exclude .Rproj files entirely? I tried to find some info on this: rstudio/rstudio#15524

But I'm still fuzzy on what the implications of projectID are.

Let's discuss.

@Meghansaha Meghansaha marked this pull request as draft October 30, 2025 19:33
Development

Successfully merging this pull request may close these issues.

Improve default affirm_no_dupes outputs

2 participants