Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Curate titlecase #1197

Merged
merged 5 commits into from
Jul 28, 2023
Merged

Curate titlecase #1197

merged 5 commits into from
Jul 28, 2023

Conversation

j23414
Copy link
Contributor

@j23414 j23414 commented Apr 7, 2023

Description of proposed changes

Adds a new sub-command augur curate titlecase based on the transform-string-fields script in the monkeypox repo. The augur curate normalize sub-command has already been added based on the same script (#1039).

See commits for details.

Related issue(s)

Closes #999
Related to #860

Testing

Added Cram tests for the new sub-command and includes doctests.

Checklist

  • Add a message in CHANGES.md about the new augur curate titlecase sub-command.

@j23414 j23414 requested a review from a team April 7, 2023 15:11
@codecov
Copy link

codecov bot commented Apr 7, 2023

Codecov Report

Patch coverage: 96.87% and project coverage change: +0.24% 🎉

Comparison is base (4f5559a) 69.36% compared to head (e091e9c) 69.60%.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1197      +/-   ##
==========================================
+ Coverage   69.36%   69.60%   +0.24%     
==========================================
  Files          66       67       +1     
  Lines        7024     7087      +63     
  Branches     1708     1726      +18     
==========================================
+ Hits         4872     4933      +61     
- Misses       1847     1848       +1     
- Partials      305      306       +1     
Files Changed Coverage Δ
augur/curate/titlecase.py 96.82% <96.82%> (ø)
augur/curate/__init__.py 88.88% <100.00%> (ø)

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@joverlee521 joverlee521 self-requested a review April 7, 2023 19:54
Copy link
Member

@tsibley tsibley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 for learning the ropes of the Augur curate framework with this titlecase command!

The four commits here feel like they should be just one. Most of the commits after the first are fixes for the first. We also don't typically use the feat:, fix:, etc. prefixes in Augur. The loose convention is to prefix with the major or primary "area" that the commit touches; for example curate:, docs:, io/metadata:, etc. Jover's commits introducing the Augur curate commands are a good recent example.

I left a bunch of notes as I read thru the PR below (but haven't run the code myself).

augur/curate/titlecase.py Outdated Show resolved Hide resolved
augur/curate/titlecase.py Outdated Show resolved Hide resolved
augur/curate/titlecase.py Outdated Show resolved Hide resolved
augur/curate/titlecase.py Show resolved Hide resolved
record_id = index

for field in args.titlecase_fields:
titlecased_string = titlecase(record.get(field, ""), articles, abbreviations)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why fall back to the empty string instead of None? This would end up converting from a null value in the JSON record to an empty string value.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I think we did it this way to be able to check for failures below with

if titlecased_string is None:

Looking at this now, it does seem like a weird side-effect to change null values to empty strings here.
Since the titlecase function only ever returns None if the field value is not a string, we could be checking for the value type outside of the function.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would end up converting from a null value in the JSON record to an empty string value.

change null values to empty strings

Sorry it took me a while to circle back on this. Looking at python's dictionary get function, the record.get(field,"") is checking if a field (e.g. "division") exists but not the field's value (null or otherwise).

Because without changing anything, titlecase errors out as expected on null and bare integer values:

Test case that fails on null

  $ echo '{"empty": "", "null_entry":null  }' \
  >   | ${AUGUR} curate titlecase --titlecase-fields "empty" "null_entry"
  ERROR: Failed to titlecase 'null_entry':None in record 0
  [2]

Test case that fails on a non-string int

  $ echo '{"bare_int": 2021}' \
  >   | ${AUGUR} curate titlecase --titlecase-fields "bare_int"
  ERROR: Failed to titlecase 'bare_int':2021 in record 0
  [2]

The record.get(field,"") allows non-existent fields to pass (e.g. "division", "not exist")

Test cases when fields do not exist
Decide if this should error out and might affect configs like:
# https://github.com/nextstrain/monkeypox/blob/9e6c71d039cf4ef30ccb75749b634ffc9f2cf8ac/ingest/config/config.yaml#L31

  $ echo '{"region":"europe", "country":"france" }' \
  >   | ${AUGUR} curate titlecase --titlecase-fields "region" "country" "division" "location" "not exist"
  {"region": "Europe", "country": "France", "division": "", "location": "", "not exist": ""}

If the line is changed to record.get(field), then non-existent fields will error out.

$ echo '{"region":"europe", "country":"france" }' \
  >   | ${AUGUR} curate titlecase --titlecase-fields "region" "country" "division" "location" "not exist"
ERROR: Failed to titlecase 'division':None in record 0

I think we want the non-existent fields to error out? This decision may affect how we call augur curate titlecase in monkeypox and other pathogen ingest pipelines, or maybe not. Open to thoughts and suggestions here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, using record.get(field) will not allow you to differentiate when a field simply does not exist or when the field value is null. We can separate these two if we want different behaviors or different error messages for each case:

if field not in record:
    # do something if field does not exist
elif record[field] is None:
    # do something if field value is null

augur/curate/titlecase.py Outdated Show resolved Hide resolved
tests/functional/curate/cram/titlecase.t Outdated Show resolved Hide resolved
docs/usage/cli/curate/titlecase.rst Outdated Show resolved Hide resolved
tests/functional/curate/cram/titlecase.t Outdated Show resolved Hide resolved
augur/curate/titlecase.py Outdated Show resolved Hide resolved
Copy link
Contributor

@joverlee521 joverlee521 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for working through this @j23414! I hope this was a good small intro for the Augur ecosystem 🙏

I left a couple of minor comments, but the one thing that stands out (thanks to Tom's comment on the None vs empty string values) is that the only reason for failure for this command seems to be when the field value is not a string. With that in mind, I think the help/error messages can be more explicit on requiring titlecase fields to be string values.

I think this also leads to an open question of if the command should handle null/None values separately. I imagine there's use cases for both allowing null values and for erroring on null values in the titlecase fields separate from other type errors? (This doesn't need to be addressed in this PR, just thinking out loud)

augur/curate/titlecase.py Outdated Show resolved Hide resolved
record_id = index

for field in args.titlecase_fields:
titlecased_string = titlecase(record.get(field, ""), articles, abbreviations)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I think we did it this way to be able to check for failures below with

if titlecased_string is None:

Looking at this now, it does seem like a weird side-effect to change null values to empty strings here.
Since the titlecase function only ever returns None if the field value is not a string, we could be checking for the value type outside of the function.

augur/curate/titlecase.py Outdated Show resolved Hide resolved
augur/curate/titlecase.py Outdated Show resolved Hide resolved
if failure_reporting is DataErrorMethod.WARN:
print_err(f"WARNING: {failure_message}")

failures.append((record_id, field, record.get(field, "")))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto Tom's comment on falling back to an empty string here instead of None.

I think we want to report the exact value we are seeing in the data in our error messages.

tests/functional/curate/cram/titlecase.t Show resolved Hide resolved
@j23414 j23414 marked this pull request as draft April 12, 2023 18:19
@j23414 j23414 force-pushed the curate/titlecase branch 2 times, most recently from ac8b0a6 to 6852345 Compare April 21, 2023 23:27
@j23414
Copy link
Contributor Author

j23414 commented Apr 21, 2023

Thanks for the reviews @tsibley and @joverlee521 ! I think I addressed most edits but let me know if I've misunderstood something.

The only remaining thing is handling record.get(field,"") as referenced in the following threads:

My main understanding of the ramifications of record.get(field,"") is in:

Do you think record.get(field,"") should be changed to record.get(field) so that non-existent fields error out?

$ echo '{"region":"europe", "country":"france" }' \
  >   | ${AUGUR} curate titlecase --titlecase-fields "region" "country" "division" "location" "not exist"
ERROR: Failed to titlecase 'division':None in record 0

If so, happy to add it as a fixup to commit 6852345 . Otherwise this PR is ready for review.

@j23414 j23414 marked this pull request as ready for review April 22, 2023 00:05
@tsibley
Copy link
Member

tsibley commented Apr 24, 2023

I'd think to make non-existent and existent-but-null-valued fields pass thru unchanged by default. And then, as Jover mentions, optionally (and not necessarily as part of this PR), allow those cases to instead be reported as warnings or made into fatal errors.

I'd also think that, by default, fields that exist with a non-null value but one that's not a string should produce an error.

@@ -93,7 +93,7 @@ def run(args, records):

for field in args.titlecase_fields:
titlecased_string = titlecase(record.get(field, ""), articles, abbreviations)
field_value = record.get(field, "")
field_value = record.get(field, "Not a field")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using strings like this "Not a field" as placeholders can be confusing. For example, in the messages below which use {field_value!r} the placeholder value will represented as if it was the actual value found in the record. Let's fix this while addressing the missing/null-valued field behaviour.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree with you that the placeholder can be confusing. Just documenting my thought process that this was a hacky way to get a descriptive error message below.

  $ echo '{"region":"europe", "country":"france" }' \
  >   | ${AUGUR} curate titlecase --titlecase-fields "region" "country" "division" "location" "not exist"
  ERROR: Failed to titlecase 'division':'Not a field' in record 0
  [2]

Thanks, I'll revisit this and change it

Copy link
Contributor Author

@j23414 j23414 Apr 25, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clarification question about:

I'd think to make non-existent and existent-but-null-valued fields pass thru unchanged by default.

The original script (transform-string-fields) allows non-existent fields to get passed through and added to the final entry.

  $ echo '{"region":"europe", "country":"france" }' \
  >   | ${AUGUR} curate titlecase --titlecase-fields "region" "country" "division" "location" "not exist"
  {"region": "Europe", "country": "France", "division": "", "location": "", "not exist": ""}

Just making sure the above is what you mean by "non-existent pass thru unchanged by default".

Or were you expecting something like below:

  $ echo '{"region":"europe", "country":"france" }' \
  >   | ${AUGUR} curate titlecase --titlecase-fields "region" "country" "division" "location" "not exist"
  {"region": "Europe", "country": "France"}

I'm adding these as test cases to make sure we have documented edgecases.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd expect the latter, i.e. fields which didn't exist in an input record are not added to output records. Something like:

  $ echo '{"a": "hello world", "b": null}' \
  >   | ${AUGUR} curate titlecase --titlecase-fields a b c
  {"a": "Hello World", "b": null}

@j23414 j23414 force-pushed the curate/titlecase branch 2 times, most recently from c4bfd75 to c87bdd3 Compare April 25, 2023 17:24
@j23414
Copy link
Contributor Author

j23414 commented Apr 25, 2023

From #1197 (comment), splitting this out to a separate thread:

I'd also think that, by default, fields that exist with a non-null value but one that's not a string should produce an error.

I've got the the following test which does produce an error:

Test case that fails on a non-string int

  $ echo '{"bare_int": 2021}' \
  >   | ${AUGUR} curate titlecase --titlecase-fields "bare_int"
  ERROR: Failed to titlecase 'bare_int':2021 in record 0
  [2]

@tsibley Is there another test case you were referring to by "not a string" & "non-null value"? Happy to add a test, I'm just drawing a blank on examples.

@tsibley
Copy link
Member

tsibley commented Apr 25, 2023

@tsibley Is there another test case you were referring to by "not a string" & "non-null value"? Happy to add a test, I'm just drawing a blank on examples.

Nope! That's the sort of thing I meant. I was describing the behaviour I expected, not how it differed from the current behaviour in the PR.

I do think that the error message like:

Failed to titlecase 'bare_int':2021 in record 0

would be improved by explaining why titlecasing failed.

@j23414 j23414 marked this pull request as draft April 28, 2023 15:40
@j23414 j23414 force-pushed the curate/titlecase branch 3 times, most recently from 72e6c95 to f8b1388 Compare July 14, 2023 21:58
@j23414 j23414 marked this pull request as ready for review July 14, 2023 22:20
@j23414
Copy link
Contributor Author

j23414 commented Jul 14, 2023

Thanks @joverlee521 and @tsibley! Added the remaining changes with @joverlee521.

Test cases when fields do not exist, decide if this should error out and may affect ingest pipelines
$ echo '{"region":"europe", "country":"france" }' \
> | ${AUGUR} curate titlecase --titlecase-fields "region" "country" "division" "location" "not exist"
{"region": "Europe", "country": "France"}

Test case that passes on empty or null values
$ echo '{"empty": "", "null_entry":null }' \
> | ${AUGUR} curate titlecase --titlecase-fields "empty" "null_entry"
{"empty": "", "null_entry": null}

Test case that fails on a non-string int
$ echo '{"bare_int": 2021}' \
> | ${AUGUR} curate titlecase --titlecase-fields "bare_int"
ERROR: Failed to titlecase 'bare_int':2021 in record 0 because the value is a <class 'int'> and is not a string.
[2]
Test case that fails on complext types (e.g. arrays)
$ echo '{"an_array": ["hello", "world"]}' \
> | ${AUGUR} curate titlecase --titlecase-fields "an_array"
ERROR: Failed to titlecase 'an_array':['hello', 'world'] in record 0 because the value is a <class 'list'> and is not a string.
[2]

This PR is ready for review

Copy link
Contributor

@joverlee521 joverlee521 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for continue to push on this @j23414 🙏

There's a final bug that we missed that needs to be fixed and tested in the Cram tests, but everything else looks good to me.

tests/functional/curate/cram/titlecase.t Show resolved Hide resolved
augur/curate/titlecase.py Outdated Show resolved Hide resolved
Copy link
Contributor

@joverlee521 joverlee521 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @j23414! This just requires a final rebase to fix the changelog merge conflicts and should be good to merge.

Adds a new sub-command `augur curate titlecase` based on the transform-string-fields
script in the monkeypox repo. The `augur curate normalize` sub-command
has already been added based on the same script (#1039).

Overall this is part of filling in the gaps in the augur curate suite of commands (#860),
specifically addressing issue (#999), and is a follow-up to #1039.

`augur curate titlecase` would transform the values of a given metadata field to titlecase.
This is useful for normalizing the values of a string that may contain inconsistent
capitalization such as "North America" and "north america".

This commit also adds a test for the new sub-command and updates the documentation.
For testing an upper case to lower case circumflex'd o character conversion, had to use
the escaped unicode character

Co-authored-by: Jover Lee <[email protected]>
j23414 and others added 4 commits July 28, 2023 12:09
Tests a few more valid (and invalid) edge cases for augur curate titlecase.

1. Skip processing fields that have a null value
2. Throw an error on fields that have a bare-int (or non-string) value
3. Skip processing fields that do not exist in the data
@j23414 j23414 merged commit 632585c into master Jul 28, 2023
26 checks passed
@j23414 j23414 deleted the curate/titlecase branch July 28, 2023 19:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
No open projects
Development

Successfully merging this pull request may close these issues.

augur curate titlecase
3 participants