-
Notifications
You must be signed in to change notification settings - Fork 128
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Curate titlecase #1197
Curate titlecase #1197
Conversation
Codecov ReportPatch coverage:
Additional details and impacted files@@ Coverage Diff @@
## master #1197 +/- ##
==========================================
+ Coverage 69.36% 69.60% +0.24%
==========================================
Files 66 67 +1
Lines 7024 7087 +63
Branches 1708 1726 +18
==========================================
+ Hits 4872 4933 +61
- Misses 1847 1848 +1
- Partials 305 306 +1
☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 for learning the ropes of the Augur curate framework with this titlecase
command!
The four commits here feel like they should be just one. Most of the commits after the first are fixes for the first. We also don't typically use the feat:
, fix:
, etc. prefixes in Augur. The loose convention is to prefix with the major or primary "area" that the commit touches; for example curate:
, docs:
, io/metadata:
, etc. Jover's commits introducing the Augur curate commands are a good recent example.
I left a bunch of notes as I read thru the PR below (but haven't run the code myself).
augur/curate/titlecase.py
Outdated
record_id = index | ||
|
||
for field in args.titlecase_fields: | ||
titlecased_string = titlecase(record.get(field, ""), articles, abbreviations) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why fall back to the empty string instead of None? This would end up converting from a null value in the JSON record to an empty string value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, I think we did it this way to be able to check for failures below with
if titlecased_string is None:
Looking at this now, it does seem like a weird side-effect to change null values to empty strings here.
Since the titlecase
function only ever returns None
if the field value is not a string, we could be checking for the value type outside of the function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This would end up converting from a null value in the JSON record to an empty string value.
change null values to empty strings
Sorry it took me a while to circle back on this. Looking at python's dictionary get function, the record.get(field,"")
is checking if a field (e.g. "division") exists but not the field's value (null or otherwise).
Because without changing anything, titlecase errors out as expected on null and bare integer values:
Test case that fails on null
$ echo '{"empty": "", "null_entry":null }' \
> | ${AUGUR} curate titlecase --titlecase-fields "empty" "null_entry"
ERROR: Failed to titlecase 'null_entry':None in record 0
[2]
Test case that fails on a non-string int
$ echo '{"bare_int": 2021}' \
> | ${AUGUR} curate titlecase --titlecase-fields "bare_int"
ERROR: Failed to titlecase 'bare_int':2021 in record 0
[2]
The record.get(field,"")
allows non-existent fields to pass (e.g. "division", "not exist")
Test cases when fields do not exist
Decide if this should error out and might affect configs like:
# https://github.com/nextstrain/monkeypox/blob/9e6c71d039cf4ef30ccb75749b634ffc9f2cf8ac/ingest/config/config.yaml#L31
$ echo '{"region":"europe", "country":"france" }' \
> | ${AUGUR} curate titlecase --titlecase-fields "region" "country" "division" "location" "not exist"
{"region": "Europe", "country": "France", "division": "", "location": "", "not exist": ""}
If the line is changed to record.get(field)
, then non-existent fields will error out.
$ echo '{"region":"europe", "country":"france" }' \
> | ${AUGUR} curate titlecase --titlecase-fields "region" "country" "division" "location" "not exist"
ERROR: Failed to titlecase 'division':None in record 0
I think we want the non-existent fields to error out? This decision may affect how we call augur curate titlecase
in monkeypox and other pathogen ingest pipelines, or maybe not. Open to thoughts and suggestions here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, using record.get(field)
will not allow you to differentiate when a field simply does not exist or when the field value is null. We can separate these two if we want different behaviors or different error messages for each case:
if field not in record:
# do something if field does not exist
elif record[field] is None:
# do something if field value is null
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for working through this @j23414! I hope this was a good small intro for the Augur ecosystem 🙏
I left a couple of minor comments, but the one thing that stands out (thanks to Tom's comment on the None vs empty string values) is that the only reason for failure for this command seems to be when the field value is not a string. With that in mind, I think the help/error messages can be more explicit on requiring titlecase fields to be string values.
I think this also leads to an open question of if the command should handle null/None values separately. I imagine there's use cases for both allowing null values and for erroring on null values in the titlecase fields separate from other type errors? (This doesn't need to be addressed in this PR, just thinking out loud)
augur/curate/titlecase.py
Outdated
record_id = index | ||
|
||
for field in args.titlecase_fields: | ||
titlecased_string = titlecase(record.get(field, ""), articles, abbreviations) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, I think we did it this way to be able to check for failures below with
if titlecased_string is None:
Looking at this now, it does seem like a weird side-effect to change null values to empty strings here.
Since the titlecase
function only ever returns None
if the field value is not a string, we could be checking for the value type outside of the function.
augur/curate/titlecase.py
Outdated
if failure_reporting is DataErrorMethod.WARN: | ||
print_err(f"WARNING: {failure_message}") | ||
|
||
failures.append((record_id, field, record.get(field, ""))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ditto Tom's comment on falling back to an empty string here instead of None.
I think we want to report the exact value we are seeing in the data in our error messages.
9a3e137
to
618391f
Compare
618391f
to
42cdb48
Compare
ac8b0a6
to
6852345
Compare
Thanks for the reviews @tsibley and @joverlee521 ! I think I addressed most edits but let me know if I've misunderstood something. The only remaining thing is handling My main understanding of the ramifications of Do you think
If so, happy to add it as a fixup to commit 6852345 . Otherwise this PR is ready for review. |
I'd think to make non-existent and existent-but- I'd also think that, by default, fields that exist with a non-null value but one that's not a string should produce an error. |
augur/curate/titlecase.py
Outdated
@@ -93,7 +93,7 @@ def run(args, records): | |||
|
|||
for field in args.titlecase_fields: | |||
titlecased_string = titlecase(record.get(field, ""), articles, abbreviations) | |||
field_value = record.get(field, "") | |||
field_value = record.get(field, "Not a field") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using strings like this "Not a field"
as placeholders can be confusing. For example, in the messages below which use {field_value!r}
the placeholder value will represented as if it was the actual value found in the record. Let's fix this while addressing the missing/null-valued field behaviour.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree with you that the placeholder can be confusing. Just documenting my thought process that this was a hacky way to get a descriptive error message below.
$ echo '{"region":"europe", "country":"france" }' \
> | ${AUGUR} curate titlecase --titlecase-fields "region" "country" "division" "location" "not exist"
ERROR: Failed to titlecase 'division':'Not a field' in record 0
[2]
Thanks, I'll revisit this and change it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Clarification question about:
I'd think to make non-existent and existent-but-null-valued fields pass thru unchanged by default.
The original script (transform-string-fields) allows non-existent fields to get passed through and added to the final entry.
$ echo '{"region":"europe", "country":"france" }' \
> | ${AUGUR} curate titlecase --titlecase-fields "region" "country" "division" "location" "not exist"
{"region": "Europe", "country": "France", "division": "", "location": "", "not exist": ""}
Just making sure the above is what you mean by "non-existent pass thru unchanged by default".
Or were you expecting something like below:
$ echo '{"region":"europe", "country":"france" }' \
> | ${AUGUR} curate titlecase --titlecase-fields "region" "country" "division" "location" "not exist"
{"region": "Europe", "country": "France"}
I'm adding these as test cases to make sure we have documented edgecases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd expect the latter, i.e. fields which didn't exist in an input record are not added to output records. Something like:
$ echo '{"a": "hello world", "b": null}' \
> | ${AUGUR} curate titlecase --titlecase-fields a b c
{"a": "Hello World", "b": null}
c4bfd75
to
c87bdd3
Compare
From #1197 (comment), splitting this out to a separate thread:
I've got the the following test which does produce an error:
@tsibley Is there another test case you were referring to by "not a string" & "non-null value"? Happy to add a test, I'm just drawing a blank on examples. |
Nope! That's the sort of thing I meant. I was describing the behaviour I expected, not how it differed from the current behaviour in the PR. I do think that the error message like:
would be improved by explaining why titlecasing failed. |
72e6c95
to
f8b1388
Compare
Thanks @joverlee521 and @tsibley! Added the remaining changes with @joverlee521. augur/tests/functional/curate/cram/titlecase.t Lines 64 to 68 in 4a39f5b
augur/tests/functional/curate/cram/titlecase.t Lines 43 to 47 in 4a39f5b
augur/tests/functional/curate/cram/titlecase.t Lines 50 to 62 in 4a39f5b
This PR is ready for review |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for continue to push on this @j23414 🙏
There's a final bug that we missed that needs to be fixed and tested in the Cram tests, but everything else looks good to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @j23414! This just requires a final rebase to fix the changelog merge conflicts and should be good to merge.
Adds a new sub-command `augur curate titlecase` based on the transform-string-fields script in the monkeypox repo. The `augur curate normalize` sub-command has already been added based on the same script (#1039). Overall this is part of filling in the gaps in the augur curate suite of commands (#860), specifically addressing issue (#999), and is a follow-up to #1039. `augur curate titlecase` would transform the values of a given metadata field to titlecase. This is useful for normalizing the values of a string that may contain inconsistent capitalization such as "North America" and "north america". This commit also adds a test for the new sub-command and updates the documentation. For testing an upper case to lower case circumflex'd o character conversion, had to use the escaped unicode character Co-authored-by: Jover Lee <[email protected]>
Tests a few more valid (and invalid) edge cases for augur curate titlecase. 1. Skip processing fields that have a null value 2. Throw an error on fields that have a bare-int (or non-string) value 3. Skip processing fields that do not exist in the data
Co-authored-by: Jover Lee <[email protected]>
Description of proposed changes
Adds a new sub-command
augur curate titlecase
based on the transform-string-fields script in the monkeypox repo. Theaugur curate normalize
sub-command has already been added based on the same script (#1039).See commits for details.
Related issue(s)
Closes #999
Related to #860
Testing
Added Cram tests for the new sub-command and includes doctests.
Checklist
augur curate titlecase
sub-command.