tsv cleanup script and workflow #82
base: master
Conversation
This is looking good @eharkins! Though it's hard to wrap my head around all the reorganisation. Just to clarify, this would run automatically after pushing, for example, a new location_hierarchy or gisaid_annotations? A good final test for this branch might be to double-check the metadata files produced from the 'original' and 'reorganised' files and ensure all is coming out the same - that's probably the fastest way to check we didn't accidentally mess anything up!
This is accurate, although if I remember correctly I was only having to cancel ingests upon manually pushing changes to those files and not when the automated commits from the new GitHub Actions workflow were pushed - this could be different on master since I know there are slight differences in those workflows on master vs other branches. I agree with being 100% sure we don't risk extra ingests since those are the main bottleneck/slowdown for builds as I see it.
Great idea. In order not to mess with "production" builds, would that involve basically running bin/ingest-gisaid but without the steps where the latest metadata is pushed to AWS and the results are announced in Slack? I.e. a "local" run of that ingest process which would not affect the outcome of the next "official" ingest?
Just noting that maybe we can reuse the sorting script to sort some files in the ncov repo as well (e.g. defaults, and some of the files used by the metadata script in that repo).
I've asked in the 'stop ingest from annotations push' issue if we can go ahead and merge that (I think it seems ready?) - would be nice in any case to stop worrying about new data any time we ingest! And then we could merge this.
Yes, exactly. That should generate a metadata file that we can then compare to the 'official' metadata from AWS, and hopefully they'll match exactly and we're good to go! They should be done as closely together as possible to avoid grabbing any new data in one but not the other 🙃
👍
Maybe I can avoid this constraint / risk of fetching new stuff if I just run the transform-gisaid script which applies the annotations and the check-locations script?
I think just starting them roughly at the same time should be fine - you just wouldn't want to compare the AWS run from the afternoon to a fresh download via local code in the evening, that's all! I am never 100% sure exactly what pulls from our local GISAID copy vs GISAID directly, but one way to do it would definitely be to just start a local download at the same time you trigger an ingest on AWS. But if you can find a fancier way, that would work too!
Hey Eli,
Thanks for getting started on this.
I see you iterated on the clean-tsv workflow over a few commits here. Would you mind rebasing (e.g. fixing up or squashing) all or most of those clean-tsv edits into one commit and splitting them apart from any metadata TSV changes here?
Force-pushed from 6359974 to 5b19e00.
Hi @eharkins,
I just ran the `./bin/clean-tsv-metadata` script locally and here are a few of my observations:
- The `--metadata` option should perhaps be a positional argument. Here's my motivation: because this script is named "clean TSV", it follows, to me, that the first argument should be a TSV. We've taken a similar approach in `./bin/transform-gisaid`, where the first (positional) argument is the GISAID data.
- If this script is to be used as a general TSV cleaner, I would recommend naming it "clean TSV" instead of suffixing "metadata" to the name. To me, nCoV "metadata" only means GISAID or GenBank metadata, that is, the file produced by a transform script.
- When I run `./bin/clean-tsv-metadata data/gisaid/metadata.tsv`, I get the following error:
Traceback (most recent call last):
File "./bin/clean-tsv-metadata", line 45, in <module>
clean_metadata_file(args.metadata, args.n_cols, args.header)
File "./bin/clean-tsv-metadata", line 11, in clean_metadata_file
col_names = list(pd.read_csv(file_name, sep="\t", header=None, nrows=1).iloc[0]) if header else list(range(n_cols))
TypeError: 'NoneType' object cannot be interpreted as an integer
It was not immediately clear to me why I received this error. I realized it was because I neglected to pass the `--header` option, which made me wonder whether the expected default behavior should be for the script to respect a header, with an option like `--ignore-header` to invoke when we don't want to read in the first row as a header. This would follow the pandas default of reading in the first row as a header.
Another option would be to assert that either `--header` or a valid value for `--n-cols` is provided, e.g. something like
assert args.header or args.n_cols, \
"Either the --header option or a non-zero value for --n-cols option must be provided."
before the first line where we read in a CSV (see the sketch after this list for how these two alternatives could fit together).
- I'm not a huge fan of scripts that aren't idempotent. Here we read in a file, edit it in memory, and then rewrite it to the same file path. I can imagine that some people would prefer to inspect the resulting, cleaned TSV and compare it with the original (I certainly desire that functionality in testing). My thoughts are to either prompt a user for an output filepath, or to print to stdout and make the user redirect the output to a new file.
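For concreteness, here's a minimal sketch of how those two alternatives could be combined at the argument-parsing level (the option names, such as `--no-header`, and the exact message are illustrative, not the script's actual interface):

```python
import argparse
import sys

parser = argparse.ArgumentParser(description="Clean a TSV file.")
parser.add_argument("tsv", help="path to the TSV to clean")
parser.add_argument("--no-header", action="store_true",
                    help="treat the first row as data rather than column names")
parser.add_argument("--n-cols", type=int,
                    help="expected number of columns (needed when there is no header)")
args = parser.parse_args()

# Mirror the suggested assertion: without a header row we need --n-cols to know the width.
if args.no_header and not args.n_cols:
    sys.exit("Either a header row or a positive value for --n-cols must be provided.")
```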
Thanks again for your work here, and I greatly appreciate your rebasing these commits. I'll continue to review other parts of this PR, but I wanted to get the conversation started on a few of my initial impressions.
Let me know if you have any questions about what I've written so far, by the way. I am happy to have a conversation around these points and include other Nextstrain members to get their thoughts as well.
bin/clean-tsv-metadata (Outdated)
2. Sort alphabetically and make sure n columns (n-1 tabs) each row
3. Write out (overwrite) metadata file
"""
col_names = list(pd.read_csv(file_name, sep="\t", header=None, nrows=1).iloc[0]) if header else list(range(n_cols))
It's unclear to me what value reading in the column names (header) separately adds to this script.
If I run `pd.read_csv()` with `header=None`, then the columns are automatically named using zero-indexed integers.
Additionally, I can think of an edge case where we may want to specify `--header` and also enforce a number of columns, which the code as written does not currently support.
I would recommend instead dropping the `col_names` assignment line, reading in the TSV with the `header` argument passed, and handling surplus columns in a separate part of the code.
See f95f2de; I couldn't think of a smarter way than reading it in a second time.
It may be worth documenting somewhere (in the README or in the help description of this command / the `--n-cols` option) how many columns are expected in the usual suspects: gisaid_annotations.tsv and location_hierarchy.tsv.
@kairstenfay thanks for all the comments! I made almost all of the changes from #82 (review) since I agreed with them overwhelmingly. Instead of `--header`, the script now treats a header as the default and takes `--no-header` to disable that. If this looks good, I can test a little bit more and address some of the final to-dos, e.g.:
One thing to note about forcing files to have the correct number of columns/tabs, as I do here, is that it doesn't alert us to cases where all the information is there but someone forgot to enter a tab between two columns in a row.
Force-pushed from 4fde2e5 to d82cd3a.
Thanks for making these changes, Eli! This is looking great. Here are the results of my testing.
- GISAID annotations
head -n 20 source-data/gisaid_annotations.tsv > source-data/head_gisaid_annotations.tsv
./bin/clean-tsv source-data/head_gisaid_annotations.tsv temp --n-cols 4
Inspecting `temp`, I see a first row inserted at the top:
# Unnamed: 1 Unnamed: 2 Unnamed: 3
This behavior is fixed by using the `--no-header` option, so I believe this test passes ✔️
- Location hierarchy
./bin/clean-tsv source-data/head_location_hierarchy.tsv temp --n-cols 4
`temp` looks great! ✔️ Filling in missing tabs works as expected.
- Metadata
head -n 20 data/gisaid/metadata.tsv > data/gisaid/head_metadata.tsv
./bin/clean-tsv data/gisaid/head_metadata.tsv temp
`temp` looks great! ✔️
- Truncating columns w/ n-cols
head -n 20 data/gisaid/metadata.tsv > data/gisaid/head_metadata.tsv
./bin/clean-tsv data/gisaid/head_metadata.tsv temp --n-cols 2
`temp` still has all the original columns, not only 2 columns as I would expect. I am unable to produce an example of `--n-cols` truncating columns in a TSV as we would expect.
- Exceeding the number of columns found in the TSV with `--n-cols` produces this error:
./bin/clean-tsv source-data/head_location_hierarchy.tsv temp --no-header --n-cols 5
Traceback (most recent call last):
File "./bin/clean-tsv", line 69, in <module>
clean_tsv_file(args.tsv, args.output_file, args.n_cols, not args.no_header, args.sort_col)
File "./bin/clean-tsv", line 18, in clean_tsv_file
usecols=list(range(n_cols))) # using first n only; this removes extra tabs
File "/home/kairsten/miniconda3/lib/python3.7/site-packages/pandas/io/parsers.py", line 676, in parser_f
return _read(filepath_or_buffer, kwds)
File "/home/kairsten/miniconda3/lib/python3.7/site-packages/pandas/io/parsers.py", line 454, in _read
data = parser.read(nrows)
File "/home/kairsten/miniconda3/lib/python3.7/site-packages/pandas/io/parsers.py", line 1133, in read
ret = self._engine.read(nrows)
File "/home/kairsten/miniconda3/lib/python3.7/site-packages/pandas/io/parsers.py", line 2037, in read
data = self._reader.read(nrows)
File "pandas/_libs/parsers.pyx", line 859, in pandas._libs.parsers.TextReader.read
File "pandas/_libs/parsers.pyx", line 874, in pandas._libs.parsers.TextReader._read_low_memory
File "pandas/_libs/parsers.pyx", line 951, in pandas._libs.parsers.TextReader._read_rows
File "pandas/_libs/parsers.pyx", line 1012, in pandas._libs.parsers.TextReader._convert_column_data
pandas.errors.ParserError: Too many columns specified: expected 5 and found 4
Is this the behavior we desire? Do we ever want to add an entirely blank, extra column? Should we guard this error with a more user-friendly statement?
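For what it's worth, one way such a guard could look (a sketch only; the actual fix landed later in 47668cc and may differ):

```python
import sys
import pandas as pd

tsv_path = "source-data/head_location_hierarchy.tsv"  # example input
n_cols = 5                                            # deliberately more columns than the file has

try:
    df = pd.read_csv(tsv_path, sep="\t", header=None, usecols=list(range(n_cols)))
except pd.errors.ParserError:
    # Replace the long pandas traceback with a short, actionable message.
    sys.exit(f"{tsv_path} has fewer than the {n_cols} columns requested via --n-cols; "
             "check the file or lower --n-cols.")
```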
Force-pushed from f1dca14 to f482272.
In my rebase, I accidentally caused a huge merge conflict in the automatically cleaned TSVs. I've dropped the automated commit for now until we are sure that the clean script works exactly as expected.
Hmm, I feel like there are some important edge cases like that one that could be missed in an automated cleanup. We could post the automatically cleaned differences to Slack for review. Another edge case would be if someone accidentally entered an extra tab in between two columns and produced, e.g., five columns in one row in the location_hierarchy.tsv. That fifth column would be dropped. I don't know if it makes sense to write code for all these possible edge cases or to rely on human review of the machine output. Thoughts?
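(As an illustration, a simple pre-check along these lines could surface suspicious rows for human review instead of silently padding or dropping fields; the file name and expected column count below are assumptions.)

```python
expected_cols = 4
with open("source-data/gisaid_annotations.tsv") as fh:
    for line_no, line in enumerate(fh, start=1):
        n_fields = len(line.rstrip("\n").split("\t"))
        if n_fields != expected_cols:
            # Let a human decide whether a tab is missing or an extra one slipped in.
            print(f"line {line_no}: {n_fields} fields, expected {expected_cols}")
```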
Thanks @kairstenfay!
I've been using the following to accept the versions of those files from master (and then subsequently running the cleaning on them), but removing the automated commit also makes sense until this is ready to merge.
This is fixed in 86c271b.
The error is more informative and human-friendly now (47668cc).
I can't think of an example where I've wanted to add an extra empty column in the context of the ncov builds, but if this is a function we want then it seems doable!
Maybe it's better to just have us run it locally when we are making additions to manually maintained TSVs during the course of a usual ingest, and remove the workflow altogether. That way we are more likely to review the results of the cleanup as part of the diff we commit, along with our human-entered annotations, etc. I don't want to over-automate things in a way that causes us more trouble than it's worth!
Force-pushed from 47668cc to 7c06d38.
@emmahodcroft @MoiraZuber I talked to @kairstenfay on Slack, who expressed support for this plan, given the above:
Kairsten is planning a few edits to the PR, but just wanted to check in about that plan in the meantime.
Thanks for reiterating our Slack convo, @eharkins! I agree with Eli's intuition that automatically cleaning the TSV and committing its changes without human review may cause more headaches than is worthwhile. To be clear, we are still going to write this clean-tsv script and run it locally; we're just dropping the automated workflow, right?
Force-pushed from ea09281 to a90ed8b.
Hi @kairstenfay
I'm sorry! I really wasn't clear here - my apologies. Mostly I was just referring to things there that I thought we should fix ourselves just to make the sorting neater for commented things - like, since they come out alphabetically, we can just try to stick to a format so that it looks tidier after sorting - putting a space after # (since that's what the majority did), and using ## to help keep our 'header' near the bottom, just above the real data. But these were all changes I made already in master and pulled in. These could go into the script, but given it's comments I don't think it's so critical - just something we can try to do ourselves (and really, had mostly done anyway) to keep things tidy!
Oh, that sounds great! I will try to test this tomorrow, then hopefully can use diff to explore any differences :)
Mostly I suppose I'm just wondering if columns might get stripped or added that will make it harder to detect any mistakes that exist right now, in the future. I think if we have anything without the right number of columns at the moment we would be getting a warning - from my looking at the transform script's warnings, at least.
No worries. These are interesting considerations, and it's worth filing an issue if you eventually desire some of these additional comment-parsing behaviors.
Great! I look forward to your feedback on whether or not this is a useful tool.
Ah, I see. So you're rightfully concerned about deleting data hanging out in extra tabs when we're trimming the number of columns down to 4. You've raised this concern before and it makes sense. So long as we run the normal ingest workflow, the transform script still warns us about misshapen annotations.
This second workflow, however, would not emit warnings about misshapen annotations and would "silently" apply changes (although these changes would be viewable in the diff that must be committed).
I made the changes to
It's sorted by EPI_ISL using
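(The exact mechanism is cut off above; presumably something roughly like the following, where the column index holding the EPI_ISL identifier is an assumption, since the annotations file has no header row.)

```python
import pandas as pd

annotations = pd.read_csv("source-data/gisaid_annotations.tsv",
                          sep="\t", header=None, dtype=str)
# Sort on the column assumed to hold the EPI_ISL id; mergesort keeps the sort stable.
annotations = annotations.sort_values(by=1, kind="mergesort")
print(annotations.to_csv(sep="\t", header=False, index=False))
```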
What would this look like in practice as it relates to pushing to GitHub / automation? I was imagining two reasons for doing this sorting: 1. to save time adding annotations, so that we can just append them to the end of the file in any order before sorting, and 2. to have the version on GitHub be sorted so that annotations can be easily found there. Right now, in terms of automation, it seems like we would have (manual): edit the annotations file [and then push to GitHub]. I was initially imagining doing it this way:
but if that defeats the purpose of the transform script's warnings, I agree we should do it the other way (or go with a third option that gets the best of both worlds but requires more work on this PR in order to incorporate the warnings about "misshapen" annotations into the TSV cleaning process before it asserts anything about the number of columns).
Ah, that's because we are just putting any line with the word "paper" in it in that section, and sorting all of them as if they are paper annotations. Two ways to fix that:
Thanks @kairstenfay @emmahodcroft for all your work on this; I hope it ends up being a time-saver and an improvement worth all the trouble!
also remove "metadata" language as it is meant to be an all-purpose TSV cleaning script
Read in the target TSV only once, optionally enforcing a certain number of columns with the --n-cols option. Achieve this by always reading in the TSV with header=None to enforce an expected number of columns when given --n-cols. Then, after reading in the data, if a header should be used (i.e. --no-header is not specified), replace the column names with the first row.
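In code, the approach described in that commit message might look roughly like this (the function and variable names are assumptions; only the read-once logic is the point):

```python
import pandas as pd

def clean_tsv_file(tsv, n_cols=None, header=True):
    # Always read with header=None so a --n-cols constraint applies to every row,
    # including the first.
    usecols = list(range(n_cols)) if n_cols else None
    df = pd.read_csv(tsv, sep="\t", header=None, dtype=str, usecols=usecols)

    if header:
        # Promote the first row to column names instead of re-reading the file.
        df.columns = df.iloc[0]
        df = df.iloc[1:].reset_index(drop=True)

    return df
```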
Prevent a user from passing negative or 0 values for the --n-cols option.
The existing error message prints the python ValueError after the helpful, custom error message we've written. This can easily get lost on the user. Because there's no real need to share the underlying error (since this error is pretty well scoped), stop printing it to stderr.
Don't exceed 100 characters per line.
Change the positional `output_file` argument to an option. By default, print the cleaned and sorted TSV to stdout. This is the more common approach in ncov-ingest when there is only one type of output.
Add a --sort option to the clean-tsv script which only performs a sort when either it or the --sort-col option is invoked. Removing sorting allows for easier inspection of what lines changed in a file for comparison.
Add a bash script for cleaning manually maintained TSVs -- namely, `./source-data/gisaid_annotations.tsv` and `./source-data/location_hierarchy.tsv`. Co-authored-by: eharkins <[email protected]>
This reverts commit 2c8ab37. We decided that build maintainers want more control over automated corrections to the manually maintained TSV files. So, let them run the cleaning script locally instead of via GitHub Actions.
Force-pushed from 8d86de8 to 22cef3f.
@emmahodcroft how do you ideally want paper annotations sorted? Do you want to keep them separated within the file from the non-paper annotations?
Description of proposed changes
This PR introduces the following changes (hopefully as discussed in #78):
To merge this, we will also need to:
Related issue(s)
#78
Testing
Tested sorting and these two cases from #78: