
tsv cleanup script and workflow #82 (Open)

eharkins wants to merge 23 commits into master from clean-metadata

Conversation

@eharkins (Contributor)

Description of proposed changes

This PR introduces the following changes (hopefully as discussed in #78):

  • A script to automatically sort and clean up the number of tabs in tsv metadata files in this repo (a rough sketch of the idea follows this list)
  • An automated workflow to run that script on source-data/gisaid_annotations.tsv and source-data/location_hierarchy.tsv and commit the resulting changes to the branch whose push triggered the workflow
  • A ton of changes to source-data/gisaid_annotations.tsv and source-data/location_hierarchy.tsv according to the way the script sorted them. Some of these are because things were out of alphabetical order (both source-data/gisaid_annotations.tsv and source-data/location_hierarchy.tsv), and others have to do with comments and splitting of paper annotations (source-data/gisaid_annotations.tsv). I included all these changes since, if we start using this workflow on master, those changes will happen almost immediately anyway, but we will have to refresh them when we are ready to merge. Any feedback on how the new sorting meets the expectations discussed (e.g.) would be appreciated!
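
For a rough sense of what such a cleanup script does, here is a minimal sketch in plain Python (an illustration only, with assumed defaults - not the pandas-based bin/clean-tsv-metadata added in this PR, and it skips any special handling of comment lines):

    import sys

    def clean_tsv(path, n_cols=4, sort_col=0):
        """Pad/trim every row to n_cols fields, sort, and overwrite the file."""
        with open(path) as f:
            rows = [line.rstrip("\n").split("\t") for line in f]
        # Pad short rows with empty fields and drop surplus fields so every row
        # ends up with exactly n_cols columns (i.e. n_cols - 1 tabs).
        rows = [(row + [""] * n_cols)[:n_cols] for row in rows]
        # Sort rows alphabetically by the chosen column.
        rows.sort(key=lambda row: row[sort_col])
        with open(path, "w") as f:
            for row in rows:
                f.write("\t".join(row) + "\n")

    if __name__ == "__main__":
        clean_tsv(sys.argv[1])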

To merge this, we will also need to:

  • change the branch name in the workflow to master
  • rerun the sorting and cleanup according to the latest metadata files.
  • decide whether we want to apply the script (automatically or otherwise) to any manually maintained tsvs in this repo (such as the genbank annotations?)

Related issue(s)

#78

Testing

Tested sorting and these two cases from #78:

  • some rows only have 3 columns instead of 4 (which GitHub currently points out when you view this file in the browser), i.e. a missing final tab
  • some rows have 5 columns instead of 4, i.e. a surplus final tab

@emmahodcroft (Member)

This is looking good @eharkins! Though it's hard to wrap my head around all the reorganisation. Just to clarify, this would run automatically after pushing, for example, new location_hierarchy or gisaid_annotations?
So I suppose we don't want to push this until we've decoupled those changes from triggering new data ingest (pulling from GISAID), or we'll have even more chances to get new data 🙃

A good final test for this branch might be to double-check the metadata files produced from the 'original' and 'reorganised' files and ensure all is coming out the same - that's probably the fastest way to check we didn't accidentally mess anything up!

@eharkins (Author)

Just to clarify, this would run automatically after pushing, for example, new location_hierarchy or gisaid_annotations? So I suppose we don't want to push this until we've decoupled those changes from triggering new data ingest (pulling from GISAID), or we'll have even more chances to get new data 🙃

This is accurate, although if I remember correctly I was only having to cancel ingests upon manually pushing changes to those files, and not when the automated commits from the new GitHub Actions workflow were pushed - this could be different on master since I know there are slight differences in those workflows on master vs other branches. I agree with being 100% sure we don't risk extra ingests, since those are the main bottleneck / slow-down for builds as I see it.

A good final test for this branch might be to double-check the metadata files produced from the 'original' and 'reorganised' files and ensure all is coming out the same - that's probably the fastest way to check we didn't accidentally mess anything up!

Great idea. In order to not mess with "production" builds, would that involve basically running bin/ingest-gisaid but without the steps where the latest metadata is pushed to aws and the results are alerted in slack? I.e. a "local" run of that ingest process which would not affect the outcome of the next "official" ingest?

@eharkins (Author)

Just noting that maybe we can reuse the sorting script to sort some files in the ncov repo as well (e.g. defaults, and some of the files used by the metadata script in that repo).

@emmahodcroft (Member)

I've asked in the 'stop ingest from annotations push' issue if we can go ahead and merge that (I think it seems ready?) - would be nice in any case to stop worrying about new data any time we ingest! And then we could merge this.

Great idea. In order to not mess with "production" builds, would that involve basically running bin/ingest-gisaid but without the steps where the latest metadata is pushed to aws and the results are alerted in slack? I.e. a "local" run of that ingest process which would not affect the outcome of the next "official" ingest?

Yes, exactly. That should generate a metadata file that we can then compare to the 'official' metadata from AWS, and hopefully they'll match exactly and we're good to go! They should be done as closely together as possible to avoid grabbing any new data in one but not the other 🙃

@eharkins (Author)

Yes, exactly. That should generate a metadata file that we can then compare to the 'official' metadata from AWS, and hopefully they'll match exactly and we're good to go!

👍

They should be done as closely together as possible to avoid grabbing any new data in one but not the other

Maybe I can avoid this constraint / risk of fetching new stuff if I just run the transform-gisaid script (which applies the annotations) and the check-locations script?

@emmahodcroft (Member)

I think just starting them roughly at the same time should be fine - you just wouldn't want to compare the AWS run from the afternoon to a fresh download via local code in the evening, that's all! I am never 100% sure exactly what pulls from our local GISAID copy vs GISAID directly, but one way to do it would definitely just be to start a local download the same time you trigger an ingest on AWS. But if you can find a fancier way, that would work too!

@kairstenfay (Contributor) left a comment

Hey Eli,
Thanks for getting started on this.
I see you iterated on the clean-tsv workflow over a few commits here. Would you mind rebasing (e.g. fixing up or squashing) all or most of those clean-tsv edits into one commit and splitting them apart from any metadata TSV changes here?

@kairstenfay (Contributor) left a comment

Hi @eharkins,

I just ran the ./bin/clean-tsv-metadata script locally and here are a few of my observations:

  1. The --metadata option should perhaps be a positional argument. Here's my motivation: because this script is named "clean TSV", it follows, to me, that the first argument should be a TSV. We've taken a similar approach in ./bin/transform-gisaid, where the first (positional) argument is the GISAID data.

  2. If this script is to be used as a general TSV cleaner, I would recommend naming it "clean TSV" instead of suffixing "metadata" to the name. To me, nCoV "metadata" only means GISAID or Genbank metadata. That is, the file produced by a transform script.

  3. When I run ./bin/clean-tsv-metadata data/gisaid/metadata.tsv, I get the following error:

Traceback (most recent call last):
  File "./bin/clean-tsv-metadata", line 45, in <module>
    clean_metadata_file(args.metadata, args.n_cols, args.header)
  File "./bin/clean-tsv-metadata", line 11, in clean_metadata_file
    col_names = list(pd.read_csv(file_name, sep="\t", header=None, nrows=1).iloc[0]) if header else list(range(n_cols))
TypeError: 'NoneType' object cannot be interpreted as an integer

It was not immediately clear to me why I receive this error. I realized it was because I neglected to pass the --header option, which made me wonder if we would expect the default behavior to be that the script respects a header, and instead we invoke an option like --ignore-header when we don't want to read in the first row as a header. This would follow the pandas default for reading in the first row as a header.
Another option would be to assert that either the --header or a valid value for --n-cols is provided, e.g. something like

    assert args.header or args.n_cols, \
        "Either the --header option or a non-zero value for --n-cols option must be provided."

before the first line where we read in a CSV.

  4. I'm not a huge fan of scripts that aren't idempotent. Here we read in a file, edit it in memory, and then rewrite it to the same file path. I can imagine that some people would prefer to inspect the resulting cleaned TSV and compare it with the original (I certainly desire that functionality in testing). My thoughts are to either prompt the user for an output filepath, or to print to stdout and make the user redirect the output to a new file.

Thanks again for your work here, and I greatly appreciate your rebasing these commits. I'll continue to review other parts of this PR, but I wanted to get the conversation started on a few of my initial impressions.

Let me know if you have any questions about what I've written so far, by the way. I am happy to have a conversation around these points and include other Nextstrain members to get their thoughts as well.
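
To make the positional-argument and header-by-default suggestions concrete, one possible shape for the interface is sketched below (the flag names are assumptions, and the assert is the one above adapted to a header-by-default CLI):

    import argparse

    parser = argparse.ArgumentParser(description="Clean and sort a TSV.")
    # Positional TSV argument, mirroring ./bin/transform-gisaid.
    parser.add_argument("tsv", help="path to the TSV to clean")
    parser.add_argument("--n-cols", type=int, help="number of columns to enforce")
    # Respect the header by default (the pandas default); opt out explicitly.
    parser.add_argument("--ignore-header", action="store_true",
                        help="treat the first row as data rather than a header")
    args = parser.parse_args()

    # Without a header we need --n-cols to know how many columns to enforce.
    assert not args.ignore_header or args.n_cols, \
        "Either keep the header (default) or pass a non-zero --n-cols."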

bin/clean-tsv-metadata (inline comment on the line below):

    2. Sort alphabetically and make sure n columns (n-1 tabs) each row
    3. Write out (overwrite) metadata file
    """
    col_names = list(pd.read_csv(file_name, sep="\t", header=None, nrows=1).iloc[0]) if header else list(range(n_cols))

@kairstenfay (Contributor) commented:

It's unclear to me what value reading in the column names (header) separately adds to this script.

If I run pd.read_csv() with header=None, then the columns are automatically named using zero-indexed integers.

Additionally, I can think of an edge case where we may want to specify --header and also enforce a number of columns, which the code as written does not currently support.

I would recommend instead dropping the col_names assignment line, reading in the TSV with the header argument passed, and handling surplus columns in a separate part of the code.
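
A minimal sketch of that single-read approach (assumed option handling, not necessarily how the script ends up written):

    import pandas as pd

    def read_tsv(file_name, header=True, n_cols=None):
        # One read: with header=None pandas names the columns 0..k-1 itself,
        # so a separate read just to build col_names isn't needed.
        df = pd.read_csv(file_name, sep="\t", header=0 if header else None, dtype=str)
        if n_cols is not None:
            # Enforce the column count as a separate step after reading:
            df = df.iloc[:, :n_cols]                # drop surplus columns
            for i in range(df.shape[1], n_cols):    # pad missing columns
                df[f"col_{i}"] = ""
        return df

This would also allow --header and --n-cols to be combined, covering the edge case mentioned above.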

@eharkins (Author) replied:

See f95f2de - I couldn't think of a smarter way than reading the file in a second time.

@kairstenfay (Contributor)

It may be worth documenting somewhere (in the README or in the help description of this command/the n-cols option) how many columns are expected in the usual suspects: gisaid_annotations.tsv and location_hierarchy.tsv.

@eharkins (Author)

@kairstenfay thanks for all the comments! I made almost all of the changes in #82 (review) since I agreed with them overwhelmingly. Instead of --ignore-header, I went with --no-header, since it seems more likely you would be telling the script that no header exists in the file rather than ignoring one that does.

If this looks good, I can test a little bit more and address some of the final to-dos (e.g. those listed in the description above).

@eharkins (Author)

One thing to note about forcing files to have the correct number of columns/tabs, as I do here, is that it doesn't alert us to cases where all the information is there but someone forgot to enter a tab between two columns in a row.
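
A toy illustration of that failure mode (the rows below are hypothetical, not real annotations): a row that lost an internal tab still comes out with the "right" number of columns after padding, so nothing gets flagged.

    n_cols = 4
    good = "StrainA/2020\tEPI_ISL_000001\tgenbank_accession\tAB123456".split("\t")
    bad  = "StrainA/2020\tEPI_ISL_000001 genbank_accession\tAB123456".split("\t")  # one tab missing

    pad = lambda row: (row + [""] * n_cols)[:n_cols]
    print(len(pad(good)), len(pad(bad)))  # both print 4 -- the merged fields go unnoticed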

@kairstenfay (Contributor) left a comment

Thanks for making these changes, Eli! This is looking great. Here are the results of my testing.

  1. GISAID annotations
head -n 20 source-data/gisaid_annotations.tsv > source-data/head_gisaid_annotations.tsv
./bin/clean-tsv source-data/head_gisaid_annotations.tsv temp --n-cols 4

Inspecting temp, I see a first row inserted at the top:
# Unnamed: 1 Unnamed: 2 Unnamed: 3

This behavior is fixed by using the --no-header option, so I believe this test passes ✔️

  2. Location hierarchy
./bin/clean-tsv source-data/head_location_hierarchy.tsv temp --n-cols 4

temp looks great! ✔️ Filling in missing tabs works as expected.

  3. Metadata
head -n 20 data/gisaid/metadata.tsv > data/gisaid/head_metadata.tsv
./bin/clean-tsv data/gisaid/head_metadata.tsv temp

temp looks great! ✔️

  4. Truncating columns w/ --n-cols
head -n 20 data/gisaid/metadata.tsv > data/gisaid/head_metadata.tsv
./bin/clean-tsv data/gisaid/head_metadata.tsv temp --n-cols 2

temp still has all the original columns, not only 2 columns as I would expect. I am unable to produce an example of --n-cols truncating columns in a TSV as we would expect.

  5. Exceeding the number of columns found in the TSV with --n-cols produces this error:
 ./bin/clean-tsv source-data/head_location_hierarchy.tsv temp --no-header --n-cols 5
Traceback (most recent call last):
  File "./bin/clean-tsv", line 69, in <module>
    clean_tsv_file(args.tsv, args.output_file, args.n_cols, not args.no_header, args.sort_col)
  File "./bin/clean-tsv", line 18, in clean_tsv_file
    usecols=list(range(n_cols))) # using first n only; this removes extra tabs
  File "/home/kairsten/miniconda3/lib/python3.7/site-packages/pandas/io/parsers.py", line 676, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/home/kairsten/miniconda3/lib/python3.7/site-packages/pandas/io/parsers.py", line 454, in _read
    data = parser.read(nrows)
  File "/home/kairsten/miniconda3/lib/python3.7/site-packages/pandas/io/parsers.py", line 1133, in read
    ret = self._engine.read(nrows)
  File "/home/kairsten/miniconda3/lib/python3.7/site-packages/pandas/io/parsers.py", line 2037, in read
    data = self._reader.read(nrows)
  File "pandas/_libs/parsers.pyx", line 859, in pandas._libs.parsers.TextReader.read
  File "pandas/_libs/parsers.pyx", line 874, in pandas._libs.parsers.TextReader._read_low_memory
  File "pandas/_libs/parsers.pyx", line 951, in pandas._libs.parsers.TextReader._read_rows
  File "pandas/_libs/parsers.pyx", line 1012, in pandas._libs.parsers.TextReader._convert_column_data
pandas.errors.ParserError: Too many columns specified: expected 5 and found 4

Is this the behavior we desire? Do we ever want to add an entirely blank, extra column? Should we guard this error with a more user-friendly statement?

@kairstenfay (Contributor)

In my rebase, I accidentally caused a huge merge conflict in the automatically cleaned TSVs. I've dropped the automated commit for now until we are sure that the clean script works exactly as expected.

@kairstenfay (Contributor)

One thing to note about forcing files to have the correct number of columns/tabs, as I do here, is that it doesn't alert us to cases where all the information is there but someone forgot to enter a tab between two columns in a row.

Hmm, I feel like there are some important edge cases like that one that could be missed in an automated cleanup. We could post the automatically cleaned differences to Slack for review.

Another edge case would be if someone accidentally entered an extra tab in between two columns and produced, e.g., five columns in one row in the location_hierarchy.tsv. That fifth column would be dropped.

I don't know if it makes sense to write code for all these possible edge cases or to rely on human review of the machine output. Thoughts?

@eharkins (Author) commented Sep 2, 2020

Thanks @kairstenfay!

In my rebase, I accidentally caused a huge merge conflict in the automatically cleaned TSVs. I've dropped the automated commit for now until we are sure that the clean script works exactly as expected.

I've been using the following to accept the versions of those files from master (and then subsequently running the cleaning on them), but removing the automated commit also makes sense until this is ready to merge.

git checkout --theirs source-data/gisaid_annotations.tsv source-data/location_hierarchy.tsv

Truncating columns w/ n-cols

This is fixed in 86c271b:

(nextstrain) MC02T50AUGYGR:ncov-ingest eharkins$ head -n 20 data/gisaid/metadata.tsv > data/gisaid/head_metadata.tsv
./bin/clean-tsv data/gisaid/head_metadata.tsv temp --n-cols 2
(nextstrain) MC02T50AUGYGR:ncov-ingest eharkins$ ./bin/clean-tsv data/gisaid/head_metadata.tsv temp --n-cols 2
(nextstrain) MC02T50AUGYGR:ncov-ingest eharkins$ head temp
strain  virus
Algeria/G0638_2264/2020 ncov
Algeria/G0640_2265/2020 ncov

Exceeding the number of columns found in the TSV with --n-cols produces this error:

The error is more informative and human-friendly now (47668cc):

(nextstrain) MC02T50AUGYGR:ncov-ingest eharkins$ ./bin/clean-tsv data/gisaid/head_metadata.tsv temp --n-cols 100
Too many columns specified: expected 100 and found 26
--n-cols 100 was passed, but there are not this many columns in data/gisaid/head_metadata.tsv. --n-cols can't add extra columns, it just enforces up to the existing number of columns in the tsv.

I can't think of an example where I've wanted to add an extra empty column in the context of the ncov builds, but if this is a function we want then it seems doable!
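
Roughly how a message like that can be produced (a sketch only, not necessarily the code in 47668cc):

    import sys
    import pandas as pd

    def read_with_n_cols(tsv, n_cols):
        try:
            return pd.read_csv(tsv, sep="\t", header=None, dtype=str,
                               usecols=list(range(n_cols)))
        except pd.errors.ParserError as error:
            print(error)  # e.g. "Too many columns specified: expected 100 and found 26"
            sys.exit(f"--n-cols {n_cols} was passed, but there are not this many columns in {tsv}. "
                     "--n-cols can't add extra columns, it just enforces up to the existing "
                     "number of columns in the tsv.")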

Hmm I feel like there are some important edge cases [...] I don't know if it makes sense to write code for all these possible edge cases or to rely on human review of the machine output. Thoughts?

Maybe it's better to just have us run it locally when we are making additions to manually maintained tsvs during the course of a usual ingest. That way we are more likely to review the results of the clean up as a part of the diff we commit including our human-entered annotations, etc. And remove the workflow altogether. I don't want to over-automate things in a way that causes us more trouble than it's worth!

@eharkins (Author) commented Sep 2, 2020

@emmahodcroft @MoiraZuber I talked to @kairstenfay on slack who expressed support for this plan, given the above:

  1. remove the workflow in this PR since maybe it's more trouble than it's worth to accommodate edge cases / alert us to corrections made so we can review them by eye.
  2. Make a final commit when we are ready to merge which cleans/sorts the locations and gisaid annotations.
  3. From there on out, we would use bin/clean-tsv locally to sort and clean locations and annotations as we add them each day before committing.

Kairsten is planning a few edits to the PR, but just wanted to check in about that plan in the meantime.

@kairstenfay (Contributor)

@emmahodcroft @MoiraZuber I talked to @kairstenfay on slack who expressed support for this plan, given the above:

1. remove the workflow in this PR since maybe it's more trouble than it's worth to accommodate edge cases / alert us to corrections made so we can review them by eye.

2. Make a final commit when we are ready to merge which cleans/sorts the locations and gisaid annotations.

3. From there on out, we would use bin/clean-tsv locally to sort and clean locations and annotations as we add them each day before committing.

Kairsten is planning a few edits to the PR, but just wanted to check in about that plan in the meantime.

Thanks for reiterating our Slack convo, @eharkins ! I agree with Eli's intuition that automatically cleaning the TSV and committing its changes without human review may cause more headaches than is worthwhile. To be clear, we are still going to write this clean-tsv script, just not automate it (unless your feedback suggests we should do otherwise).

@kairstenfay force-pushed the clean-metadata branch 2 times, most recently from ea09281 to a90ed8b on September 3, 2020
@emmahodcroft (Member) commented Sep 16, 2020

Hi @kairstenfay

I'm not as familiar with annotations, so I have a few questions here.

I'm sorry! I really wasn't clear here - my apologies. Mostly I was just referring to things there that I thought we should fix ourselves just to make the sorting neater for commented things - like, since they come out alphabetically, we can just try to stick to a format so that it looks tidier after sorting - putting a space after # (since that's what the majority did), and using ## to help keep our 'header' near the bottom, just above the real data. But these were all changes I made already in master and pulled in. These could go into the script but given it's comments I don't think it's so critical - just something we can try to do ourselves (and really, had mostly done anyway) to keep things tidy!

It still needs a little more testing, but feel free to give it a try and see if the sort/no-sort split addresses your concerns!

Oh, that sounds great! I will try to test this tomorrow, then hopefully can use diff to explore any differences :)

I'll have to do a little more testing on this to know for sure, but can you explain what you mean by starting "clean"? Could you also please elaborate on your concerns about losing stuff?

Mostly I suppose I'm just wondering if columns might get stripped or added that will make it harder to detect any mistakes that exist right now, in future. I think if we have anything without the right number of columns at the moment we would be getting a warning - from my looking at the gisaid_transform code today - so I'm actually feeling more relaxed about this now. I guess I was just a little concerned that if we did have something that only has, say, 3 columns now, we could maybe find this more easily (?) now, since it's 'wrong' - but once we have a copy where it has 4 columns due to one being added, it might more easily 'slip through the cracks'. This is rather theoretical - I wasn't sure if this is a real risk or not. But from my look today at transform (sorry I didn't think to do that earlier) I know that we would be getting warnings for anything that has != 4 columns and isn't a comment, which leaves me thinking this isn't a problem!

@kairstenfay (Contributor)

I'm sorry! I really wasn't clear here - my apologies. Mostly I was just referring to things there that I thought we should fix ourselves just to make the sorting neater for commented things - like, since they come out alphabetically, we can just try to stick to a format so that it looks tidier after sorting - putting a space after # (since that's what the majority did), and using ## to help keep our 'header' near the bottom, just above the real data. But these were all changes I made already in master and pulled in. These could go into the script but given it's comments I don't think it's so critical - just something we can try to do ourselves (and really, had mostly done anyway) to keep things tidy!

No worries. These are interesting considerations, and it's worth filing an issue if you eventually desire some of these additional comment-parsing behaviors.

Oh, that sounds great! I will try to test this tomorrow, then hopefully can use diff to explore any differences :)

Great! I look forward to your feedback on whether or not this is a useful tool.

Mostly I suppose I'm just wondering if columns might get stripped or added that will make it harder to detect any mistakes that exist right now, in future. I think if we have anything without the right number of columns at the moment we would be getting a warning - from my looking at the gisaid_transform code today - so I'm actually feeling more relaxed about this now. I guess I was just a little concerned that if we did have something that only has, say, 3 columns now, we could maybe find this more easily (?) now, since it's 'wrong' - but once we have a copy where it has 4 columns due to one being added, it might more easily 'slip through the cracks'. This is rather theoretical - I wasn't sure if this is a real risk or not. But from my look today at transform (sorry I didn't think to do that earlier) I know that we would be getting warnings for anything that has != 4 columns and isn't a comment, which leaves me thinking this isn't a problem!

Ah, I see. So you're rightfully concerned about deleting data hanging out in extra tabs when we're trimming the number of columns down to 4. You've raised this concern before and it makes sense. So long as we run transform before running clean-tsv (or its wrapper script, clean-source-data-tsvs), we should see warnings emitted for misshapen annotations.
For example, this workflow would result in misshapen annotations being flagged in Slack:

  1. edit annotations file
  2. run fetch-and-ingest (which calls transform)
  3. run clean-source-data-tsvs

This second workflow, however, would not emit warnings about misshapen annotations and would "silently" apply changes (although these changes would be viewable in the diff that must be committed).

  1. edit annotations
  2. run clean-source-data-tsvs
  3. run fetch-and-ingest (which calls transform)

@emmahodcroft (Member)

Ok! Took me a minute to get a good testing strategy but what I ended up doing was stripping all the comments at the top (to avoid diffs due to adding tabs to comments) and removing all quotes from titles (to avoid diffs due to this alone) - only then running the script on this version.

Then, running without sorting, things look really good - all I found was that some comments were hidden 'in the data', so these had extra tabs added (not concerning, they just got missed by my 'delete all comments' pass).

When I ran with --sort I did notice a couple things a bit odd - first, the sorting of the 'paper data' at the end seems... odd? Here's the end of the data (it seems to put animal samples last), transitioning to the 'paper data':
[screenshot]

It's ordered Wuhan, Guangdong, Thailand. I can't really tell if this is sorted, or what's happened. It's not in the order it is in the original file! Probably good to check something sensible is happening?

Finally, what should (ideally?) be the header for the 'paper data' was stuck at the end:
[screenshot]

If it's tricky to put this at the beginning of the paper data, I think it would also be fine for it to go at the top of the entire file - but it doesn't really make sense to be at the end :) Also not really sure how it happened!

I'll edit the master gisaid_annotations to move the comments out of the data and move around a couple of other bits that shouldn't be where they are, which will ease any future testing if needed. However, overall I'm feeling pretty good about this (data being dropped/missed) now 👍

I guess the only last remaining unknown is maybe to run gisaid_transform with one of the cleaned gisaid_annotations.tsv data and ensure the lack of quotes & everything doesn't impact how metadata.tsv ends up?


edit annotations file
run fetch-and-ingest (which calls transform)
run clean-source-data-tsvs

I'm happy to propose using this workflow - maybe we can add this to the docs with a brief note on why this is the prescribed order. It doesn't look like we're missing a lot now, though, so that makes me think we have managed to do this ok mostly in the past - hopefully continuing in the future!

@emmahodcroft (Member)

I made the changes to gisaid_annotations on master and have pulled into this branch 👍

@eharkins (Author)

It's ordered Wuhan, Guangdong, Thailand. I can't really tell if this is sorted, or what's happened. It's not in the order it is in the original file! Probably good to check something sensible is happening?

It's sorted by EPI_ISL using --sort-col 1. I did this early on in the workflow - I guess I thought you might want to group annotations this way, to ensure paper_url and title for the same sequence were always next to each other, but I'm not sure this makes any more sense than using strain or something else. Feel free to change this part to accommodate how you want those annotations sorted (maybe even by --sort-col 3; sort them by the value of the annotation, which might be helpful if that is often a unique way to identify the paper; not sure if you like to search by paper or by sequence).
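
For context, roughly what --sort-col does in pandas terms (column indices follow the four-column annotations layout discussed in this thread; treat the details as an assumption):

    import pandas as pd

    # With header=None the columns are named 0..3, so --sort-col takes an index.
    df = pd.read_csv("source-data/gisaid_annotations.tsv", sep="\t",
                     header=None, dtype=str, comment="#")
    by_epi_isl = df.sort_values(by=1)  # --sort-col 1: group rows by EPI_ISL
    by_value = df.sort_values(by=3)    # --sort-col 3: sort by the annotation value instead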

edit annotations file
run fetch-and-ingest (which calls transform)
run clean-source-data-tsvs

What would this look like in practice as it relates to pushing to github / automation? I was imagining two reasons for doing this sorting: 1. to save time adding annotations so that we can just append them to the end of the file in any order before sorting and 2. to have the version on github be sorted so that annotations could be easily found from the version on github. Right now in terms of automation it seems like we would have

(manual): edit annotations file [and then push to github]
(automated on github): run fetch-and-ingest (which calls transform)
(manual after ingest finishes and we handle any warnings from transform about "misshapen annotations"?): run clean-source-data-tsvs
Then would we push again to github (triggering another ingest / transform) to make sure the version on github was sorted?

I was initially imagining doing it this way:

This, second workflow, however, would not emit warnings about misshapen annotations and would "silently" apply changes (although these changes would be viewable in the diff that must be committed).
edit annotations
run clean-source-data-tsvs
run fetch-and-ingest (which calls transform)

but if that defeats the purpose of the transform script's warnings, I agree we should do it the other way (or go with a third option that gets the best of both worlds but requires more work on this PR in order to incorporate the warnings about "misshapen" annotations into the tsv cleaning process before it asserts anything about number of columns).

Finally, what should (ideally?) be the header for the 'paper data' was stuck at the end:

Ah, that's because we are just sticking any line with the word paper in it in that section, and sorting all of them as if they are paper annotations. Two ways to fix that:

  1. be more specific with the grep command by searching for paper_url and then adding another grep to stick the paper annotations header before them
  2. do the separation of paper stuff in python so that we can have it more specialized.

Thanks @kairstenfay and @emmahodcroft for all your work on this - I hope it ends up being a time-saver and an improvement worth all the trouble!

eharkins and others added 23 commits November 2, 2020 10:27
also remove "metadata" language
as it is meant to be all purpose
tsv cleaning script
Read in the target TSV only once, optionally enforcing a certain number
of columns with the --n-cols option. Achieve this by always reading in
the TSV with header=None to enforce an expected number of columns when
given --n-cols. Then, after reading in the data, if a header should be
used (i.e. --no-header is not specified), replace the column names with
the first row.
Prevent a user from passing negative or 0 values for the --n-cols
option.
The existing error message prints the python ValueError after the
helpful, custom error message we've written. This can easily get lost
on the user. Because there's no real need to share the underlying error
(since this error is pretty well scoped), stop printing it to stderr.
Don't exceed 100 characters per line.
Change the positional `output_file` argument to an option. By default,
print the cleaned and sorted TSV to stdout. This is the more common
approach in ncov-ingest when there is only one type of output.
Add a --sort option to the clean-tsv script which only performs a sort
when either it or the --sort-col option is invoked. Removing sorting
allows for easier inspection of what lines changed in a file for
comparison.
Add a bash script for cleaning manually maintained TSVs -- namely,
`./source-data/gisaid_annotations.tsv` and
`./source-data/location_hierarchy.tsv`.

Co-authored-by: eharkins <[email protected]>
This reverts commit 2c8ab37.

We decided that build maintainers want more control over automated
corrections to the manually maintained TSV files. So, let them run the
cleaning script locally instead of on GitHub actions.
@kairstenfay (Contributor)

@emmahodcroft how do you ideally want paper annotations sorted? Do you want to keep them separated within the file from the non-paper annotations?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
No open projects
Status: In Review
Development

Successfully merging this pull request may close these issues.

3 participants