Update ingest for VIDRL flat files #164

Draft
wants to merge 13 commits into
base: master


joverlee521
Contributor

@joverlee521 joverlee521 commented Oct 17, 2024

Description of proposed changes

Update ingest of VIDRL flat files for the latest version available via OneDrive.

Below is the example command I've been running during testing to upload to the test_tdb database.
This automatically ingests both the _flat_file and the matching _reference_panel file if the _reference_panel file exists.

envdir ../env.d/seasonal-flu/ \
    python3 tdb/vidrl_upload.py \
        -d test_tdb \
        -v flu \
        --subtype h3n2 \
        --assay_type hi \
        --path ~/Documents/WHO\ CC\ Melb\ antigenic\ data/oneDrive\ flat\ files/H3/HI/ \
        --fstem 0902.xlsx_H3_flat_file \
        --ftype flat

Related issue(s)

Resolves #161

TODOs

  • resolve human serum vaccine strain and reference passage mismatch
  • resolve extra records in _reference_panel file with pool suffix in reference strain
  • resolve duplicate records from a/b _reference_panel files
  • resolve mismatch in strain names (Slack thread)
  • verify that converting titer values from >10240 to 20480 is okay (Slack thread)
  • resolve missing data in flat files compared to Excel (Slack thread)
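
For reference, the titer conversion mentioned in the TODOs could look something like the sketch below. The only case confirmed in this PR is `>10240` becoming `20480`. Treating every `>N` value as `2 * N` is an assumption, and `convert_titer` is a hypothetical name rather than a function in `tdb/vidrl_upload.py`.

```python
def convert_titer(raw: str) -> int:
    """Convert a raw titer string to an integer.

    Assumption: a censored value such as ">10240" is recorded as the next
    two-fold dilution (20480). Other censored forms are not handled here.
    """
    raw = raw.strip()
    if raw.startswith(">"):
        return int(raw[1:]) * 2
    return int(raw)
```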

@joverlee521 force-pushed the vidrl-flat-file branch 2 times, most recently from 614afad to f85a2e2 (November 6, 2024 00:44)
The column map will be more complicated with the need to ingest two
slightly different flat files (_flat_file.csv and _reference_panel.csv)
as discussed in #161 (comment).

I also found myself constantly toggling back and forth between the
separate column_map.tsv and the upload script to figure out how the
columns are being used, so it makes more sense to just hard-code the
column map in the script.
Update column map based on `0906.xlsx_H1_flat_file.csv` in comparison
to the matching Excel file `20240906\ H1N1.xlsx` available on
VIDRL's OneDrive.
Avoid pandas typing issues by just using the Python csv module
to read and write the flat files. Mimics `augur curate` with independent
functions for reading, curating, and writing records.
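
As a rough illustration of the read/curate/write split described above, here is a minimal sketch using only the `csv` module. The function names, the column names, and the choice of TSV output are assumptions for the example and may not match the actual script.

```python
import csv
import io
from typing import Iterable, Iterator


def read_records(handle) -> Iterator[dict]:
    """Yield each CSV row as a dict of strings, with no type inference."""
    yield from csv.DictReader(handle)


def curate_records(records: Iterable[dict]) -> Iterator[dict]:
    """Hypothetical curation step: trim whitespace from every value."""
    for record in records:
        yield {key: (value or "").strip() for key, value in record.items()}


def write_records(records: Iterable[dict], handle) -> None:
    """Write records as TSV, taking the header from the first record."""
    writer = None
    for record in records:
        if writer is None:
            writer = csv.DictWriter(
                handle,
                fieldnames=list(record),
                delimiter="\t",
                lineterminator="\n",
            )
            writer.writeheader()
        writer.writerow(record)


# Values stay strings end to end, so "0160" keeps its leading zero.
source = io.StringIO("virus_strain,titer\n A/Example/1/2024 ,0160\n")
output = io.StringIO()
write_records(curate_records(read_records(source)), output)
```

Because each stage is a generator, records stream through one at a time and every value remains the string that was in the file.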
Doing this in preparation for processing the flat files that include
human sera measurements. The human serum ids will be parsed the same way
for the flat files to ensure that we use the same standardized id.
Strip the "pool" suffix from the serum strain name, standardize the
egg or cell type, and standardize the serum id.

While looking into this change, I discovered that the strain name used
for the human sera references in H1 and H3 is the egg vaccine strain
regardless of passage annotation. Currently unclear if this is an error
in the flat files or if we've misunderstood the passage annotations for
human sera data. Once we clear this up, we should add some type of
vaccine strain verification so that we can flag mismatches like this
automatically.
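
A minimal sketch of the suffix stripping and passage standardization, under stated assumptions: the exact form of the "pool" suffix and the raw egg/cell labels are not shown in this PR, so the regular expression and the label table below are hypothetical.

```python
import re

# Assumption: the suffix is the word "pool" preceded by at least one
# separator. Requiring the separator avoids truncating a strain name
# that merely ends in "pool" (for example a place name).
POOL_SUFFIX = re.compile(r"[\s_-]+pool$", re.IGNORECASE)

# Hypothetical raw labels mapped to standardized passage categories.
PASSAGE_LABELS = {"e": "egg", "egg": "egg", "c": "cell", "cell": "cell"}


def strip_pool_suffix(strain: str) -> str:
    """Remove a trailing "pool" suffix from a serum strain name."""
    return POOL_SUFFIX.sub("", strain.strip())


def standardize_passage(raw: str) -> str:
    """Map a raw passage label to "egg", "cell", or "unknown"."""
    return PASSAGE_LABELS.get(raw.strip().lower(), "unknown")
```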
In order to include the "assay_date" in the uploaded data, the VIDRL
column needs to be "date" so that it can be parsed within `elife_upload`
as "assay_date".

This is an ugly workaround, but it's similar to how cdc_upload handles
the field.¹

¹ <https://github.com/nextstrain/fauna/blob/b133974275ee1ed4e91816c76db6b7616247b6dc/tdb/cdc_upload.py#L58>
Validate records in a single flat file. Ensure that the serum
abbreviations map to a single serum strain and all records have the
same test date.

As a side effect, the validated `serum_abbr_map` and `test_date` are
returned to be used for processing the reference panel records in
following commits.
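
The validation described above could be sketched roughly as follows. The record keys `serum_abbr`, `serum_strain`, and `test_date` are assumed names for illustration. Only the returned `serum_abbr_map` and `test_date` come from the commit message.

```python
from typing import Dict, Iterable, Tuple


def validate_records(records: Iterable[dict]) -> Tuple[Dict[str, str], str]:
    """Check one flat file's records and return (serum_abbr_map, test_date).

    Raises ValueError if a serum abbreviation maps to more than one serum
    strain, or if the records do not all share exactly one test date.
    """
    serum_abbr_map: Dict[str, str] = {}
    test_dates = set()
    for record in records:
        abbr = record["serum_abbr"]
        strain = record["serum_strain"]
        # setdefault stores the first strain seen for this abbreviation
        # and returns the stored one on later records.
        known = serum_abbr_map.setdefault(abbr, strain)
        if known != strain:
            raise ValueError(
                f"Serum abbreviation {abbr!r} maps to both {known!r} and {strain!r}"
            )
        test_dates.add(record["test_date"])
    if len(test_dates) != 1:
        raise ValueError(f"Expected one test date, found {sorted(test_dates)}")
    return serum_abbr_map, test_dates.pop()
```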
Pull out curation into individual functions that can be shared with the
curation of the _reference_panel.csv file.
Ingests the matching "*_reference_panel.csv" for a provided
"*_flat_file" fstem if the reference panel file exists. The records
parsed from the reference panel file are appended to the same tmp
file that is then passed to elife_upload.py.

This currently includes "extra" records in comparison to Excel files,
where the human sera pool strain is the "test virus" against the other
references. If I strip the `pool` suffix from human sera pool strain,
the measurements are exact duplicates of the measurements for the
matching reference strain. We will need to decide whether or not these
records should be dropped.
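
One way to locate the matching reference panel file, shown as a sketch: deriving the name by swapping the `_flat_file` suffix for `_reference_panel.csv` is an assumption based on the file names quoted in this PR, and `find_reference_panel` is a hypothetical helper.

```python
from pathlib import Path
from typing import Optional

FLAT_SUFFIX = "_flat_file"
PANEL_SUFFIX = "_reference_panel.csv"


def find_reference_panel(directory: Path, fstem: str) -> Optional[Path]:
    """Return the matching reference panel path, or None if there is none."""
    if not fstem.endswith(FLAT_SUFFIX):
        return None
    panel = directory / (fstem[: -len(FLAT_SUFFIX)] + PANEL_SUFFIX)
    return panel if panel.is_file() else None
```

Returning `None` lets the caller fall back to ingesting only the flat file when no panel file is present.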
Based on a comment in Slack¹ that the "e" or "c" suffix in the serum ID
is not a reliable indicator of human serum passage.

¹ <https://bedfordlab.slack.com/archives/C03KWDET9/p1728430958054989?thread_ts=1699914235.686809&cid=C03KWDET9>
Based on a meeting with VIDRL, we should only keep homologous titers
for a `virus_strain` that includes the "pool" suffix. This will act as a proxy
homologous titer for the human serum references. All other virus strains
that include the "pool" suffix are ignored because they are duplicate
data.
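
The filter described above might be sketched like this. It assumes that "homologous" can be detected by comparing the virus and serum strain names after stripping the "pool" suffix, and that the suffix is separated from the strain name by whitespace, an underscore, or a hyphen. Both are assumptions, as is the `serum_strain` key.

```python
import re

# Assumed suffix form: a separator followed by "pool" at the end of the name.
POOL_SUFFIX = re.compile(r"[\s_-]+pool$", re.IGNORECASE)


def strip_pool(strain: str) -> str:
    """Remove a trailing "pool" suffix from a strain name."""
    return POOL_SUFFIX.sub("", strain.strip())


def keep_record(record: dict) -> bool:
    """Keep non-pool viruses; keep pool viruses only when homologous."""
    virus = record["virus_strain"].strip()
    if not POOL_SUFFIX.search(virus):
        return True
    return strip_pool(virus) == strip_pool(record["serum_strain"])


records = [
    {"virus_strain": "A/Example/1/2024 pool", "serum_strain": "A/Example/1/2024 pool"},
    {"virus_strain": "A/Example/1/2024 pool", "serum_strain": "A/Other/2/2023"},
    {"virus_strain": "A/Other/2/2023", "serum_strain": "A/Example/1/2024 pool"},
]
kept = [record for record in records if keep_record(record)]
```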
Based on a meeting with VIDRL, a/b and _1/_2 reference panel files are
created from the same Excel file so they are duplicates while capital
A/B files are separate assays.

So, this change allows us to check for the a/b and _1/_2 patterns and
ignore the reference panel file if it's a duplicate. This means we
always ingest the a or _1 file but ignore the b and _2 files.
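
A sketch of the duplicate check, with a loud caveat: this PR does not show where the a/b or _1/_2 marker sits in the file name. The pattern below assumes it appears directly before `.xlsx` in the file stem (for example `0902b.xlsx_H3_reference_panel`), which is purely hypothetical.

```python
import re

# Case-sensitive on purpose: lowercase "b" and "_2" mark duplicates, while
# capital "B" files are separate assays and must still be ingested.
# Assumed position: immediately before ".xlsx_" in the file stem.
DUPLICATE_PANEL = re.compile(r"(?:\db|_2)\.xlsx_")


def is_duplicate_reference_panel(fstem: str) -> bool:
    """Return True when the reference panel file should be skipped."""
    return bool(DUPLICATE_PANEL.search(fstem))
```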
Use the latest flat file column `original designation` to get the
original strain name that has not gone through VIDRL's strain name
standardizations.

Successfully merging this pull request may close these issues.

Revisit ingestion of VIDRL flat files