Merge branch 'master' into clean-metadata
eharkins committed Aug 10, 2020
2 parents 668041a + cfc5009 commit 5b19e00
Showing 9 changed files with 8,645 additions and 1,577 deletions.
@@ -1,4 +1,4 @@
-name: '[branch] Ingest 2019-nCov/SARS-CoV-2 data from GISAID for nextstrain.org/ncov'
+name: '[branch] Fetch & Ingest 2019-nCov/SARS-CoV-2 data from GISAID for nextstrain.org/ncov'
 
 on:
   push:
@@ -18,7 +18,7 @@ jobs:
         python3 -m pip install --upgrade pip setuptools
         python3 -m pip install pipenv
         pipenv sync
-        pipenv run ./bin/ingest-gisaid
+        pipenv run ./bin/ingest-gisaid --fetch
       env:
         AWS_DEFAULT_REGION: ${{ secrets.AWS_DEFAULT_REGION }}
         AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
27 changes: 27 additions & 0 deletions .github/workflows/fetch-and-ingest-gisaid-master.yml
@@ -0,0 +1,27 @@
+name: 'Fetch & Ingest 2019-nCov/SARS-CoV-2 data from GISAID for nextstrain.org/ncov'
+
+on:
+  # Manually triggered using `./bin/trigger fetch-and-ingest`
+  repository_dispatch:
+    types: fetch-and-ingest
+
+jobs:
+  ingest:
+    runs-on: ubuntu-latest
+    steps:
+    - uses: actions/checkout@v1
+    - name: ingest
+      run: |
+        PATH="$HOME/.local/bin:$PATH"
+        python3 -m pip install --upgrade pip setuptools
+        python3 -m pip install pipenv
+        pipenv sync
+        pipenv run ./bin/ingest-gisaid --fetch
+      env:
+        AWS_DEFAULT_REGION: ${{ secrets.AWS_DEFAULT_REGION }}
+        AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
+        AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
+        GISAID_API_ENDPOINT: ${{ secrets.GISAID_API_ENDPOINT }}
+        GISAID_USERNAME_AND_PASSWORD: ${{ secrets.GISAID_USERNAME_AND_PASSWORD }}
+        SLACK_TOKEN: ${{ secrets.SLACK_TOKEN }}
+        SLACK_CHANNELS: ncov-gisaid-updates
2 changes: 0 additions & 2 deletions .github/workflows/ingest-gisaid-master.yml
@@ -29,7 +29,5 @@ jobs:
         AWS_DEFAULT_REGION: ${{ secrets.AWS_DEFAULT_REGION }}
         AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
         AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
-        GISAID_API_ENDPOINT: ${{ secrets.GISAID_API_ENDPOINT }}
-        GISAID_USERNAME_AND_PASSWORD: ${{ secrets.GISAID_USERNAME_AND_PASSWORD }}
         SLACK_TOKEN: ${{ secrets.SLACK_TOKEN }}
         SLACK_CHANNELS: ncov-gisaid-updates
16 changes: 12 additions & 4 deletions README.md
@@ -11,15 +11,23 @@
 If you're using Pipenv (see below), then run commands from `./bin/…` inside a `pipenv shell`.
 
 ## Running automatically
 The ingest pipeline exists as the GitHub workflows `.github/workflows/ingest-master-*.yml` and `…/ingest-branch-*.yml`.
-It is run on pushes to `master` that modify `source-data/annotations.tsv` and on pushes to other branches.
+It is run on pushes to `master` that modify `source-data/*-annotations.tsv` and on pushes to other branches.
 Pushes to branches other than `master` upload files to branch-specific paths in the S3 bucket, don't send notifications, and don't trigger Nextstrain rebuilds, so that they don't interfere with the production data.
 
 AWS credentials are stored in this repository's secrets and are associated with the `nextstrain-ncov-ingest-uploader` IAM user in the Bedford Lab AWS account, which is locked down to reading and publishing only the `gisaid.ndjson`, `metadata.tsv`, and `sequences.fasta` files and their zipped equivalents in the `nextstrain-ncov-private` S3 bucket.
 
 ## Manually triggering the automation
-You can manually trigger the full automation by running `./bin/trigger ingest --user <your-github-username>`.
-If you want to only trigger a rebuild of [nextstrain/ncov](https://github.com/nextstrain/ncov) without re-ingesting data from GISAID first, run `./bin/trigger rebuild --user <your-github-username>`.
-See the output of `./bin/trigger ingest` or `./bin/trigger rebuild` for more information about authentication with GitHub.
+A full run is now done in 3 steps via manual triggers:
+1. Fetch new sequences and ingest them by running `./bin/trigger fetch-and-ingest --user <your-github-username>`.
+2. Add manual annotations, update location hierarchy as needed, and run ingest without fetching new sequences.
+   * Pushes of `source-data/*-annotations.tsv` to the master branch will automatically trigger a run of ingest.
+   * You can also run ingest manually by running `./bin/trigger ingest --user <your-github-username>`.
+3. Once all manual fixes are complete, trigger a rebuild of [nextstrain/ncov](https://github.com/nextstrain/ncov) by running `./bin/trigger rebuild --user <your-github-username>`.
+
+See the output of `./bin/trigger fetch-and-ingest --user <your-github-username>`, `./bin/trigger ingest` or `./bin/trigger rebuild` for more information about authentication with GitHub.
+
+Note: running `./bin/trigger` posts a GitHub `repository_dispatch`.
+Regardless of which branch you are on, it will trigger the specified action on the master branch.
 
 ## Updating manual annotations
 Manual annotations should be added to `source-data/gisaid_annotations.tsv`.
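Under the hood, a `repository_dispatch` trigger is just an authenticated POST to the GitHub API. The following is a minimal sketch of that mechanism, not the actual `./bin/trigger` implementation; the `GITHUB_TOKEN` variable and the `nextstrain/ncov-ingest` repo slug are assumptions here:

```shell
#!/bin/bash
# Sketch: how a repository_dispatch event can be posted to GitHub.
# GITHUB_TOKEN and the repo slug below are illustrative assumptions.

build_dispatch_payload() {
    # Build the JSON body GitHub expects for a dispatch event.
    local event_type="$1"
    printf '{"event_type": "%s"}' "$event_type"
}

trigger() {
    # POST the dispatch; the receiving workflow matches on `types:`.
    local event_type="$1"
    curl -sS -X POST \
        -H "Authorization: token ${GITHUB_TOKEN}" \
        -H "Accept: application/vnd.github.v3+json" \
        -d "$(build_dispatch_payload "$event_type")" \
        "https://api.github.com/repos/nextstrain/ncov-ingest/dispatches"
}

build_dispatch_payload fetch-and-ingest    # → {"event_type": "fetch-and-ingest"}
```

Authentication details differ in the real helper; as the README says, see the script's own output for how it authenticates with GitHub.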
156 changes: 102 additions & 54 deletions bin/ingest-gisaid
@@ -1,69 +1,117 @@
 #!/bin/bash
+# usage: ingest-gisaid [--fetch]
+#        ingest-gisaid --help
+#
+# Ingest SARS-CoV-2 metadata and sequences from GISAID.
+#
+# If the --fetch flag is given, new records are fetched from GISAID. Otherwise,
+# ingest from the existing GISAID NDJSON file on S3.
+#
 set -euo pipefail
 
 : "${S3_SRC:=s3://nextstrain-ncov-private}"
 : "${S3_DST:=$S3_SRC}"
 
-# Determine where to save data files based on if we're running as a result of a
-# push to master or to another branch (or locally, outside of the GitHub
-# workflow). Files are always compared to the default/primary paths in the
-# source S3 bucket.
-#
-silent=
-branch=
+main() {
+    local fetch=0
+
+    for arg; do
+        case "$arg" in
+            -h|--help)
+                print-help
+                exit
+                ;;
+            --fetch)
+                fetch=1
+                shift
+                break
+                ;;
+        esac
+    done
+
+    # Determine where to save data files based on if we're running as a result of a
+    # push to master or to another branch (or locally, outside of the GitHub
+    # workflow). Files are always compared to the default/primary paths in the
+    # source S3 bucket.
+    #
+    local silent=
+    local branch=
+
+    case "${GITHUB_REF:-}" in
+        refs/heads/master)
+            # Do nothing different; defaults above are good.
+            branch=master
+            ;;
+        refs/heads/*)
+            # Save data files under a per-branch prefix
+            silent=yes
+            branch="${GITHUB_REF##refs/heads/}"
+            S3_DST="$S3_DST/branch/$branch"
+            ;;
+        "")
+            # Save data files under a tmp prefix
+            silent=yes
+            S3_DST="$S3_DST/tmp"
+            ;;
+        *)
+            echo "Skipping ingest for ref $GITHUB_REF"
+            exit 0
+            ;;
+    esac
+
+    echo "S3_SRC is $S3_SRC"
+    echo "S3_DST is $S3_DST"
 
-case "${GITHUB_REF:-}" in
-    refs/heads/master)
-        # Do nothing different; defaults above are good.
-        branch=master
-        ;;
-    refs/heads/*)
-        # Save data files under a per-branch prefix
-        silent=yes
-        branch="${GITHUB_REF##refs/heads/}"
-        S3_DST="$S3_DST/branch/$branch"
-        ;;
-    "")
-        # Save data files under a tmp prefix
-        silent=yes
-        S3_DST="$S3_DST/tmp"
-        ;;
-    *)
-        echo "Skipping ingest for ref $GITHUB_REF"
-        exit 0
-        ;;
-esac
+    cd "$(dirname "$0")/.."
 
-echo "S3_SRC is $S3_SRC"
-echo "S3_DST is $S3_DST"
+    set -x
 
-cd "$(dirname "$0")/.."
+    if [[ "$fetch" == 1 ]]; then
+        ./bin/fetch-from-gisaid > data/gisaid.ndjson
+        if [[ "$branch" == master ]]; then
+            ./bin/notify-on-record-change data/gisaid.ndjson "$S3_SRC/gisaid.ndjson.gz" "GISAID"
+        fi
+        ./bin/upload-to-s3 --quiet data/gisaid.ndjson "$S3_DST/gisaid.ndjson.gz"
+    else
+        aws s3 cp --no-progress "$S3_DST/gisaid.ndjson.gz" - | gunzip -cfq > data/gisaid.ndjson
+    fi
 
-set -x
+    ./bin/transform-gisaid data/gisaid.ndjson \
+        --output-metadata data/gisaid/metadata.tsv \
+        --output-fasta data/gisaid/sequences.fasta
 
-./bin/fetch-from-gisaid > data/gisaid.ndjson
-if [[ "$branch" == master ]]; then
-    ./bin/notify-on-record-change data/gisaid.ndjson "$S3_SRC/gisaid.ndjson.gz" "GISAID"
-fi
-./bin/upload-to-s3 --quiet data/gisaid.ndjson "$S3_DST/gisaid.ndjson.gz"
+    ./bin/flag-metadata data/gisaid/metadata.tsv > data/gisaid/flagged_metadata.txt
+    ./bin/check-locations data/gisaid/metadata.tsv \
+        data/gisaid/location_hierarchy.tsv \
+        gisaid_epi_isl
 
-./bin/transform-gisaid data/gisaid.ndjson \
-    --output-metadata data/gisaid/metadata.tsv \
-    --output-fasta data/gisaid/sequences.fasta
+    if [[ "$branch" == master ]]; then
+        ./bin/notify-on-metadata-change data/gisaid/metadata.tsv "$S3_SRC/metadata.tsv.gz" gisaid_epi_isl
+        ./bin/notify-on-additional-info-change data/gisaid/additional_info.tsv "$S3_SRC/additional_info.tsv.gz"
+        ./bin/notify-on-flagged-metadata-change data/gisaid/flagged_metadata.txt "$S3_SRC/flagged_metadata.txt.gz"
+        ./bin/notify-on-location-hierarchy-addition data/gisaid/location_hierarchy.tsv source-data/location_hierarchy.tsv
+    fi
 
-./bin/flag-metadata data/gisaid/metadata.tsv > data/gisaid/flagged_metadata.txt
-./bin/check-locations data/gisaid/metadata.tsv \
-    data/gisaid/location_hierarchy.tsv \
-    gisaid_epi_isl
+    ./bin/upload-to-s3 ${silent:+--quiet} data/gisaid/metadata.tsv "$S3_DST/metadata.tsv.gz"
+    ./bin/upload-to-s3 ${silent:+--quiet} data/gisaid/additional_info.tsv "$S3_DST/additional_info.tsv.gz"
+    ./bin/upload-to-s3 ${silent:+--quiet} data/gisaid/flagged_metadata.txt "$S3_DST/flagged_metadata.txt.gz"
+    ./bin/upload-to-s3 ${silent:+--quiet} data/gisaid/sequences.fasta "$S3_DST/sequences.fasta.gz"
+}
 
-if [[ "$branch" == master ]]; then
-    ./bin/notify-on-metadata-change data/gisaid/metadata.tsv "$S3_SRC/metadata.tsv.gz" gisaid_epi_isl
-    ./bin/notify-on-additional-info-change data/gisaid/additional_info.tsv "$S3_SRC/additional_info.tsv.gz"
-    ./bin/notify-on-flagged-metadata-change data/gisaid/flagged_metadata.txt "$S3_SRC/flagged_metadata.txt.gz"
-    ./bin/notify-on-location-hierarchy-addition data/gisaid/location_hierarchy.tsv source-data/location_hierarchy.tsv
-fi
+print-help() {
+    # Print the help comments at the top of this file ($0)
+    local line
+    while read -r line; do
+        if [[ $line =~ ^#! ]]; then
+            continue
+        elif [[ $line =~ ^# ]]; then
+            line="${line/##/}"
+            line="${line/# /}"
+            echo "$line"
+        else
+            break
+        fi
+    done < "$0"
+}
 
-./bin/upload-to-s3 ${silent:+--quiet} data/gisaid/metadata.tsv "$S3_DST/metadata.tsv.gz"
-./bin/upload-to-s3 ${silent:+--quiet} data/gisaid/additional_info.tsv "$S3_DST/additional_info.tsv.gz"
-./bin/upload-to-s3 ${silent:+--quiet} data/gisaid/flagged_metadata.txt "$S3_DST/flagged_metadata.txt.gz"
-./bin/upload-to-s3 ${silent:+--quiet} data/gisaid/sequences.fasta "$S3_DST/sequences.fasta.gz"
+main "$@"
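Two bash parameter expansions carry most of this script's branching: `${silent:+--quiet}` adds the flag only when `silent` is non-empty, and `${GITHUB_REF##refs/heads/}` strips the ref prefix to get a bare branch name. A standalone sketch (the values are illustrative):

```shell
#!/bin/bash
# ${var:+word} expands to `word` only when var is set and non-empty;
# this is how silent=yes toggles --quiet on the upload commands.
silent=
echo "upload${silent:+ --quiet}"    # → upload
silent=yes
echo "upload${silent:+ --quiet}"    # → upload --quiet

# ${var##pattern} removes the longest matching prefix, turning a Git ref
# into the bare branch name used for the per-branch S3 prefix.
GITHUB_REF="refs/heads/clean-metadata"
branch="${GITHUB_REF##refs/heads/}"
echo "$branch"                      # → clean-metadata
```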
2 changes: 1 addition & 1 deletion bin/trigger
@@ -2,7 +2,7 @@
 set -euo pipefail
 
 bin="$(dirname "$0")"
-event_type="${1:?An event type ("ingest" or "rebuild") is required as the first argument.}"
+event_type="${1:?An event type ("fetch-and-ingest", "ingest" or "rebuild") is required as the first argument.}"
 shift
 
 if [[ $# -eq 0 ]]; then
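The `${1:?message}` expansion in `bin/trigger` aborts with `message` on stderr when the first argument is missing. A minimal sketch of that behavior (the function name is hypothetical, not part of the repo):

```shell
#!/bin/bash
# ${parameter:?message} prints `message` to stderr and exits a
# non-interactive shell when the parameter is unset or empty.
require_event_type() {
    local event_type="${1:?An event type is required as the first argument.}"
    echo "dispatching $event_type"
}

require_event_type rebuild                  # → dispatching rebuild

# Run in a subshell so the failure can be caught instead of
# terminating this demo script.
( require_event_type ) 2>/dev/null \
    || echo "missing argument"              # → missing argument
```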