Merge pull request #220 from nextstrain/yellow-fever-dataset
Add yellow fever virus dataset
corneliusroemer authored Oct 18, 2024
2 parents f0e8d1e + c17cd1d commit 745ffb9
Showing 18 changed files with 27,854 additions and 3 deletions.
3 changes: 2 additions & 1 deletion data/nextstrain/collection.json
@@ -55,6 +55,7 @@
"nextstrain/flu/h3n2/pb2",
"nextstrain/measles",
"nextstrain/measles/N450/WHO-2012",
-"nextstrain/dengue/all"
+"nextstrain/dengue/all",
+"nextstrain/yellow-fever/prM-E"
]
}
3 changes: 3 additions & 0 deletions data/nextstrain/yellow-fever/prM-E/CHANGELOG.md
@@ -0,0 +1,3 @@
## Unreleased

Initial release of yellow fever virus (prM-E region only) dataset.
60 changes: 60 additions & 0 deletions data/nextstrain/yellow-fever/prM-E/README.md
@@ -0,0 +1,60 @@
# Yellow fever virus (prM-E region only) dataset

| Key | Value |
| ----------------- | -----------------------------------------------------------------|
| name | Yellow fever virus (YFV) prM-E region |
| authors | [Nextstrain](https://nextstrain.org) |
| reference | AY640589.1 |
| workflow | <https://github.com/nextstrain/yellow-fever/tree/main/nextclade> |
| path | `nextstrain/yellow-fever/prM-E` |

## Scope of this dataset

This dataset assigns clades to yellow fever virus samples based on
strain and genotype information from [Mutebi et al.][] (J Virol. 2001
Aug;75(15):6999-7008) and [Bryant et al.][] (PLoS Pathog. 2007 May 18;3(5):e75).

These two papers, collectively, define 7 distinct yellow fever virus
genotypes based on a 670 nucleotide region of the yellow fever virus
genome (bases 641-1310), called the prM-E region. This region
comprises the 3' end of the pre-membrane protein (prM) gene, the
entire membrane protein (M) gene, and the 5' end of the envelope
protein (E) gene.

The clades we annotate (Clade I-VII) are roughly equivalent to the
following genotypes, as described in those two papers:

| Clade | Genotype |
|-----------|---------------------|
| Clade I | Angola |
| Clade II | East Africa |
| Clade III | East/Central Africa |
| Clade IV | West Africa I |
| Clade V | West Africa II |
| Clade VI | South America I |
| Clade VII | South America II |

(N.B., the reference sequence used in this dataset is actually 672 nt
long, from bases 641-1312 of the genome reference. The 2 extra bases
make the reference a complete open reading frame.)
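
As an illustration of the coordinate arithmetic above, here is a minimal sketch (not part of the dataset workflow) that treats both intervals as 1-based and inclusive. The variable and function names are hypothetical.

```python
# Coordinates are 1-based and inclusive, as in the text above.
PUBLISHED_REGION = (641, 1310)   # prM-E region as defined in the two papers
DATASET_REFERENCE = (641, 1312)  # region used for this dataset's reference


def region_length(start: int, end: int) -> int:
    """Length of a 1-based, inclusive interval."""
    return end - start + 1


published_len = region_length(*PUBLISHED_REGION)
reference_len = region_length(*DATASET_REFERENCE)

# The published region is not a whole number of codons ...
print(published_len, published_len % 3)
# ... while the extended reference is, which is why 2 bases were added.
print(reference_len, reference_len % 3, reference_len // 3)
```

The first line reports a remainder of 1 base for the 670 nt region; the second shows that 672 nt divides evenly into 224 codons.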

This dataset can be used to assign genotypes to any sequence that
includes at least 500 bp of the prM-E region, including whole genome
sequences. Sequence data beyond the prM-E region will be reported as an
insertion in the Nextclade output.
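
A minimal sketch of how the dataset might be fetched and run with the Nextclade CLI, assuming Nextclade v3 is installed and the dataset is available from the dataset server under the path listed above. The local directory and file names are hypothetical, and the script only prints the commands rather than running them.

```python
import shlex

DATASET_NAME = "nextstrain/yellow-fever/prM-E"  # "path" from the table above
DATASET_DIR = "datasets/yfv-prM-E"              # hypothetical local directory
INPUT_FASTA = "my_sequences.fasta"              # hypothetical input file
OUTPUT_DIR = "nextclade_output"                 # hypothetical output directory

# Download the dataset files into DATASET_DIR.
get_cmd = [
    "nextclade", "dataset", "get",
    "--name", DATASET_NAME,
    "--output-dir", DATASET_DIR,
]

# Align the input sequences against the dataset and write all outputs.
run_cmd = [
    "nextclade", "run",
    "--input-dataset", DATASET_DIR,
    "--output-all", OUTPUT_DIR,
    INPUT_FASTA,
]

for cmd in (get_cmd, run_cmd):
    print(shlex.join(cmd))
    # To execute for real: subprocess.run(cmd, check=True)
```

For whole genome input, expect the bases outside the 672 nt reference to appear as insertions in the results, as noted above.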

## Features

This dataset supports:

- Assignment of genotypes
- Phylogenetic placement
- Sequence quality control (QC)

## What are Nextclade datasets?

Read more about Nextclade datasets in the Nextclade documentation:
<https://docs.nextstrain.org/projects/nextclade/en/stable/user/datasets.html>

[Mutebi et al.]: https://pubmed.ncbi.nlm.nih.gov/11435580/
[Bryant et al.]: https://pubmed.ncbi.nlm.nih.gov/17511518/
5 changes: 5 additions & 0 deletions data/nextstrain/yellow-fever/prM-E/genome_annotation.gff3
@@ -0,0 +1,5 @@
##sequence-region prM-E 1 672
NC_002031.1 feature source 1 672 . + . gene=nuc
NC_002031.1 feature gene 1 333 . + . gene_name=prM
NC_002031.1 feature gene 109 333 . + . gene_name=M
NC_002031.1 feature gene 334 672 . + . gene_name=E
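
The annotation coordinates above are relative to the 672 nt reference, not to the full genome. The following sketch (illustrative only, with hypothetical names) checks that each gene is a whole number of codons and maps it back to genome coordinates, assuming reference position 1 corresponds to genome position 641 as stated in the README.

```python
GENOME_OFFSET = 640  # reference position 1 == genome position 641

# (name, start, end) copied from genome_annotation.gff3; 1-based, inclusive.
GENES = [("prM", 1, 333), ("M", 109, 333), ("E", 334, 672)]


def describe(name: str, start: int, end: int) -> dict:
    """Summarise one gene: length, codon count, frame, genome coordinates."""
    length = end - start + 1
    return {
        "gene": name,
        "length_nt": length,
        "codons": length // 3,
        # In frame: starts on a codon boundary and has no leftover bases.
        "in_frame": (start - 1) % 3 == 0 and length % 3 == 0,
        "genome_start": start + GENOME_OFFSET,
        "genome_end": end + GENOME_OFFSET,
    }


summary = [describe(*gene) for gene in GENES]
for row in summary:
    print(row)
```

Note that M is nested inside prM here, which is consistent with M being the mature portion of the prM precursor protein.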
51 changes: 51 additions & 0 deletions data/nextstrain/yellow-fever/prM-E/pathogen.json
@@ -0,0 +1,51 @@
{
"files": {
"reference": "reference.fasta",
"pathogenJson": "pathogen.json",
"genomeAnnotation": "genome_annotation.gff3",
"treeJson": "tree.json",
"examples": "sequences.fasta",
"readme": "README.md",
"changelog": "CHANGELOG.md"
},
"attributes": {
"name": "Yellow fever virus (YFV) prM-E region",
"reference name": "Asibi",
"reference accession": "AY640589.1"
},
"schemaVersion": "3.0.0",
"alignmentParams": {
"minSeedCover": 0.01
},
"qc": {
"missingData": {
"enabled": true,
"missingDataThreshold": 20,
"scoreBias": 4
},
"mixedSites": {
"enabled": true,
"mixedSitesThreshold": 4
},
"frameShifts": {
"enabled": true
},
"stopCodons": {
"enabled": true
},
"privateMutations": {
"enabled": true,
"cutoff": 12,
"typical": 4,
"weightLabeledSubstitutions": 1,
"weightReversionSubstitutions": 1,
"weightUnlabeledSubstitutions": 1
},
"snpClusters": {
"enabled": true,
"clusterCutOff": 3,
"scoreWeight": 50,
"windowSize": 50
}
}
}
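
To make the QC configuration above easier to review, here is a small sketch that parses the `qc` section and lists which rules are enabled and with which parameters. The JSON is copied from the file above; the helper names are hypothetical, and the sketch does not reproduce how Nextclade itself computes QC scores.

```python
import json

# "qc" section copied verbatim from pathogen.json above.
QC_JSON = """
{
  "missingData": {"enabled": true, "missingDataThreshold": 20, "scoreBias": 4},
  "mixedSites": {"enabled": true, "mixedSitesThreshold": 4},
  "frameShifts": {"enabled": true},
  "stopCodons": {"enabled": true},
  "privateMutations": {
    "enabled": true, "cutoff": 12, "typical": 4,
    "weightLabeledSubstitutions": 1,
    "weightReversionSubstitutions": 1,
    "weightUnlabeledSubstitutions": 1
  },
  "snpClusters": {
    "enabled": true, "clusterCutOff": 3, "scoreWeight": 50, "windowSize": 50
  }
}
"""

qc = json.loads(QC_JSON)

# Names of all rules that are switched on.
enabled_rules = sorted(name for name, rule in qc.items() if rule.get("enabled"))

# Parameters of each rule, without the "enabled" flag.
rule_params = {
    name: {key: value for key, value in rule.items() if key != "enabled"}
    for name, rule in qc.items()
}

for name in enabled_rules:
    print(name, rule_params[name] or "(no parameters)")
```

All six rules are enabled; `frameShifts` and `stopCodons` carry no extra parameters in this configuration.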
13 changes: 13 additions & 0 deletions data/nextstrain/yellow-fever/prM-E/reference.fasta
@@ -0,0 +1,13 @@
> prM-E region (genome 641-1312, 672 nt)
CCAAGAGAGGAGCCAGATGACATTGATTGCTGGTGCTATGGGGTGGAAAACGTTAGAGTC
GCATATGGTAAGTGTGACTCAGCAGGCAGGTCTAGGAGGTCAAGAAGGGCCATTGACTTG
CCTACGCATGAAAACCATGGTTTGAAGACCCGGCAAGAAAAATGGATGACTGGAAGAATG
GGTGAAAGGCAACTCCAAAAGATTGAGAGATGGCTCGTGAGGAACCCCTTTTTTGCAGTG
ACAGCTCTGACCATTGCCTACCTTGTGGGAAGCAACATGACGCAACGAGTCGTGATTGCC
CTACTGGTCTTGGCTGTTGGTCCGGCCTACTCAGCTCACTGCATTGGAATTACTGACAGG
GATTTCATTGAGGGGGTGCATGGAGGAACTTGGGTTTCAGCTACCCTGGAGCAAGACAAG
TGTGTCACTGTTATGGCCCCTGACAAGCCTTCATTGGACATCTCACTAGAGACAGTAGCC
ATTGATGGACCTGCTGAGGCGAGGAAAGTGTGTTACAATGCAGTTCTCACTCATGTGAAG
ATTAATGACAAGTGCCCCAGCACTGGAGAGGCCCACCTAGCTGAAGAGAACGAAGGGGAC
AATGCGTGCAAGCGCACTTATTCTGATAGAGGCTGGGGCAATGGCTGTGGCCTATTTGGG
AAAGGGAGCATT