Compile inputs into "annotation bundles" that are downloadable #938

chrisamiller · 2020-08-28T21:32:38Z

Initially, just for common species (human, mouse), with updates timed to coincide with version releases.

chrisamiller · 2020-09-02T13:42:19Z

I've created annotation bundles containing the inputs outlined on the Gathering Input Files wiki page. They run between 50 and 100 GB unzipped, and I don't think there's much room for improvement in size.

/gscmnt/gc2761/aml_ppg/tmp/annotation_bundles/GRCh38_ensembl95
/gscmnt/gc2761/aml_ppg/tmp/annotation_bundles/GRCm38_ensembl101
/gscmnt/gc2761/aml_ppg/tmp/annotation_bundles/GRCm38_ensembl95

At this point, I'd like to solicit feedback on

what's missing?

Cellranger references are behind a TOS so we can't include them, but we will include instructions for adding them
is it worth including a "standard" exome target region based on the ensembl GTF?
what else?

Whether it is desirable to provide this bundle as a single input for the pipelines (or make some sort of wrapper script that can fill in the appropriate yaml fields after being pointed to the bundle?)
How we can distribute these: GCP bucket? Locally hosted HTTP/FTP server?
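
To make the wrapper-script idea concrete, here is a minimal sketch of what such a script could do once pointed at a bundle directory. The input names and the relative paths in `BUNDLE_LAYOUT` are hypothetical placeholders, not the real bundle layout or the real pipeline input names. The output is written as JSON, on the assumption that the CWL runner in use accepts a JSON inputs file in place of YAML.

```python
import json
from pathlib import Path

# Hypothetical mapping from pipeline input names to paths inside a bundle.
# Neither the names nor the relative paths are taken from the real bundles.
BUNDLE_LAYOUT = {
    "reference": "reference_genome/all_sequences.fa",
    "vep_cache_dir": "vep_cache",
    "gene_transcript_lookup_table": "rna_seq_annotation/transcriptToGene.tsv",
}


def inputs_from_bundle(bundle_dir, layout=BUNDLE_LAYOUT):
    """Build CWL-style File/Directory entries for everything found in the bundle.

    Returns (inputs, missing), where `missing` lists the input names whose
    expected path does not exist under `bundle_dir`.
    """
    bundle = Path(bundle_dir)
    inputs = {}
    missing = []
    for name, relative_path in layout.items():
        path = bundle / relative_path
        if path.is_dir():
            inputs[name] = {"class": "Directory", "path": str(path)}
        elif path.is_file():
            inputs[name] = {"class": "File", "path": str(path)}
        else:
            missing.append(name)
    return inputs, missing


def write_inputs(inputs, out_path):
    """Write the inputs object to disk as indented JSON."""
    Path(out_path).write_text(json.dumps(inputs, indent=2) + "\n")
```

Anything reported in `missing` (for example the Cellranger references, which can't be shipped) would still have to be filled in by hand.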

malachig · 2020-09-02T14:28:12Z

This is awesome.

A couple thoughts:

Under Aligner indices it lists "HISAT Index (for older pipeline versions only)". I assume this is because STAR is used now. But the other RNA aligner index is labelled as the STAR-fusion index. Is it specific to fusion calling, or does it work generally with the STAR aligner? Maybe clarify that?
I don't see any mention of the alt contigs file for bwa-mem? Since this can lead to great confusion, perhaps it is worth special mention?
Isn't there a file that helps convert between chromosome names that is still used by some tools in the pipeline? e.g. /storage1/fs1/bga/Active/gmsroot/gc2560/core/model_data/2887491634/build50f99e75d14340ffb5b7d21b03887637/chromAlias.ensembl.txt
My recent immuno.cwl test used this file: /storage1/fs1/bga/Active/gmsroot/gc2560/core/GRC-human-build38_human_95_38_U2AF1_fix/rna_seq_annotation/ensembl95.transcriptToGene.tsv. Is this deprecated now, perhaps?
Similarly, cosmic_vcf: /storage1/fs1/bga/Active/gmsroot/gc2560/core/build_merged_alignments/detect-variants--linus2112.gsc.wustl.edu-jwalker-774-52fa2dadc4ba4b6490abe0701907d394/snvs.hq.vcf.gz?
Similarly, custom_clinvar_vcf: /storage1/fs1/bga/Active/gmsroot/gc2560/core/custom_clinvar_vcf/v20181028/custom.vcf.gz?
Similarly, panel_of_normals_vcf: /storage1/fs1/bga/Active/gmsroot/gc2560/core/build_merged_alignments/detect-variants--linus2112.gsc.wustl.edu-jwalker-2237-db881b860992443da9d6aac8b36a7ea6/snvs.hq.vcf.gz? This one, I am not sure what it is used for...

malachig · 2020-09-02T14:32:56Z

For distributing them, I think a Google and/or AWS S3 bucket would be appropriate. If we run a file server, we will generally have to do that on a cloud server anyway, and the attached volume will be active disk, which costs more per month than just placing the files in a bucket.

We have been using genomedata.org for this kind of thing for a while now, but it is surprising how much it costs to keep it running and maintain a backup.

chrisamiller · 2020-09-02T14:44:56Z

Great feedback - thanks!

Yeah, it's my understanding that star fusion contains the star index plus some other stuff, and that it's all that's needed (@sridhar0605 - can you confirm this?)
ooh yeah - we need the alt contigs file in there. Good call
that file is included in the VEP cache, and we do make use of it when building the bundle, in order to convert the gtf. See https://github.com/genome/analysis-workflows/wiki/Gathering-input-files#genestranscripts-gtf-
Each one of them has that ensembl95.transcriptToGene.tsv equivalent in the rna_seq_annotation directory
I believe that GATK4 mutect no longer uses cosmic as an input. Is there somewhere else in the pipeline that's using it (I'm not seeing one)?
What's the source of the custom clinvar VCF? I think it's just an internal list of variants that the CLE wants extra coverage on, right? If that's true, then it's optional but we can also distribute it, if it's a) shareable and b) we think it's worthwhile.
Like cosmic, the panel of normals isn't used by the newer version of mutect anymore. Best I can tell, it's still listed as an input in several places, but never actually used. Making an issue for that.
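
For reference, the chromosome-name conversion of the GTF mentioned in the reply about the alias file could look roughly like the sketch below. It assumes the alias file is tab-separated with the source name in column 1 and the target name in column 2; that layout and direction are assumptions and should be checked against the actual chromAlias file. The function names are made up for illustration and this is not the script used to build the bundles.

```python
import csv


def load_alias_table(path):
    """Read a tab-separated alias file into a dict (assumed: source name, target name)."""
    aliases = {}
    with open(path, newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            # Skip blank lines, comment lines, and rows without two columns.
            if not row or row[0].startswith("#") or len(row) < 2:
                continue
            aliases[row[0]] = row[1]
    return aliases


def rename_gtf_chromosomes(lines, aliases):
    """Yield GTF lines with the first column (seqname) translated via `aliases`.

    Header/comment lines and sequence names with no alias pass through unchanged.
    """
    for line in lines:
        if line.startswith("#"):
            yield line
            continue
        fields = line.split("\t", 1)
        fields[0] = aliases.get(fields[0], fields[0])
        yield "\t".join(fields)
```

Passing unknown names through unchanged is a design choice for the sketch; a real build step might prefer to fail loudly on contigs that have no alias.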

tmooney · 2020-09-02T14:48:47Z

One option might be to distribute some partially filled-in input YAMLs with the bundle. If we do make versions of the pipelines that accept the bundles as a single input, I hope we also keep versions that don't. I'd like the flexibility to use existing files I have lying around various places without having to convert them into a bundle first.
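
As a rough illustration of how a partially filled-in template and existing files could coexist, the sketch below treats the bundle-provided values as defaults and lets any user-supplied entry win. The function names and the input names in the usage example are hypothetical.

```python
def merge_inputs(bundle_defaults, user_overrides):
    """Return a new inputs dict in which user-supplied entries override bundle defaults."""
    merged = dict(bundle_defaults)  # copy so the template itself is not modified
    merged.update(user_overrides)
    return merged


def unfilled(inputs, required):
    """Return the sorted names of required inputs that are absent or still set to None."""
    return sorted(name for name in required if inputs.get(name) is None)
```

For example, with a template of `{"reference": "/bundle/ref.fa", "target_intervals": None}` and an override of `{"reference": "/my/own/ref.fa"}`, the merged inputs use the user's reference, and `unfilled` still reports `target_intervals` as something to fill in.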

We already have a GCP bucket with the test input files from our repo, so that seems like a reasonable place to put it (as long as it isn't too expensive to host).

chrisamiller added this to the 2.0 milestone Aug 28, 2020

chrisamiller removed this from the 2.0 milestone Dec 8, 2020
