Compile inputs into "annotation bundles" that are downloadable #938

Open
chrisamiller opened this issue Aug 28, 2020 · 5 comments

Comments

@chrisamiller
Collaborator

Initially, just for common species (human, mouse), with updates timed to coincide with version releases.

@chrisamiller chrisamiller added this to the 2.0 milestone Aug 28, 2020
@chrisamiller
Collaborator Author

I've created annotation bundles containing the inputs outlined on the Gathering Input Files wiki page. They run between 50 and 100 GB unzipped, and I don't think there's much room to reduce the size.

/gscmnt/gc2761/aml_ppg/tmp/annotation_bundles/GRCh38_ensembl95
/gscmnt/gc2761/aml_ppg/tmp/annotation_bundles/GRCm38_ensembl101
/gscmnt/gc2761/aml_ppg/tmp/annotation_bundles/GRCm38_ensembl95

At this point, I'd like to solicit feedback on

  1. What's missing?
  • Cellranger references are behind a TOS so we can't include them, but we will include instructions for adding them
  • Is it worth including a "standard" exome target region based on the ensembl GTF?
  • What else?

  2. Whether it is desirable to provide this bundle as a single input for the pipelines (or make some sort of wrapper script that can fill in the appropriate yaml fields after being pointed to the bundle?)

  3. How we can distribute these: GCP bucket? Locally hosted HTTP/FTP server?
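
To make the wrapper-script idea concrete, here is a minimal sketch of what it might look like. The input names and the relative paths inside the bundle are hypothetical placeholders, not the real bundle layout, and the YAML is written by hand to avoid depending on a YAML library.

```python
import json
from pathlib import Path

# Hypothetical mapping from pipeline input names to paths inside a bundle.
# Neither the names nor the relative paths come from the real bundles;
# adjust both to match the actual layout.
BUNDLE_LAYOUT = {
    "reference": "reference_genome/all_sequences.fa",
    "vep_cache_dir": "vep_cache",
    "star_fusion_genome_dir": "aligner_indices/star_fusion",
}


def bundle_to_inputs(bundle_dir):
    """Return {input_name: absolute_path} for every entry present in the bundle."""
    root = Path(bundle_dir).resolve()
    inputs = {}
    for name, rel in BUNDLE_LAYOUT.items():
        path = root / rel
        if path.exists():  # silently skip anything the bundle does not ship
            inputs[name] = path
    return inputs


def render_yaml(inputs):
    """Render the inputs as CWL-style File/Directory entries in YAML."""
    lines = []
    for name, path in sorted(inputs.items()):
        cls = "Directory" if path.is_dir() else "File"
        lines.append(f"{name}:")
        lines.append(f"  class: {cls}")
        # json.dumps yields a double-quoted string, which is also valid YAML
        lines.append(f"  path: {json.dumps(str(path))}")
    return "\n".join(lines) + "\n"
```

Calling `render_yaml(bundle_to_inputs("/path/to/bundle"))` would give a partial inputs file that a user can then complete with their sample-specific fields.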

@malachig
Collaborator

malachig commented Sep 2, 2020

This is awesome.

A couple thoughts:

  1. Under Aligner indices it says "HISAT Index (for older pipeline versions only)". This is because STAR is used now, I assume. But then the other RNA aligner index is labelled as STAR-fusion index. Is this specific to fusion calling, or does it work generally with the STAR aligner? Maybe clarify that?
  2. I don't see any mention of the alt contigs file for bwa-mem? Since this can lead to great confusion, perhaps it is worth special mention?
  3. Isn't there a file that helps convert between chromosome names that is still used by some tools in the pipeline? e.g. /storage1/fs1/bga/Active/gmsroot/gc2560/core/model_data/2887491634/build50f99e75d14340ffb5b7d21b03887637/chromAlias.ensembl.txt
  4. My recent immuno.cwl test used this file: /storage1/fs1/bga/Active/gmsroot/gc2560/core/GRC-human-build38_human_95_38_U2AF1_fix/rna_seq_annotation/ensembl95.transcriptToGene.tsv. Is this deprecated now, perhaps?
  5. Similarly, cosmic_vcf: /storage1/fs1/bga/Active/gmsroot/gc2560/core/build_merged_alignments/detect-variants--linus2112.gsc.wustl.edu-jwalker-774-52fa2dadc4ba4b6490abe0701907d394/snvs.hq.vcf.gz?
  6. Similarly, custom_clinvar_vcf: /storage1/fs1/bga/Active/gmsroot/gc2560/core/custom_clinvar_vcf/v20181028/custom.vcf.gz?
  7. Similarly, panel_of_normals_vcf: /storage1/fs1/bga/Active/gmsroot/gc2560/core/build_merged_alignments/detect-variants--linus2112.gsc.wustl.edu-jwalker-2237-db881b860992443da9d6aac8b36a7ea6/snvs.hq.vcf.gz? This one, I am not sure what it is used for...

@malachig
Collaborator

malachig commented Sep 2, 2020

For distributing them, I think a Google and/or AWS S3 bucket would be appropriate. Generally if we do a file server, we will have to do that with a cloud server anyway. And the volume attached will be active disk and cost more per month than just placing them in a bucket.

We have been using genomedata.org for this kind of thing for a while now but it is surprising how much the cost adds up to keep it up and maintain a backup.

@chrisamiller
Collaborator Author

Great feedback - thanks!

  1. Yeah, it's my understanding that star fusion contains the star index plus some other stuff, and that it's all that's needed (@sridhar0605 - can you confirm this?)

  2. ooh yeah - we need the alt contigs file in there. Good call

  3. that file is included in the VEP cache, and we do make use of it when building the bundle, in order to convert the gtf. See https://github.com/genome/analysis-workflows/wiki/Gathering-input-files#genestranscripts-gtf-

  4. Each one of them has that ensembl95.transcriptToGene.tsv equivalent in the rna_seq_annotation directory

  5. I believe that GATK4 mutect no longer uses cosmic as an input. Is there somewhere else in the pipeline that's using it (I'm not seeing one)?

  6. What's the source of the custom clinvar VCF? I think it's just an internal list of variants that the CLE wants extra coverage on, right? If that's true, then it's optional but we can also distribute it, if it's a) shareable and b) we think it's worthwhile.

  7. Like cosmic, the panel of normals isn't used by the newer version of mutect anymore. As best I can tell, it's still listed as an input in several places, but never actually used. Making an issue for that.
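
On point 3, the chromosome-name conversion of the GTF can be sketched roughly as below. This is only an illustration: it assumes the alias file is tab-separated with the name found in the GTF in column 1 and the desired name in column 2, which may not match the actual layout of chromAlias.ensembl.txt, and the function names are made up.

```python
def load_alias_table(lines):
    """Build {name_in_gtf: desired_name} from tab-separated lines.

    Assumed layout (not verified against the real file): column 1 is the
    name to replace, column 2 is the replacement. '#' lines are skipped.
    """
    table = {}
    for line in lines:
        row = line.rstrip("\n").split("\t")
        if len(row) < 2 or row[0].startswith("#"):
            continue
        table[row[0]] = row[1]
    return table


def rename_gtf_chromosomes(gtf_lines, table):
    """Yield GTF lines with column 1 renamed; unknown names pass through."""
    for line in gtf_lines:
        if line.startswith("#"):  # header/comment lines are left untouched
            yield line
            continue
        chrom, sep, rest = line.partition("\t")
        yield table.get(chrom, chrom) + sep + rest
```

Both functions take any iterable of lines, so they can be fed open file handles directly and the GTF never has to be loaded into memory in one piece.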

@tmooney
Member

tmooney commented Sep 2, 2020

One option might be to distribute some partially filled-in input YAMLs with the bundle. If we do make versions of the pipelines that accept the bundles as a single input, I hope we also keep versions that don't. I'd like the flexibility to use existing files I have lying around various places without having to convert them into a bundle first.
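
As a sketch of how bundle defaults and existing files could coexist, something like the following would let a partially filled-in YAML supply defaults while user-provided paths take precedence. The function names and the input names in the usage note are hypothetical.

```python
def merge_inputs(bundle_defaults, overrides):
    """Combine bundle-provided inputs with user-supplied ones; overrides win.

    Entries in `overrides` set to None are ignored, so an empty field in a
    user's file does not wipe out the bundle default.
    """
    merged = dict(bundle_defaults)
    merged.update({k: v for k, v in overrides.items() if v is not None})
    return merged


def missing_inputs(merged, required):
    """List required input names that neither source filled in."""
    return sorted(name for name in required if name not in merged)
```

For example, `merge_inputs({"reference": "/bundle/ref.fa"}, {"reference": "/my/ref.fa"})` keeps the user's own reference, and `missing_inputs` can flag any sample-specific fields that still need to be filled in before a run.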

We already have a GCP bucket with the test input files from our repo, so that seems like a reasonable place to put it (as long as it isn't too expensive to host).

@chrisamiller chrisamiller removed this from the 2.0 milestone Dec 8, 2020
3 participants