
pipelines_somatic_exome.cwl

Travis CI User edited this page Aug 8, 2020 · 31 revisions

Documentation for somatic_exome.cwl

This page is auto-generated. Do not edit.

Overview

somatic_exome: exome alignment and somatic variant detection

Introduction

somatic_exome is designed to process mutant/wildtype H. sapiens exome sequencing data. It features BQSR-corrected alignments, four-caller variant detection, and VEP-style annotations. Structural variants are detected via manta and cnvkit. In addition, QC metrics are computed, including somalier concordance metrics.

example input file = analysis_workflows/example_data/somatic_exome.yaml
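
The example input file is YAML; CWL runners generally also accept job files written as JSON. As a minimal sketch, the Python below assembles a partial job object from a few of the simpler inputs documented in the Inputs table and serializes it to JSON. The helper names (`cwl_file`, `build_partial_job`) are ours, every path and value is a hypothetical placeholder, and a real run also needs the remaining required inputs (sequence data, interval lists, VCF resources, and so on).

```python
import json


def cwl_file(path):
    """Return a CWL File object for a path (the standard job-file form)."""
    return {"class": "File", "path": path}


def build_partial_job():
    """Assemble a PARTIAL job object; every value here is a placeholder."""
    # Chromosome names must match the naming used in the reference fasta.
    chromosomes = [f"chr{n}" for n in range(1, 23)] + ["chrX", "chrY"]
    return {
        "reference": "/path/to/reference.fa",  # hypothetical path
        "tumor_name": "TUMOR",  # hypothetical sample label
        "normal_name": "NORMAL",  # hypothetical sample label
        "bqsr_intervals": chromosomes,
        "target_interval_padding": 100,  # illustrative value, not a recommendation
        "mills": cwl_file("/path/to/mills.vcf.gz"),  # hypothetical path
        "vep_ensembl_assembly": "GRCh38",
        "vep_ensembl_version": "95",
        "vep_ensembl_species": "homo_sapiens",
    }


if __name__ == "__main__":
    # Print the partial job so it can be inspected or redirected to a file.
    print(json.dumps(build_partial_job(), indent=2))
```

Building the job object in code makes it easy to generate repetitive values such as the per-chromosome `bqsr_intervals` list, but a hand-written YAML file like the example above works equally well.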

Inputs

Name Label Description Type Secondary Files
reference reference: Reference fasta file for a desired assembly reference contains the nucleotide sequence for a given assembly (GRCh37, GRCh38, etc.) in fasta format for the entire genome. This is what reads will be aligned to. Appropriate files can be found on Ensembl at https://ensembl.org/info/data/ftp/index.html When providing the reference, the secondary files corresponding to reference indices must be located in the same directory as the reference itself. These files can be created with samtools faidx, bwa index, and picard CreateSequenceDictionary. ['string', 'File'] ['.fai', '^.dict', '.amb', '.ann', '.bwt', '.pac', '.sa']
tumor_sequence tumor_sequence: yml file specifying the location of MT sequencing data tumor_sequence is a yml file used to pass information about the sequencing data for a single sample (i.e. fastq files). If more than one fastq file exists for a sample, as in the case of multiple instrument data, the sequence tag is simply repeated with the additional data (see example input file). Note that in the @RG field, ID and SM are required. ../types/sequence_data.yml#sequence_data[]
tumor_name tumor_name: String specifying the name of the MT sample tumor_name provides a string for what the MT sample will be referred to in the various outputs, for example the VCF files. string?
normal_sequence normal_sequence: yml file specifying the location of WT sequencing data normal_sequence is a yml file used to pass information about the sequencing data for a single sample (i.e. fastq files). If more than one fastq file exists for a sample, as in the case of multiple instrument data, the sequence tag is simply repeated with the additional data (see example input file). Note that in the @RG field, ID and SM are required. ../types/sequence_data.yml#sequence_data[]
normal_name normal_name: String specifying the name of the WT sample normal_name provides a string for what the WT sample will be referred to in the various outputs, for example the VCF files. string?
trimming ['../types/trimming_options.yml#trimming_options', 'null']
mills mills: File specifying common polymorphic indels from Mills et al. mills provides known polymorphic indels recommended by GATK for a variety of tools, including the BaseRecalibrator. This file is part of the GATK resource bundle available at http://www.broadinstitute.org/gatk/guide/article?id=1213 Essentially, it is a list of known indels originally discovered by Mills et al. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1557762/ The file should be in vcf format and tabix indexed. File ['.tbi']
known_indels known_indels: File specifying common polymorphic indels from 1000G known_indels provides known indels recommended by GATK for a variety of tools, including the BaseRecalibrator. This file is part of the GATK resource bundle available at http://www.broadinstitute.org/gatk/guide/article?id=1213 Essentially, it is a list of known indels from 1000 Genomes Phase I indel calls. The file should be in vcf format and tabix indexed. File ['.tbi']
dbsnp_vcf dbsnp_vcf: File specifying common polymorphic indels from dbSNP dbsnp_vcf provides known indels recommended by GATK for a variety of tools, including the BaseRecalibrator. This file is part of the GATK resource bundle available at http://www.broadinstitute.org/gatk/guide/article?id=1213 Essentially, it is a list of known indels from dbSNP. The file should be in vcf format and tabix indexed. File ['.tbi']
bqsr_intervals bqsr_intervals: Array of strings specifying regions for base quality score recalibration bqsr_intervals provides an array of genomic intervals over which to apply GATK base quality score recalibration. Typically, intervals are given as entire chromosomes (e.g. chr1, chr2, etc.); these names should match the format in the reference file. string[]
bait_intervals bait_intervals: interval_list file of baits used in the sequencing experiment bait_intervals is an interval_list corresponding to the baits used in the sequencing reagent. These are essentially coordinates for regions you were able to design probes for in the reagent. Typically the reagent provider has this information available in bed format, and it can be converted to an interval_list with Picard BedToIntervalList. AstraZeneca also maintains a repo of baits for common sequencing reagents, available at https://github.com/AstraZeneca-NGS/reference_data File
target_intervals target_intervals: interval_list file of targets used in the sequencing experiment target_intervals is an interval_list corresponding to the targets for the capture reagent. Bed files with this information can be converted to interval_lists with Picard BedToIntervalList. In general, for an exome reagent, bait_intervals and target_intervals are the same. File
target_interval_padding target_interval_padding: number of bp flanking each target region in which to allow variant calls The effective coverage of capture products generally extends out beyond the actual regions targeted. This parameter allows variants to be called in these wingspan regions, extending this many base pairs from each side of the target regions. int
per_base_intervals per_base_intervals: additional intervals over which to summarize coverage/QC at a per-base resolution per_base_intervals is a list of regions (in interval_list format) over which to summarize coverage/QC at a per-base resolution. ../types/labelled_file.yml#labelled_file[]
per_target_intervals per_target_intervals: additional intervals over which to summarize coverage/QC at a per-target resolution per_target_intervals is a list of regions (in interval_list format) over which to summarize coverage/QC at a per-target resolution. ../types/labelled_file.yml#labelled_file[]
summary_intervals ../types/labelled_file.yml#labelled_file[]
omni_vcf File ['.tbi']
picard_metric_accumulation_level string
qc_minimum_mapping_quality int?
qc_minimum_base_quality int?
cosmic_vcf File? ['.tbi']
panel_of_normals_vcf File? ['.tbi']
strelka_cpu_reserved int?
mutect_scatter_count int
mutect_artifact_detection_mode boolean
mutect_max_alt_allele_in_normal_fraction float?
mutect_max_alt_alleles_in_normal_count int?
varscan_strand_filter int?
varscan_min_coverage int?
varscan_min_var_freq float?
varscan_p_value float?
varscan_max_normal_freq float?
pindel_insert_size int
docm_vcf The set of alleles that gatk haplotype caller will use to force-call regardless of evidence File ['.tbi']
filter_docm_variants boolean?
vep_cache_dir path to the vep cache directory, available at: https://useast.ensembl.org/info/docs/tools/vep/script/vep_cache.html#pre ['string', 'Directory']
vep_ensembl_assembly genome assembly to use in vep. Examples: GRCh38 or GRCm38 string
vep_ensembl_version ensembl version - Must be present in the cache directory. Example: 95 string
vep_ensembl_species ensembl species - Must be present in the cache directory. Examples: homo_sapiens or mus_musculus string
synonyms_file synonyms_file allows the use of different chromosome identifiers in vep inputs or annotation files (cache, database, GFF, custom file, fasta). File should be tab-delimited with the primary identifier in column 1 and the synonym in column 2. File?
annotate_coding_only if set to true, vep only returns consequences that fall in the coding regions of transcripts boolean?
vep_pick configures how vep will annotate genomic features that each variant overlaps; for a detailed description of each option see https://useast.ensembl.org/info/docs/tools/vep/script/vep_other.html#pick_allele_gene_eg ['null', {'type': 'enum', 'symbols': ['pick', 'flag_pick', 'pick_allele', 'per_gene', 'pick_allele_gene', 'flag_pick_allele', 'flag_pick_allele_gene']}]
cle_vcf_filter boolean
variants_to_table_fields The names of one or more standard VCF fields or INFO fields to include in the output table string[]
variants_to_table_genotype_fields The name of a genotype field to include in the output table string[]
vep_to_table_fields VEP fields in final output string[]
vep_custom_annotations custom type, check types directory for input format ../types/vep_custom_annotation.yml#vep_custom_annotation[]
manta_call_regions bgzip-compressed, tabix-indexed BED file specifying regions to which manta structural variant analysis is limited File? ['.tbi']
manta_non_wgs toggles on or off manta settings for WES vs. WGS mode for structural variant detection boolean?
manta_output_contigs if set to true configures manta to output assembled contig sequences in the final VCF files boolean?
somalier_vcf a vcf file of known polymorphic sites for somalier to compare normal and tumor samples for identity; sites files can be found at: https://github.com/brentp/somalier/releases File
tumor_sample_name string
normal_sample_name string
known_variants Previously discovered variants to be flagged in this pipeline's output vcf File? ['.tbi']
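
The Secondary Files column above uses CWL secondaryFiles patterns: a plain suffix such as `.tbi` or `.fai` is appended to the primary file name, while each leading caret (as in `^.dict`) first strips one extension from it. The sketch below resolves those patterns and reports which expected companion files are absent next to a primary file. The function names are ours, not part of the workflow, and the check is only a pre-flight convenience under the assumption that inputs live on a local filesystem.

```python
import os

# Patterns copied from the `reference` row of the Inputs table.
REFERENCE_PATTERNS = [".fai", "^.dict", ".amb", ".ann", ".bwt", ".pac", ".sa"]


def secondary_path(primary, pattern):
    """Resolve one CWL secondaryFiles pattern against a primary file path."""
    directory, name = os.path.split(primary)
    # Each leading caret removes the last extension, if there is one.
    while pattern.startswith("^"):
        pattern = pattern[1:]
        stem, ext = os.path.splitext(name)
        if ext:
            name = stem
    # The remainder of the pattern is appended to what is left of the name.
    return os.path.join(directory, name + pattern)


def missing_secondary_files(primary, patterns):
    """Return the expected secondary file paths that do not exist on disk."""
    expected = [secondary_path(primary, p) for p in patterns]
    return [path for path in expected if not os.path.exists(path)]


if __name__ == "__main__":
    # Hypothetical reference location; replace with a real path before use.
    for path in missing_secondary_files("/path/to/reference.fa", REFERENCE_PATTERNS):
        print("missing:", path)
```

For a reference named `reference.fa`, this expects `reference.fa.fai` and `reference.dict` in the same directory, which matches the requirement that the index files sit beside the reference itself.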

Outputs

Name Label Description Type Secondary Files
tumor_cram File
tumor_mark_duplicates_metrics File
tumor_insert_size_metrics File
tumor_alignment_summary_metrics File
tumor_hs_metrics File
tumor_per_target_coverage_metrics File[]
tumor_per_target_hs_metrics File[]
tumor_per_base_coverage_metrics File[]
tumor_per_base_hs_metrics File[]
tumor_summary_hs_metrics File[]
tumor_flagstats File
tumor_verify_bam_id_metrics File
tumor_verify_bam_id_depth File
normal_cram File
normal_mark_duplicates_metrics File
normal_insert_size_metrics File
normal_alignment_summary_metrics File
normal_hs_metrics File
normal_per_target_coverage_metrics File[]
normal_per_target_hs_metrics File[]
normal_per_base_coverage_metrics File[]
normal_per_base_hs_metrics File[]
normal_summary_hs_metrics File[]
normal_flagstats File
normal_verify_bam_id_metrics File
normal_verify_bam_id_depth File
mutect_unfiltered_vcf File ['.tbi']
mutect_filtered_vcf File ['.tbi']
strelka_unfiltered_vcf File ['.tbi']
strelka_filtered_vcf File ['.tbi']
varscan_unfiltered_vcf File ['.tbi']
varscan_filtered_vcf File ['.tbi']
pindel_unfiltered_vcf File ['.tbi']
pindel_filtered_vcf File ['.tbi']
docm_filtered_vcf File ['.tbi']
final_vcf File ['.tbi']
final_filtered_vcf File ['.tbi']
final_tsv File
vep_summary File
tumor_snv_bam_readcount_tsv File
tumor_indel_bam_readcount_tsv File
normal_snv_bam_readcount_tsv File
normal_indel_bam_readcount_tsv File
intervals_antitarget File?
intervals_target File?
normal_antitarget_coverage File
normal_target_coverage File
reference_coverage File?
cn_diagram File?
cn_scatter_plot File?
tumor_antitarget_coverage File
tumor_target_coverage File
tumor_bin_level_ratios File
tumor_segmented_ratios File
diploid_variants File? ['.tbi']
somatic_variants File? ['.tbi']
all_candidates File ['.tbi']
small_candidates File ['.tbi']
tumor_only_variants File? ['.tbi']
somalier_concordance_metrics File
somalier_concordance_statistics File

Steps

Name CWL Run
tumor_alignment_and_qc pipelines/alignment_exome.cwl
normal_alignment_and_qc pipelines/alignment_exome.cwl
concordance tools/concordance.cwl
pad_target_intervals tools/interval_list_expand.cwl
detect_variants pipelines/detect_variants.cwl
cnvkit tools/cnvkit_batch.cwl
manta tools/manta_somatic.cwl
tumor_bam_to_cram tools/bam_to_cram.cwl
tumor_index_cram tools/index_cram.cwl
normal_bam_to_cram tools/bam_to_cram.cwl
normal_index_cram tools/index_cram.cwl