A Snakemake workflow for DNM (de novo mutation) calling.
- Snakemake workflow: TriosCompass
- Overview
- User's guides
- I. Installation
- II. Inputs
- III. Configure files
- V. Run TriosCompass
- IV. Outputs
TriosCompass consists of four functional components:
- Call DNMs (de novo mutations) using DeepVariant and GATK HaplotypeCaller
- Phase DNMs using whatshap
- Call dnSTR (de novo simple tandem repeats) using HipSTR and MonSTR
- Call dnSV (de novo structural variants) using Manta, GraphType2 and smoove
The overall workflow diagram of TriosCompass is as below:
All required bioinformatics tools are wrapped as conda, container and etc, so TriosCompass is portable and easy to be deployed.
Nevertheless, there are still some basic dependencies required to start any Snakemake workflow (e.g., conda and python), which have been specified in environment.yaml. Users can create a new conda env for TriosCompassV2:
mamba env create -f environment.yaml
conda activate TriosCompassV2
Besides, singularity needs to be installed globally, and the details of the TriosCompass installation can be found below.
The DNM candidates are jointly called by both DeepVariant (DV) and GATK HaplotypeCaller (HC) in the specified callable regions, then filtered using sliver:
"denovo:( \
( \
(variant.CHROM == 'chrX' && kid.sex=='male') && \
kid.hom_alt && kid.AB > 0.98 \
) || \
( \
(!(variant.CHROM == 'chrX' && kid.sex=='male')) && \
kid.het && kid.AB > 0.25 && kid.AB < 0.75 \
) \
) && (kid.AD[0]+kid.AD[1]) >= {params.min_dp}/(1+(variant.CHROM == 'chrX' && kid.sex == 'male' ? 1 : 0)) && \
mom.hom_ref && dad.hom_ref \
&& (mom.AD[1] + dad.AD[1]) <= 5 \
&& kid.GQ >= {params.min_gq} && mom.GQ >= {params.min_gq} && dad.GQ >= {params.min_gq} \
&& (mom.AD[0]+mom.AD[1]) >= {params.min_dp} && (dad.AD[0]+dad.AD[1]) >= {params.min_dp}/(1+(variant.CHROM == 'chrX' ? 1 : 0))"
- (!(variant.CHROM == 'chrX' && kid.sex=='male')) && kid.het && kid.AB > 0.25 && kid.AB < 0.75
- In the normal cases, select variants with genotype of "0/1" and allele balance in the range of 0.25 and 0.75.
- (variant.CHROM == 'chrX' && kid.sex=='male') && alt && kid.AB > 0.98
- In the special case (that is, variant in chrX in the male offspring), select variants with the genotype of "1/1" and allele balance > 0.98.
- (kid.AD[0]+kid.AD[1]) >= {params.min_dp}/(1+(variant.CHROM == 'chrX' && kid.sex == 'male' ? 1 : 0))
- The kid's variant should have depth over the minimal read depth {params.min_dp}. If the variant is on chrX and the kid's gender is male, the minimum read depth requirement is reduced to its half.
- (mom.AD[0]+mom.AD[1]) >= {params.min_dp} && (dad.AD[0]+dad.AD[1]) >= {params.min_dp}/(1+(variant.CHROM == 'chrX' ? 1 : 0))
- Similar read depth requirement is also applied for the parents.
- mom.hom_ref && dad.hom_ref
- The genotypes of the parents should both be "0/0";
- (mom.AD[1] + dad.AD[1]) <= 5
- The total read support of the alternative allele (that is, the DNM) should be not bigger than 5 in parents.
- kid.GQ >= {params.min_gq} && mom.GQ >= {params.min_gq} && dad.GQ >= {params.min_gq}
- GQ score of the variants should be no less than {params.min_gq} in all the members of the trio.
π In the workflow, the parameters {params.min_dp} and {params.min_gq} can be configured via config/config.yaml, so are the callable regions:
-
config/config.yaml
call_dnm: interval: "ref/hg38.wgs_interval.bed" # DNM callable regions dv: min_gq: 3 min_dp: 20 hc: min_gq: 20 min_dp: 30
We identified parental origin of DNMs using WhatsHap.
There are two running modes in WhatsHap: individual and pedigree. In the individual mode, variants are phased into haplotype blocks by WhatsHap, without parental origin information. Pedigree mode is ideal to identify parental origin of variants in the child. However, WhatsHap does not phase DNMs in the pedigree mode, as DNMs do not follow mendelian inheritance. Therefore, phasing DNMs is not supported by WhatsHap at present (see this issue for the details). Nevertheless, a DNM could be phased with the other informative germline variants into the same haplotype block in the individual mode, and thereby all variants in the block share the same parental origin. Therefore, we managed to run WhatsHap in both individual mode and pedigree mode, and extract parental original from the two outputs.
We first explored to run WhatsHap on the whole genome using the workflow Snakefile_whatshap_WG. It turns out to be very slow: it took up to 7 days to process one trio family, and 2-3 days for each trio in average. Finally, we chose to phase a 10 Kb window around each DNM.
We developed a Perl script to identify the parental origin, and the aggregated results are output to the file: output/phase_DNMs/{Trios_ID}.parental_origin.tab. The output file is a tab delimited text file, with 9 columns including variant id of DNM, parental origin prediction.
The perl script extracts the haplotype block containing the DNMs from the output of WhatsHap phasing child (i.e., the individual mode), and collects the phased variants of child in the block. The script also repeats the processing on the output of WhatsHap phasing trios. The phase prediction from the latter is given as F|M, i.e, the first allele is the one inherited from the father (F) and the other in inherited from the mother (M). However, the parental origin prediction from the individual mode is not certain, that is, it could be F|M or M|F. Ideally, the phases of the variants in each haplotype block from WhatsHap phasing child only should be either all identical (FM count) or all opposite (MF count) to those from WhatsHap phasing trios. In such cases, we can certainly identify the parental origin of the block and so is that of the DNM in the same block. We simply assign parental origin as Not Determined (ND) if there is inconsistency of FM statuses among the variants in the same block (i.e., both FM count > 0 and MF count > 0).
- Example of the output of extract_parental_origin.pl
VariantID | Parental Origin | Phase | Haplotype Block Size | # Informative Sites | FM count | MF count | FM status | Parent origin change |
---|---|---|---|---|---|---|---|---|
chr1:208903875:G:A | ND | 1|0 | 36 | 0 | 25 | 11 | ND | ND=>ND |
chr2:12194398:A:C | paternal | 1|0 | 2 | 2 | 2 | 0 | FM | paternal=>paternal |
chr2:14986429:A:T | paternal | 1|0 | 5 | 5 | 5 | 0 | FM | paternal=>paternal |
chr2:50677044:C:T | ND | 0/1 | 0 | 0 | 0 | 0 | ND | ND=>ND |
chr2:139840973:T:A | paternal | 0|1 | 14 | 14 | 0 | 14 | MF | paternal=>paternal |
chr2:154745781:T:C | ND | 0/1 | 0 | 0 | 0 | 0 | ND | ND=>ND |
chr2:203705880:T:TTCTTTC | ND | 0/1 | 0 | 0 | 0 | 0 | ND | ND=>ND |
chr2:220449146:A:G | ND | 0/1 | 0 | 0 | 0 | 0 | ND | ND=>ND |
chr3:14911256:G:C | ND | 0/1 | 0 | 0 | 0 | 0 | ND | ND=>ND |
chr3:76410708:A:C | ND | 0/1 | 0 | 0 | 0 | 0 | ND | ND=>ND |
chr3:185020246:C:T | ND | 0/1 | 0 | 0 | 0 | 0 | ND | ND=>ND |
chr3:194728646:G:GTGTGTGTGTGTGTC | maternal | 0|1 | 6 | 6 | 6 | 0 | FM | maternal=>maternal |
chr3:197879976:C:T | paternal | 0|1 | 3 | 2 | 0 | 1 | MF | paternal=>paternal |
chr3:197879978:C:A | paternal | 0|1 | 3 | 2 | 0 | 1 | MF | paternal=>paternal |
chr4:29657055:GC:G | ND | 0/1 | 0 | 0 | 0 | 0 | ND | ND=>ND |
π For most users, the first two columns provide essential information about parental origins of the predicted DNMs.
- Descriptions for each output column.
Column position | Column Header | Description | Notes |
---|---|---|---|
1 | VariantID | Variant ID of DNM | |
2 | Parental Origin | Predicted parental origin | ND (not determined) |
3 | Phase | Phased genotype of DNMs (from phasing child) | Parental origin cannot be determined from the output from phasing child only. |
4 | Haplotype Block Size | Size of the haplotype block (from phasing child) | Prediction from a large block is more reliable. |
5 | # Informative Sites | # informative sites in the haplotype block | Prediction is not reliable if # informative sites is 0. |
6 | FM count | # phased variants supporting F|M genotype | |
7 | MF count | # phased variants supporting M|F genotype | |
8 | FM status | FM (for F|M), MF (for M|F), ND (not determined) | FM status is ND if both FM and MF counts are bigger than 0. |
9 | Parental origin change | Prediction (wo phasing trios) => Prediction(wi phasing trios) | Generally, the prediction with phasing child only is consistent with that with both phasing child and phasing trios. |
- config.yaml setting for phasing
phasing: window_size: 10000 # the perl script path relative to the working directory perl_cmd: "perl TriosCompass_v2/workflow/scripts/extract_parental_origin.pl "
We had explored to call dnSTR candidates jointly by HipSTR and GangSTR. During the manual curation, we found that the joint predicts are good but much fewer than expected, and the HipSTR outperforms GangSTR in our manual evaluation. Therefore, we gave up GangSTR, and employed HipSTR/MonSTR in the dnSTR prediction.
At present, our dnSTR calling process is as below:
- The STR reference panel is split into N chunks.
- For each chunk: a. STRs of all samples (from different trios) are jointly genotyped by HipSTR. b. Then, to be filtered by dumpSTR. c. MonSTR is applied to call dnSTRs.
- Chunks of MonSTR is merged and then filtered further.
Besides, VizAln (from the HipSTR package) is employed to generated visualization in HTML format.
Below is the configure setting for dnSTR in config/config.yaml:
dnSTR:
# split bed into chunks to speed up dnSTR call
split_n: 400
dup_reg: "ref/STR/GRCh38GenomicSuperDup.bed.gz" # come with GRCh38GenomicSuperDup.bed.gz.tbi
hipstr:
enable: True
ref_panel: "ref/STR/hg38_ver13.hipstr_9.bed"
monstr_filter: " --min-span-coverage 3 --min-supp-reads 3 "
dumpstr_call_args: >
--hipstr-min-call-DP 15
--hipstr-max-call-DP 1000
--hipstr-min-call-Q 0.9 --drop-filtered
--vcftype hipstr --hipstr-min-supp-reads 1
--hipstr-max-call-flank-indel 0.15
--hipstr-max-call-stutter 0.15
# gangstr:
# enable: False
# ref_panel: "ref/STR/hg38_ver13.le9.bed"
# filter: " --max-perc-encl-parent 0.05 --min-encl-match 0.9 --min-total-encl 10 --gangstr "
dnSVs are predicted jointly by two approaches in TriosCompass:
- smoove/lumpy-sv
- manta + GraphType2
π Please note:
-
Insertion is marked as translocation (i.e., "BND") in smoove/lumpy-sv.
-
Bam files (not cram) are recognized by lumpy-sv.
-
Sample order in the ped file matters.
By default, we had built ped files in the order as: father, mother, and child. Accordingly, the same order is remained in the jointly genotyped SV VCF file. Therefore, we used the bcftools command to identify dnSV as below:
bcftools view -i 'GT[2]="het" && GT[1]="RR" && GT[0]="RR" ' -O v {input} -o {output}
- Install singularity globally.
- Install conda and mamba.
- Create the workspace directory $WORKSPACE and change directory to $WORKSPACE.
- Git clone the TriosCompass_v2 repository.
git clone https://github.com/NCI-CGR/TriosCompass_v2.git
- Create new conda environment to run Snakemake workflow
mamba env create -f TriosCompass_v2/environment.yaml
Folder structure of your workspace is as follows:
$WORKSPACE
βββ TriosCompass_v2
User can put the reference genome any location under the folder $WORKSPACE and specify its relative location at config/config.yaml.
For instance, we may put the hg38 human genome Homo_sapiens_assembly38.fasta under the folder $WORKSPACE/ref/.
-
config/config.yaml
ref: sequence: "ref/Homo_sapiens_assembly38.fasta" build: "hg38"
The interval file (in bed format) is to specify regions to call DNMs, which should be matched with the reference genome. For instance, we used the file hg38.wgs_interval.bed for the reference genome hg38, which was converted from resources_broad_hg38_v0_wgs_calling_regions.hg38.interval_list. The latter is part of resource bundle hosted by the Broad Institute. There are severals way to retrieve the interval file, for example, ftp://[email protected]/bundle/hg38/wgs_calling_regions.hg38.interval_list.
### Download calling region of hg38 from the Broad Institute
wget https://storage.googleapis.com/genomics-public-data/resources/broad/hg38/v0/wgs_calling_regions.hg38.interval_list -O ref/resources_broad_hg38_v0_wgs_calling_regions.hg38.interval_list
### Then, convert the interval list to bed file using Picard
module load picard
java -jar $PICARDJAR IntervalListToBed -I ref/resources_broad_hg38_v0_wgs_calling_regions.hg38.interval_list -O ref/hg38.wgs_interval.bed
Users can specify ref/hg38.wgs_interval.bed in config/config.yaml
call_dnm:
interval: "ref/hg38.wgs_interval.bed"
We had developed joint call of dnSTR using both GangSTR and HipSTR, so we had prepared the same STR reference panels for both of the two programs.
Both GangSTR and HipSTR use STRs specified in the STR reference panel, but with the different formats. One paper suggested that there is good consistence between HipSTR and GangSTR.
Oketch, J. W., Wain, L. V, Hollox, E. J., & Hollox, E. (2022). A comparison of software for analysis of rare and common short tandem repeat (STR) variation using human genome sequences from clinical and population-based samples. 1β22. https://doi.org/10.1101/2022.05.25.493473
We would like to have the STR genotypes predicted jointly by both of the two program, therefore we need create one common reference panel file for both of GangSTR and HipSTR.
### Get the GangSTR reference panel
wget https://s3.amazonaws.com/gangstr/hg38/genomewide/hg38_ver13.bed.gz -O STR/hg38_ver13.bed.gz
gzip -d STR/hg38_ver13.bed.gz
### reformat for HipSTR
awk -v OFS='\t' '{print $1,$2,$3,$4,($3-$2+1)/$4,"GangSTR_STR_"NR,$5}' STR/hg38_ver13.bed > STR/hg38_ver13.hipstr.bed
### HipSTR cannot take STR with unit length > 9 bp
awk -v OFS='\t' '{if($4<=9) print $0}' STR/hg38_ver13.hipstr.bed > STR/hg38_ver13.hipstr_9.bed
### We put the same length restrict to hg38_ver13.bed.gz, so that both of the two reference panels are consistent.
awk -v OFS='\t' '{if($4<=9) print $0}' STR/hg38_ver13.bed > STR/hg38_ver13.le9.bed
π
- After manual curation, we chose to use HipSTR only for the dnSTR prediction.
- The STR reference panel is specified in config/config.yaml
- ref_panel: "ref/STR/hg38_ver13.hipstr_9.bed"
For the human genome hg38, the file GRCh38GenomicSuperDup.bed was obtained using the UCSC Table Browser (hg38.genomicSuperDups table). It needs to be further sorted, compressed and indexed as shown below:
bedtools sort -i GRCh38GenomicSuperDup.bed | bgzip -c > GRCh38GenomicSuperDup.bed.gz
tabix -p bed GRCh38GenomicSuperDup.bed.gz
-
The setting can also be customized in config/config.yaml
dnSTR: # split bed into chunks to speed up dnSTR call split_n: 400 dup_reg: "ref/STR/GRCh38GenomicSuperDup.bed.gz" # come with GRCh38GenomicSuperDup.bed.gz.tbi
We used exclude.cnvnator_100bp.GRCh38.20170403.bed for regions to be excluded in dnSV calling, as suggested at brentp/smoove:
wget https://raw.githubusercontent.com/hall-lab/speedseq/master/annotations/exclude.cnvnator_100bp.GRCh38.20170403.bed -O ref/exclude.cnvnator_100bp.GRCh38.20170403.bed
-
Settings in config.yaml
dnSV: enable: True exclude_bed: "ref/exclude.cnvnator_100bp.GRCh38.20170403.bed"
After the installation of the above resource files required by TriosCompass to ref/, users may have the folder ref/ under $WORKSPACE as below:
$WORKSPACE/ref
βββ exclude.cnvnator_100bp.GRCh38.20170403.bed
βββ hg38.wgs_interval.bed
βββ Homo_sapiens_assembly38.fasta
βββ STR
βββ GRCh38GenomicSuperDup.bed.gz
βββ GRCh38GenomicSuperDup.bed.gz.tbi
βββ hg38_ver13.bed
βββ hg38_ver13.hipstr_9.bed
βββ hg38_ver13.le9.bed
A resource bundle for hg38 is available at here as a reference for users of TriosCompass.
PEP is employed to import metadata of the NGS data, which can be either Fastq or BAM files. A PEP usually consists 3 files:
- One yaml file for the metadata.
- One csv file for the sample information.
- Another yaml (optional) to define schema to validate the metadata.
-
config/fastq_pep.yaml
pep_version: 2.0.0 sample_table: sample_fastq.csv # In manifest file, Sample_ID + Flowcell should be unique sample_modifiers: append: sample_name: "sn" derive: attributes: [sample_name] sources: sn: "{SAMPLE_ID}_{FLOWCELL}"
-
config/sample_fastq.csv
SAMPLE_ID,FLOWCELL,LANE,INDEX,R1,R2 HG002,BH2JWTDSX5,1,CGGTTGTT-GTGGTATG,data/fq/HG002_NA24385_son_80X_R1.fq.gz,data/fq/HG002_NA24385_son_80X_R2.fq.gz HG003,BH2JWTDSX5,1,GCGTCATT-CAGACGTT,data/fq/HG003_NA24149_father_80X_R1.fq.gz,data/fq/HG003_NA24149_father_80X_R2.fq.gz HG004,BH2JWTDSX5,1,CTGTTGAC-ACCTCAGT,data/fq/HG004_NA24143_mother_80X_R1.fq.gz,data/fq/HG004_NA24143_mother_80X_R2.fq.gz
-
workflow/schemas/fastq_schema.yaml
description: A example schema for a pipeline. imports: - http://schema.databio.org/pep/2.0.0.yaml # - TriosCompass_v2/workflow/schemas/2.0.0.yaml properties: samples: type: array items: type: object properties: SAMPLE_ID: type: string description: "sample id" FLOWCELL: type: string description: "Flowcell" INDEX: type: string description: "Library index" LANE: type: string description: "Lane number in flowcell" enum: ["1", "2"] R1: type: string description: "path to the R1 fastq file" R2: type: string description: "path to the R2 fastq file" required: - FLOWCELL - SAMPLE_ID - INDEX - R1 - R2
π In this example, 6 columns in the sample csv file are required: SAMPLE_ID, FLOWCELL, LANE, INDEX, R1, R2. Such information is mainly used to build RG (read group) tag in the bam file:
@RG PL:ILLUMINA ID:{FLOWCELL}_{LANE} SM:{SAMPLE_ID} PU:{SAMPLE_ID}_{FLOWCELL} LB:{SAMPLE_ID}_{INDEX}
Users may put dummy data if some of the information is not available, and {SAMPLE_ID} needs not to be unique in the sample table (as some samples might be sequenced in multiple flow cells to meet the coverage requirement). Nevertheless, the combination of {SAMPLE_ID} and {FLOWCELL} must be unique.
Users may put additional meta information in the sample csv file, which, however, will be ignored by TriosCompass.
The essential information for bam input is the sample id and the bam file location. Therefore, the metadata is much simpler compared to fastq input.
-
config/bam_pep.yaml
pep_version: 2.0.0 sample_table: sample_bam.csv sample_modifiers: append: sample_name: "sn" derive: attributes: [sample_name] sources: sn: "{SAMPLE_ID}"
-
config/sample_bam.csv
SAMPLE_ID,BAM HG002,sorted_bam/HG002_NA24385_son_80X.bam HG003,sorted_bam/HG003_NA24149_father_80X.bam HG004,sorted_bam/HG004_NA24143_mother_80X.bam
-
workflow/schemas/bam_schema.yaml
description: A example schema for a pipeline. imports: - http://schema.databio.org/pep/2.0.0.yaml # - TriosCompass_v2/workflow/schemas/2.0.0.yaml properties: samples: type: array items: type: object properties: SAMPLE_ID: type: string description: "sample id" BAM: type: string description: "path to the bam file" required: - SAMPLE_ID - BAM
π {SAMPLE_ID} should be unique and matched with the RG tag in the bam file in TriosCompass. If the bam file has RG different from {SAMPLE_ID}, users may activate reset_RG option so as to update all RG tag properly in the bam files.
- config/config.yaml
bam_input:
reset_RG: True
In addition to the NGS input data, another important input is pedigree files to define trios.
π There are several requirements about pedigree files from TriosCompass:
- Each trio is specified by one pedigree file;
- All pedigree files are named by the unique family ID, with ".ped" as the file extension, under the directory which is specified by config/config.yaml:
ped_dir: "ped"
- Rows for the trio members are in the order father-mother-child in the pedigree file.
- Trio member identifiers are matched with {SAMPLE_ID} as defined in the sample CSV filem and also matched with the RG tags of the bam files.
Below is an example of the pedigree file for the GIAB AJ family:
-
ped/AJ.ped
AJ HG003 0 0 1 1 AJ HG004 0 0 2 1 AJ HG002 HG003 HG004 1 1
fastq/ # for fastq input example
βββ HG002_NA24385_son_80X_R1.fq.gz
βββ HG002_NA24385_son_80X_R2.fq.gz
βββ HG003_NA24149_father_80X_R1.fq.gz
βββ HG003_NA24149_father_80X_R2.fq.gz
βββ HG004_NA24143_mother_80X_R1.fq.gz
βββ HG004_NA24143_mother_80X_R2.fq.gz
sorted_bam/ # for bam input example
βββ HG002_NA24385_son_80X.bam
βββ HG002_NA24385_son_80X.bam.bai
βββ HG003_NA24149_father_80X.bam
βββ HG003_NA24149_father_80X.bam.bai
βββ HG004_NA24143_mother_80X.bam
βββ HG004_NA24143_mother_80X.bam.bai
ref/
βββ hg38.wgs_interval.bed
βββ Homo_sapiens_assembly38.fasta
βββ STR/
ped/
βββ AJ.ped
TriosCompass_v2
βββ config/
βββ data/
βββ environment.yaml
βββ img/
βββ LICENSE
βββ README.md
βββ workflow/
We have introduced PEP yaml file to specify the NGS input files of TriosCompass. There are two other yaml files, which play essential roles in Snakemake workflows:
- the profile config.yaml file
- the configfile config.yaml file
The profile config.yaml file has multiple function roles in Snakemake:
- Define computing resources and threads.
- Specify command-line options required to start Snakemake workflow properly.
- Submit Snakemake jobs to the cluster.
We have provided an example profile to launch TriosCompass in a Slurm cluster.
π To separate the workspace from TriosCompass, we set $WORKSPACE as the working directory, where TriosCompass_v2 (the locally cloned repo) is a sub-folder of $WORKSPACE. Accordingly, the command below is recommended:
snakemake --profile TriosCompass_v2/workflow/profiles/slurm --configfile TriosCompass_v2/config/config.yaml
π There is a bug in the release of Snakemake we are testing (Version 7.3.7), so that the section set-threads does not work properly in the profile config.yaml file. We had used config/config.yaml as a work-around solution to specify threads for each Snakemake rule.
This configure file is for users to customized TriosCompass. In particular, users may it to enable (or disable) certain optional components of TriosCompass. Most of sections of config.yaml have been already introduced in the section Inputs. A complete example of config/config.yaml has been provided for users as a template.
A check list to launch TriosCompass
- Install singularity to the front-end nodes and working nodes of the cluster.
- Create new directory as work space and change directory to it.
- Git clone TriosCompass_v2.
- Install conda/mamba and activate the conda environment TriosCompassV2 using TriosCompass_v2/environment.txt
- Prepare and install resource bundle for the specific reference genome.
- Configure profile settings to submit jobs to the cluster.
- Move NGS input data under the work space.
- Prepare the proper PEP files for the NGS input.
- Configure TriosCompass_v2/config/config.yaml for the new run.
Finally, the command can be launched in this way:
### Assume $WORKSPACE is your working directory
cd $WORKSPACE
conda activate TriosCompassV2
module load singularity # or load singularity is available in alternative way
mkdir -p TMP
export TMPDIR=$WORKSPACE/TMP
snakemake --profile TriosCompass_v2/workflow/profiles/slurm --configfile TriosCompass_v2/config/config.yaml
The location of the output directory is also specified by config/config.yaml:
output_dir: "output"
Most of the output from TriosCompass will be saved to $WORKSPACE/output based on the configure setting above. Under the output folder, there are dozens of sub-folders for the (intermediate) outputs of TriosCompass.
ls $WORKSPACE/output/
benchmark dnSV gatk_genotype_gvcf_pb manta_sv splitted_panel
call_JIGV dnSV_summary gatkhc_pb merge_monstr svimmer
collectmultiplemetrics dumpstr_call glnexus merge_ped trio_manta_list
collectwgsmetrics dumpstr_locus graphtyper monstr vizaln
deepvariant_pb flagstat graphtyper_filter monstr_filter
dnm_vcf gatk_cgp hipstr phase_DNMs
dnm_vcf_summary gatk_combine_gvcf joint_dnSV slivar
dnSTR_summary GATK_DV manta smoove
There are 4 major outputs from TriosCompass:
- MultiQC reports
- DNM predictions, parental origins and visualizations
- dnSTR predictions and visulizations
- dnSV predictions
The MultiQC report is a colletion of QC metrics of NGS data.
Location of MultiQC output is specified by config["multiqc"]["output_dir"]:
multiqc:
enable: True
output_dir: "MultiQC_output"
tree MultiQC_output/
MultiQC_output/
βββ multiqc_data
β βββ multiqc_citations.txt
β βββ multiqc_data.json
β βββ multiqc_general_stats.txt
β βββ multiqc.log
β βββ multiqc_sources.txt
β βββ picard_alignment_readlength_plot.txt
β βββ picard_alignment_summary_Aligned_Bases.txt
β βββ picard_alignment_summary_Aligned_Reads.txt
β βββ picard_base_distribution_by_cycle__Adenine.txt
β βββ picard_base_distribution_by_cycle__Cytosine.txt
β βββ picard_base_distribution_by_cycle__Guanine.txt
β βββ picard_base_distribution_by_cycle__Thymine.txt
β βββ picard_base_distribution_by_cycle__Undetermined.txt
β βββ picard_gcbias_plot.txt
β βββ picard_insert_size_Counts.txt
β βββ picard_insert_size_Percentages.txt
β βββ picard_quality_by_cycle.txt
β βββ picard_quality_score_distribution.txt
β βββ samtools-flagstat-dp_Percentage_of_total.txt
β βββ samtools-flagstat-dp_Read_counts.txt
β βββ samtools-idxstats-mapped-reads-plot_Normalised_Counts.txt
β βββ samtools-idxstats-mapped-reads-plot_Observed_over_Expected_Counts.txt
β βββ samtools-idxstats-mapped-reads-plot_Raw_Counts.txt
β βββ samtools-idxstats-xy-plot.txt
βββ multiqc.html
The final DNM candidates are predicted by both DV and HC. And we extract the variant information from DV output and save it as {output}/dnm_vcf/{fam}.dnm.vcf.gz, where {output} and {fam} are placeholders for the output directory and trio ID, respecitively.
Besides, the counts of DNMs for all trios are summarised in the file {output}/dnm_vcf_summary/DNM_summary.txt.
The output of parental origins of DNMs is a tab-delimited test file, availalbe at {output}/phase_DNMs/{fam}.parental_origin.tab .
The format of this output is as described, and most of users may just focus on first two columns of the file.
JIGV snapshots have generated for each DNM, available as {output}/call_JIGV/{fam}.JIGV.html .
The dnSTRs predicted by HipSTR/MonSTR are available as {output}/monstr_filter/hipstr.filtered.tab . The output format of the MonSTR prediction is described here.
We have also generated visulization for dnSTR using VizAln (from HipSTR package). Each dnSTR of one trio has its own html file under {output}/vizaln/{fam}/hipstr/variants/.
tree output/vizaln/t0311/hipstr
output/vizaln/t0311/hipstr
βββ DONE
βββ variants
βββ chr10_103294239.dnm
βββ chr10_103294239.html
βββ chr10_34375054.dnm
βββ chr10_34375054.html
βββ chr10_46899557.dnm
βββ chr10_46899557.html
βββ chr10_50707597.dnm
βββ ...
π The ".dnm" files are dummy files for the use of Snakemake workflow in the process of scatter-gather dnSTRs.
An example of VizAln realignment is available here.
In this particular example, the identifiers of child, father and mother are SC260721, SC260720 and SC260670, respectively, from the family t0612.
cat ped_files/t0612.ped
t0612 SC260720 0 0 1 1
t0612 SC260670 0 0 2 1
t0612 SC260721 SC260720 SC260670 2 1
From the MonSTR prediction, we have:
chrom | pos | period | child | newallele | mutsize | child_gt | mat_gt | pat_gt |
---|---|---|---|---|---|---|---|---|
6 | 38571975 | 2 | SC260721 | 13 | 2 | 13,13 | 11,13 | 11,11 |
It suggests there is an de novo tandem repeat mutation started at the position chr6:38571975 in the subject SC260721. The unit length of repeat (period) is 2, and has 13 repeat units (newallele) in the child SC260721, which is 2 more than the reference genome (mutsize; i.e., the wild type genotype is [11,11]). The MonSTR prediction indicated that the genotypes of child, father and mother are [13,13], [11,11] and [11,13], respectively.
In the VizAln html page, we have genotypes of the family members marked in different ways. For example, "SC260670: 0|4" means one allele is wild type and another is 4 bp insertion. Negative values stands for deletion (read this for some details). Therefore, "SC260721: 4|4", "SC260720: 0|0" and "SC260670: 0|4" are well matched with the MonSTR prediction: [13,13], [11,11] and [11,13] for child, father and mother,respectively, in this example.
π Please note that the alignments from VizAln were arranged by the alphabet order of the sample identifiers.
As aforementioned, dnSVs are jointly predicted by Smoove and Manta/TypeGraph2. The results are merged VCF files generated by SURVIVOR: {output}/joint_dnSV/{fam}.dnSV.vcf:
-
output/joint_dnSV/t0311.dnSV.vcf
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SC501095 SC501095_1 chr18 55923597 chr18:55923597:OG N ]chr11:13001662]N 239 PASS SUPP=2;SUPP_VEC=11;SVLEN=0;SVTYPE=TRA;SVMETHOD=SURVIVOR1.0.7;CHR2=chr11;END=13001651;CIPOS=-107,0;CIEND=0,11;STRANDS=++ GT:PSV:LN:DR:ST:QV:TY:ID:RAL:AAL:CO 0/1:NA:42921828:0,13:--:239:TRA:668279_2:NA:NA:chr18_55923490-chr11_13001662 0/1:NA:42921946:0,0:++:217:TRA:chr18_55923597_OG:NA:NA:chr18_55923597-chr11_13001651 chr20 37068816 chr20:37068816:OG N ]chr11:5933976]N 255 PASS SUPP=2;SUPP_VEC=11;SVLEN=0;SVTYPE=TRA;SVMETHOD=SURVIVOR1.0.7;CHR2=chr11;END=5933971;CIPOS=-32,0;CIEND=0,5;STRANDS=++ GT:PSV:LN:DR:ST:QV:TY:ID:RAL:AAL:CO 0/1:NA:31134808:0,20:-+:80:TRA:668295_2:NA:NA:chr20_37068784-chr11_5933976 0/1:NA:31134845:0,0:++:255:TRA:chr20_37068816_OG:NA:NA:chr20_37068816-chr11_5933971
π Two sample columns are available for the child (e.g.,SC501095) in the example vcf output: the first one is from Smoove and the other is from Manta/TypeGraph2
Besides, a summary of dnSV count is available as {output}/dnSV_summary/dnSV_summary.txt.
Reports is a very useful feature of Snakemake. We utilized this feature to package TriosCompass outputs into a zip file and index them with the HTMP page.
Below is an example command to generate reports for TriosCompass:
snakemake --report TriosCompass_full_report.zip --profile TriosCompass_v2/workflow/profiles/slurm --configfile TriosCompass_v2/config/config.yaml
- A snapshot of the report HTML page.