Clinical research pipeline for exploring variants in whole genome sequencing data
crg2 is a research pipeline aimed at discovering clinically relevant variants (SNVs, SVs) in whole genome sequencing data. It aims to provide reproducible results, be computationally efficient, and transparent in it's workflow.
crg2 uses Snakemake and Conda to manage jobs and software dependencies.
- Download and setup Anaconda: https://docs.conda.io/projects/conda/en/latest/user-guide/install/linux.html
- Install Snakemake 5.10.0 via Conda: https://snakemake.readthedocs.io/en/stable/getting_started/installation.html
- Git clone this repo
- Make a directory for Conda to install all it's environments and executables in, for example:
mkdir ~/crg2-conda
- Navigate to the crg2 directory. Install all software dependencies using:
cd crg2
snakemake --use-conda -s Snakefile --conda-prefix ~/crg2-conda --create-envs-only
Make sure to replace ~/crg2-conda
with the path made in step 4. This will take a while.
-
Install these plugins for VEP:
LoF, MaxEntScan, SpliceRegion
. Refer to this page for installation instructions: https://useast.ensembl.org/info/docs/tools/vep/script/vep_plugins.html. The INSTALL.pl script has been renamed to vep_install in the VEP's Conda build. It is located in the conda environment directory, undershare/ensembl-vep-99.2-0/vep_install
. Therefore, your command should be similar to:fb5f2eb3/share/ensembl-vep-99.2-0/vep_install -a p --PLUGINS LoF,MaxEntScan,SpliceRegion
-
Git clone cre:
git clone https://github.com/ccmbioinfo/cre
to a safe place -
Replace the VEP path's to the VEP directory installed from step 6. Replace the cre path in crg2/config.yaml with the one from step 7.
- Make a folder in a directory with sufficient space. Copy over the template files samples.tsv, units.tsv, config.yaml. You may need to re-copy config.yaml if the file was recently updated in repo for previous crg2 runs.
mkdir NA12878
cp crg2/samples.tsv crg2/units.tsv crg2/config.yaml NA12878
- Reconfigure the 3 files to reflect project names, sample names, input fastq files, a panel bed file (if any) and a ped file (if any). Inclusion of a panel bed file will generate 2 SNV reports with all variants falling within these regions. Inclusion of a ped file currently does nothing except create a gemini db with the pedigree data stored in it.
samples.tsv
sample
NA12878
units.tsv
sample unit platform fq1 fq2
NA12878 1 ILLUMINA /hpf/largeprojects/ccm_dccforge/dccdipg/Common/NA12878/NA12878.bam_1.fq /hpf/largeprojects/ccm_dccforge/dccdipg/Common/NA12878/NA12878.bam_2.fq
config.yaml
run:
project: "NA12878"
samples: samples.tsv
units: units.tsv
ped: "" # leave this line blank if there is no ped
panel: "/hpf/largeprojects/ccmbio/dennis.kao/NA12878/panel.bed" # remove this line entirely if there is no panel bed file
flank: "100000"
cre: /hpf/largeprojects/ccm_dccforge/dccdipg/Common/pipelines/cre
ref:
name: GRCh37.75
...
- Active the conda environment with Snakemake 5.10.0
(base) [dennis.kao@qlogin5 crg2]$ conda activate snakemake
(snakemake) [dennis.kao@qlogin5 crg2]$ snakemake -v
5.10.0
- Test that the pipeline will run by adding the flag "-n" to the command in dnaseq.pbs.
(snakemake) [dennis.kao@qlogin5 crg2]$ snakemake --use-conda -s /hpf/largeprojects/ccm_dccforge/dccdipg/Common/pipelines/crg2/Snakefile --cores 4 --conda-prefix ~/crg2-conda -n
Building DAG of jobs...
Job counts:
count jobs
1 all
86 call_variants
86 combine_calls
86 genotype_variants
2 hard_filter_calls
1 map_reads
1 merge_calls
1 merge_variants
1 recalibrate_base_qualities
2 select_calls
1 snvreport
1 vcf2db
1 vcfanno
1 vep
1 vt
272
[Tue May 12 10:56:12 2020]
rule map_reads:
input: /hpf/largeprojects/ccm_dccforge/dccdipg/Common/NA12878/NA12878.bam_1.fq, /hpf/largeprojects/ccm_dccforge/dccdipg/Common/NA12878/NA12878.bam_2.fq
output: mapped/NA12878-1.sorted.bam
log: logs/bwa_mem/NA12878-1.log
jobid: 271
wildcards: sample=NA12878, unit=1
threads: 4
[Tue May 12 10:56:12 2020]
rule recalibrate_base_qualities:
input: mapped/NA12878-1.sorted.bam, /hpf/largeprojects/ccm_dccforge/dccdipg/Common/genomes/GRCh37d5/GRCh37d5.fa, /hpf/largeprojects/ccmbio/naumenko/tools/bcbio/gemini_data/dbsnp.b147.20160601.tidy.vcf.gz
output: recal/NA12878-1.bam
log: logs/gatk/bqsr/NA12878-1.log
jobid: 270
wildcards: sample=NA12878, unit=1
[Tue May 12 10:56:12 2020]
rule call_variants:
input: recal/NA12878-1.bam, /hpf/largeprojects/ccm_dccforge/dccdipg/Common/genomes/GRCh37d5/GRCh37d5.fa, /hpf/largeprojects/ccmbio/naumenko/tools/bcbio/gemini_data/dbsnp.b147.20160601.tidy.vcf.gz
output: called/NA12878.GL000204.1.g.vcf.gz
log: logs/gatk/haplotypecaller/NA12878.GL000204.1.log
jobid: 239
wildcards: sample=NA12878, contig=GL000204.1
...
- Run the pipeline
- as a single job using:
qsub dnaseq.pbs
. - as multi-node jobs using:
qsub dnaseq_cluster.pbs
. Change the value of variables defined inside the above file according to your system and make sure the following inpbs_profile/config.yaml
is set to absolute path.cluster-config: "/home/<username>/crg2/pbs_profile/pbs_config.yaml"
Referpbs_profile/cluster.md
document for detailed description.
The SNV report can be found in the directory: report/all/{PROJECT_ID}/{PROJECT_ID}.*.csv. The SV reports can be found in the directory: report/sv/{PROJECt_ID}.wgs.{VER}.{DATE}.tsv.
-
Map fastq's to the human decoy genome GrCh37d5
-
Picard MarkDuplicates, but don't remove reads
-
GATK4 base recalibration
-
Remove reads mapped to decoy chromosomes
-
Call SNV's and generate gVCFs
-
Merge gVCF's and perform joint genotyping
-
Filter against GATK best practices filters
-
Decompose multiallelics, sort and uniq the filtered VCF using vt
-
Annotate using vcfanno and VEP
-
Generate a gemini db using vcf2db.py
-
Generate a cre report using cre.sh
-
Call SV's using Manta, Smoove and Wham
-
Merge calls using MetaSV
-
Annotate VCF using snpEff and SVScores
-
Split multi-sample VCF into individual sample VCFs
-
Generate an annotated report using crg
Column descriptions and more info on how variants are filtered can be found here:
SNV: https://docs.google.com/document/d/1zL4QoINtkUd15a0AK4WzxXoTWp2MRcuQ9l_P9-xSlS4
SV: https://docs.google.com/document/d/1o870tr0rcshoae_VkG1ZOoWNSAmorCZlhHDpZuZogYE
The pipeline generates 4 reports:
-
wgs.snv - a report on coding SNVs across the entire genome
-
wgs.panel.snv - a report on SNVs within the panel specified bed file
-
wgs.panel.snv - a report on SNVs within the panel specified bed file with a 100kb flank on each side
-
wgs.sv - a report on SVs across the entire genome