Skip to content

Commit 52ee2f1

Browse files
author
Li Tai Fang
committed
docs, and allow configurable pass/lowqual threshold in wrapper script
1 parent bd84b52 commit 52ee2f1

File tree

4 files changed

+127
-49
lines changed

4 files changed

+127
-49
lines changed

SomaticSeq.Wrapper.sh

Lines changed: 17 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -3,7 +3,7 @@
33

44
set -e
55

6-
OPTS=`getopt -o o:M:m:I:V:v:J:S:D:U:L:l:p:g:c:d:s:G:T:N:C:x:R:e:i:z:Z:k: --long output-dir:,mutect:,mutect2:,indelocator:,varscan-snv:,varscan-indel:,jsm:,sniper:,vardict:,muse:,lofreq-snv:,lofreq-indel:,scalpel:,genome-reference:,cosmic:,dbsnp:,snpeff-dir:,gatk:,tumor-bam:,normal-bam:,classifier-snv:,classifier-indel:,ada-r-script:,exclusion-region:,inclusion-region:,truth-indel:,truth-snv:,keep-intermediates: -n 'SomaticSeq.Wrapper.sh' -- "$@"`
6+
OPTS=`getopt -o o:M:m:I:V:v:J:S:D:U:L:l:p:g:c:d:s:G:T:N:C:x:R:e:i:z:Z:k: --long output-dir:,mutect:,mutect2:,indelocator:,varscan-snv:,varscan-indel:,jsm:,sniper:,vardict:,muse:,lofreq-snv:,lofreq-indel:,scalpel:,genome-reference:,cosmic:,dbsnp:,snpeff-dir:,gatk:,tumor-bam:,normal-bam:,classifier-snv:,classifier-indel:,ada-r-script:,exclusion-region:,inclusion-region:,truth-indel:,truth-snv:,pass-threshold:,lowqual-threshold:,keep-intermediates: -n 'SomaticSeq.Wrapper.sh' -- "$@"`
77

88
if [ $? != 0 ] ; then echo "Failed parsing options." >&2 ; exit 1 ; fi
99

@@ -16,6 +16,8 @@ PATH=/net/kodiak/volumes/lake/shared/opt/python3/bin:/home/ltfang/apps/bedtools-
1616
MYDIR="$( cd "$( dirname "$0" )" && pwd )"
1717

1818
keep_intermediates=0
19+
pass_threshold=0.5
20+
lowqual_threshold=0.1
1921

2022
while true; do
2123
case "$1" in
@@ -181,6 +183,18 @@ while true; do
181183
*) snpgroundtruth=$2 ; shift 2 ;;
182184
esac ;;
183185

186+
--pass-threshold )
187+
case "$2" in
188+
"") shift 2 ;;
189+
*) pass_threshold=$2 ; shift 2 ;;
190+
esac ;;
191+
192+
--lowqual-threshold )
193+
case "$2" in
194+
"") shift 2 ;;
195+
*) lowqual_threshold=$2 ; shift 2 ;;
196+
esac ;;
197+
184198
-k | --keep-intermediates )
185199
case "$2" in
186200
"") shift 2 ;;
@@ -436,7 +450,7 @@ then
436450
# If a classifier is used, assume predictor.R, and do the prediction routine:
437451
if [[ -r ${snpclassifier} ]] && [[ -r ${ada_r_script} ]]; then
438452
R --no-save --args "$snpclassifier" "${merged_dir}/Ensemble.sSNV.tsv" "${merged_dir}/Trained.sSNV.tsv" < "$ada_r_script"
439-
$MYDIR/SSeq_tsv2vcf.py -tsv ${merged_dir}/Trained.sSNV.tsv -vcf ${merged_dir}/Trained.sSNV.vcf -pass 0.5 -low 0.1 -all -phred -tools $tool_mutect $tool_varscan $tool_jsm $tool_sniper $tool_vardict $tool_muse $tool_lofreq
453+
$MYDIR/SSeq_tsv2vcf.py -tsv ${merged_dir}/Trained.sSNV.tsv -vcf ${merged_dir}/Trained.sSNV.vcf -pass $pass_threshold -low $lowqual_threshold -all -phred -tools $tool_mutect $tool_varscan $tool_jsm $tool_sniper $tool_vardict $tool_muse $tool_lofreq
440454

441455
# If ground truth is here, assume builder.R, and build a classifier
442456
elif [[ -r ${snpgroundtruth} ]] && [[ -r ${ada_r_script} ]]; then
@@ -556,7 +570,7 @@ then
556570
# If a classifier is used, use it:
557571
if [[ -r ${indelclassifier} ]] && [[ -r ${ada_r_script} ]]; then
558572
R --no-save --args "$indelclassifier" "${merged_dir}/Ensemble.sINDEL.tsv" "${merged_dir}/Trained.sINDEL.tsv" < "$ada_r_script"
559-
$MYDIR/SSeq_tsv2vcf.py -tsv ${merged_dir}/Trained.sINDEL.tsv -vcf ${merged_dir}/Trained.sINDEL.vcf -pass 0.5 -low 0.1 -all -phred -tools $tool_indelocator $tool_varscan $tool_vardict $tool_lofreq $tool_scalpel
573+
$MYDIR/SSeq_tsv2vcf.py -tsv ${merged_dir}/Trained.sINDEL.tsv -vcf ${merged_dir}/Trained.sINDEL.vcf -pass $pass_threshold -low $lowqual_threshold -all -phred -tools $tool_indelocator $tool_varscan $tool_vardict $tool_lofreq $tool_scalpel
560574

561575
# If ground truth is here, assume builder.R, and build a classifier
562576
elif [[ -r ${indelgroundtruth} ]] && [[ -r ${ada_r_script} ]]; then

docs/Manual.pdf

1 KB
Binary file not shown.

docs/Manual.tex

Lines changed: 46 additions & 32 deletions
Original file line numberDiff line numberDiff line change
@@ -69,11 +69,11 @@
6969

7070
\section{Introduction}
7171

72-
SomaticSeq is a flexible post-somatic-mutation-calling workflow for improved accuracy. We have incorporated multiple somatic mutation caller(s) to obtain a combined call set, and then it uses machine learning to distinguish true mutations from false positives from that call set. We have incorporated the following somatic mutation caller: MuTect/Indelocator, VarScan2, JointSNVMix, SomaticSniper, VarDict, MuSE, LoFreq, and Scalpel. You may incorporate some or all of those callers into your own pipeline with SomaticSeq.
72+
SomaticSeq is a flexible post-somatic-mutation-calling workflow for improved accuracy. We have incorporated multiple somatic mutation caller(s) to obtain a combined call set, and then it uses machine learning to distinguish true mutations from false positives from that call set. We have incorporated the following somatic mutation caller: MuTect/Indelocator, MuTect2, VarScan2, JointSNVMix, SomaticSniper, VarDict, MuSE, LoFreq, and Scalpel. You may incorporate some or all of those callers into your own pipeline with SomaticSeq.
7373

7474
The manuscript, An ensemble approach to accurately detect somatic mutations using SomaticSeq, is published in \href{http://dx.doi.org/10.1186/s13059-015-0758-2}{Genome Biology 2015, 16:197}. The SomaticSeq project is located at \href{http://bioinform.github.io/somaticseq/}{\textit{http://bioinform.github.io/somaticseq/}}. The data described in the manuscript is also described at \href{http://bioinform.github.io/somaticseq/data.html}{\textit{http://bioinform.github.io/somaticseq/data.html}}. There have been some major improvements since the publication.
7575

76-
SomaticSeq.Wrapper.sh is a bash script that calls a series of scripts to combine the output of the somatic mutation caller(s), after the somatic mutation callers are run. Then, depending on what R scripts are fed to SomaticSeq.Wrapper.sh, it will either 1) train the call set into a classifier, 2) predict high-confidence somatic mutations from the call set based on a pre-defined classifier, or 3) simply label the calls (i.e., PASS or REJECT) based on majority vote of the tools.
76+
SomaticSeq.Wrapper.sh is a bash script that calls a series of scripts to combine the output of the somatic mutation caller(s), after the somatic mutation callers are run. Then, depending on what R scripts are fed to SomaticSeq.Wrapper.sh, it will either 1) train the call set into a classifier, 2) predict high-confidence somatic mutations from the call set based on a pre-defined classifier, or 3) simply label the calls (i.e., PASS, LowQual, or REJECT) based on majority vote of the tools.
7777

7878
\subsection{Dependencies}
7979

@@ -95,7 +95,7 @@ \subsection{Dependencies}
9595
Optional: dbSNP and COSMIC files in VCF format (if you want to use these features as a part of the training).
9696

9797
\item
98-
At least one of MuTect/Indelocator, VarScan2, JointSNVMix, SomaticSniper, VarDict, MuSE, LoFreq, and/or Scalpel. Those are the tools we have incorporated in SomaticSeq. If there are other somatic tools that may be good addition to our list, please make the suggestion to us.
98+
At least one of MuTect/Indelocator, MuTect2, VarScan2, JointSNVMix, SomaticSniper, VarDict, MuSE, LoFreq, and/or Scalpel. Those are the tools we have incorporated in SomaticSeq. If there are other somatic tools that may be good addition to our list, please make the suggestion to us.
9999

100100
\end{itemize}
101101

@@ -117,30 +117,32 @@ \subsection{To train data set into a classifier}
117117
# For training, truth file and the correct R script are required.
118118

119119
SomaticSeq.Wrapper.sh \
120-
--mutect MuTect/variants.snp.vcf \
121-
--mutect2 MuTect2/variants.vcf \
122-
--indelocator Indelocator/variants.indel.vcf \
123-
--varscan-snv VarScan2/variants.snp.vcf \
124-
--varscan-indel VarScan2/variants.indel.vcf \
125-
--jsm JointSNVMix2/variants.snp.vcf \
126-
--sniper SomaticSniper/variants.snp.vcf \
127-
--vardict VarDict/variants.vcf \
128-
--muse MuSE/variants.snp.vcf \
129-
--lofreq-snv LoFreq/variants.snp.vcf \
130-
--lofreq-indel LoFreq/variants.indel.vcf \
131-
--scalpel Scalpel/variants.indel.vcf \
132-
--normal-bam matched_normal.bam \
133-
--tumor-bam tumor.bam \
134-
--ada-r-script ada_model_builder.R \
135-
--genome-reference human_b37.fasta \
136-
--cosmic cosmic.b37.v71.vcf \
137-
--dbsnp dbSNP.b37.v141.vcf \
138-
--gatk $PATH/TO/GenomeAnalysisTK.jar \
139-
--exclusion-region ignore.bed \
140-
--inclusion-region validated.bed
141-
--truth-snv truth.snp.vcf \
142-
--truth-indel truth.indel.vcf \
143-
--output-dir $OUTPUT_DIR
120+
--mutect MuTect/variants.snp.vcf \
121+
--mutect2 MuTect2/variants.vcf \
122+
--indelocator Indelocator/variants.indel.vcf \
123+
--varscan-snv VarScan2/variants.snp.vcf \
124+
--varscan-indel VarScan2/variants.indel.vcf \
125+
--jsm JointSNVMix2/variants.snp.vcf \
126+
--sniper SomaticSniper/variants.snp.vcf \
127+
--vardict VarDict/variants.vcf \
128+
--muse MuSE/variants.snp.vcf \
129+
--lofreq-snv LoFreq/variants.snp.vcf \
130+
--lofreq-indel LoFreq/variants.indel.vcf \
131+
--scalpel Scalpel/variants.indel.vcf \
132+
--normal-bam matched_normal.bam \
133+
--tumor-bam tumor.bam \
134+
--ada-r-script ada_model_builder.R \
135+
--genome-reference human_b37.fasta \
136+
--cosmic cosmic.b37.v71.vcf \
137+
--dbsnp dbSNP.b37.v141.vcf \
138+
--gatk $PATH/TO/GenomeAnalysisTK.jar \
139+
--exclusion-region ignore.bed \
140+
--inclusion-region validated.bed
141+
--truth-snv truth.snp.vcf \
142+
--truth-indel truth.indel.vcf \
143+
--pass-threshold 0.5 \
144+
--lowqual-threshold 0.1 \
145+
--output-dir $OUTPUT_DIR
144146
\end{lstlisting}
145147

146148
SomaticSeq.Wrapper.sh supports any combination of the somatic mutation callers we have incorporated into the workflow. SomaticSeq will run based on the output VCFs you have provided. It will train for SNV and/or INDEL if you provide the truth.snp.vcf and/or truth.indel.vcf file(s) as well as the proper R script (ada\_model\_builder.R). Otherwise, it will fall back to the simple caller consensus mode.
@@ -203,7 +205,7 @@ \section{The step-by-step SomaticSeq Workflow}
203205

204206

205207
\subsection{Combine the call sets}
206-
We use GATK CombineVariants to combine the VCF files from different callers, although it does not matter what tools are used to merge VCF files. We GATK CombineVariants because it's quite fast. To make them compatible with GATK, the VCF files are modified.
208+
We use GATK CombineVariants to combine the VCF files fkeep-intermediates:rom different callers, although it does not matter what tools are used to merge VCF files. We GATK CombineVariants because it's quite fast. To make them compatible with GATK, the VCF files are modified.
207209

208210
A simple alternative method is to use GNU sort and uniq in Linux to list all the unique first-5-column (i.e., CHROM, POS, ID, REF, and ALT) from all the VCF files, and then fill the remaining required VCF columns with whatever string and sort it according to the reference. That VCF file will do the job just fine.
209211

@@ -552,22 +554,35 @@ \subsection{Version 2.2.1}
552554
InDel\_3bp now stands for indel counts within 3 bps of the variant site, instead of exactly 3 bps from the variant site as it was previously (likewise for InDel\_2bp).
553555

554556
\item
555-
Collapse MQ0 (mapping quality of 0) reads supporting reference/variant reads into a single metric of MQ00 reads (i.e., tBAM\_MQ0 and nBAM\_MQ0). From experience, the number of MQ00 reads is at least equally predictive of false positive calls, rather than distinguishing if those MQ0 reads support reference or variant.
557+
Collapse MQ0 (mapping quality of 0) reads supporting reference/variant reads into a single metric of MQ0 reads (i.e., tBAM\_MQ0 and nBAM\_MQ0). From experience, the number of MQ0 reads is at least equally predictive of false positive calls, rather than distinguishing if those MQ0 reads support reference or variant.
556558

557559
\item
558560
Obtain SOR (Somatic Odds Ratio) from BAM files instead of VarDict's VCF file.
559561

560562
\item
561563
Fixed a typo in the SomaticSeq.Wrapper.sh script that did not handle inclusion region correctly.
562564

565+
\end{itemize}
566+
567+
563568

569+
\subsection{Version 2.2.2}
570+
571+
\begin{itemize}
572+
573+
\item
574+
Got around an occasional unexplained issue in then ada package were the SOR is sometimes categorized as type, by forcing it to be numeric.
575+
576+
\item
577+
Defaults PASS score from 0.7 to 0.5, and make them tunable in the SomaticSeq.Wrapper.sh script (--pass-threshold and --lowqual-threshold).
578+
564579
\end{itemize}
565580

566581

567582

568583

569584

570-
\section{To do: planned improvement}
585+
\section{Future development}
571586

572587
\begin{itemize}
573588

@@ -587,8 +602,7 @@ \section{To do: planned improvement}
587602

588603

589604
\section{Contact Us}
590-
For suggestions, bug reports, or technical support, please post in the \href{https://github.com/bioinform/somaticseq/issues}{github issues} page, or email \href{mailto:[email protected]}{li\_[email protected]}.
591-
605+
For suggestions, bug reports, or technical support, please post in the \href{https://github.com/bioinform/somaticseq/issues}{github issues} page. The developers are alerted when issues are created there.
592606

593607
\end{sloppypar}
594608
\end{document}

0 commit comments

Comments
 (0)