docs/Manual.tex: 46 additions & 32 deletions
@@ -69,11 +69,11 @@
69
69
70
70
\section{Introduction}
71
71
72
-
SomaticSeq is a flexible post-somatic-mutation-calling workflow for improved accuracy. It combines the output of multiple somatic mutation callers into a single call set, and then uses machine learning to distinguish true mutations from false positives in that call set. We have incorporated the following somatic mutation callers: MuTect/Indelocator, VarScan2, JointSNVMix, SomaticSniper, VarDict, MuSE, LoFreq, and Scalpel. You may incorporate some or all of those callers into your own pipeline with SomaticSeq.
72
+
SomaticSeq is a flexible post-somatic-mutation-calling workflow for improved accuracy. It combines the output of multiple somatic mutation callers into a single call set, and then uses machine learning to distinguish true mutations from false positives in that call set. We have incorporated the following somatic mutation callers: MuTect/Indelocator, MuTect2, VarScan2, JointSNVMix, SomaticSniper, VarDict, MuSE, LoFreq, and Scalpel. You may incorporate some or all of those callers into your own pipeline with SomaticSeq.
73
73
74
74
The manuscript, An ensemble approach to accurately detect somatic mutations using SomaticSeq, is published in \href{http://dx.doi.org/10.1186/s13059-015-0758-2}{Genome Biology 2015, 16:197}. The SomaticSeq project is located at \href{http://bioinform.github.io/somaticseq/}{\textit{http://bioinform.github.io/somaticseq/}}. The data described in the manuscript is also described at \href{http://bioinform.github.io/somaticseq/data.html}{\textit{http://bioinform.github.io/somaticseq/data.html}}. There have been some major improvements since the publication.
75
75
76
-
SomaticSeq.Wrapper.sh is a bash script that calls a series of scripts to combine the output of the somatic mutation caller(s) after they have been run. Then, depending on which R scripts are fed to SomaticSeq.Wrapper.sh, it will either 1) train a classifier from the call set, 2) predict high-confidence somatic mutations from the call set based on a pre-defined classifier, or 3) simply label the calls (i.e., PASS or REJECT) based on majority vote of the tools.
76
+
SomaticSeq.Wrapper.sh is a bash script that calls a series of scripts to combine the output of the somatic mutation caller(s) after they have been run. Then, depending on which R scripts are fed to SomaticSeq.Wrapper.sh, it will either 1) train a classifier from the call set, 2) predict high-confidence somatic mutations from the call set based on a pre-defined classifier, or 3) simply label the calls (i.e., PASS, LowQual, or REJECT) based on majority vote of the tools.
77
77
78
78
\subsection{Dependencies}
79
79
@@ -95,7 +95,7 @@ \subsection{Dependencies}
95
95
Optional: dbSNP and COSMIC files in VCF format (if you want to use these features as a part of the training).
96
96
97
97
\item
98
-
At least one of MuTect/Indelocator, VarScan2, JointSNVMix, SomaticSniper, VarDict, MuSE, LoFreq, and/or Scalpel. Those are the tools we have incorporated in SomaticSeq. If there are other somatic tools that may be good additions to our list, please suggest them to us.
98
+
At least one of MuTect/Indelocator, MuTect2, VarScan2, JointSNVMix, SomaticSniper, VarDict, MuSE, LoFreq, and/or Scalpel. Those are the tools we have incorporated in SomaticSeq. If there are other somatic tools that may be good additions to our list, please suggest them to us.
99
99
100
100
\end{itemize}
101
101
@@ -117,30 +117,32 @@ \subsection{To train data set into a classifier}
117
117
# For training, truth file and the correct R script are required.
118
118
119
119
SomaticSeq.Wrapper.sh \
120
-
--mutect MuTect/variants.snp.vcf \
121
-
--mutect2 MuTect2/variants.vcf \
122
-
--indelocator Indelocator/variants.indel.vcf \
123
-
--varscan-snv VarScan2/variants.snp.vcf \
124
-
--varscan-indel VarScan2/variants.indel.vcf \
125
-
--jsm JointSNVMix2/variants.snp.vcf \
126
-
--sniper SomaticSniper/variants.snp.vcf \
127
-
--vardict VarDict/variants.vcf \
128
-
--muse MuSE/variants.snp.vcf \
129
-
--lofreq-snv LoFreq/variants.snp.vcf \
130
-
--lofreq-indel LoFreq/variants.indel.vcf \
131
-
--scalpel Scalpel/variants.indel.vcf \
132
-
--normal-bam matched_normal.bam \
133
-
--tumor-bam tumor.bam \
134
-
--ada-r-script ada_model_builder.R \
135
-
--genome-reference human_b37.fasta \
136
-
--cosmic cosmic.b37.v71.vcf \
137
-
--dbsnp dbSNP.b37.v141.vcf \
138
-
--gatk $PATH/TO/GenomeAnalysisTK.jar \
139
-
--exclusion-region ignore.bed \
140
-
--inclusion-region validated.bed \
141
-
--truth-snv truth.snp.vcf \
142
-
--truth-indel truth.indel.vcf \
143
-
--output-dir $OUTPUT_DIR
120
+
--mutect MuTect/variants.snp.vcf \
121
+
--mutect2 MuTect2/variants.vcf \
122
+
--indelocator Indelocator/variants.indel.vcf \
123
+
--varscan-snv VarScan2/variants.snp.vcf \
124
+
--varscan-indel VarScan2/variants.indel.vcf \
125
+
--jsm JointSNVMix2/variants.snp.vcf \
126
+
--sniper SomaticSniper/variants.snp.vcf \
127
+
--vardict VarDict/variants.vcf \
128
+
--muse MuSE/variants.snp.vcf \
129
+
--lofreq-snv LoFreq/variants.snp.vcf \
130
+
--lofreq-indel LoFreq/variants.indel.vcf \
131
+
--scalpel Scalpel/variants.indel.vcf \
132
+
--normal-bam matched_normal.bam \
133
+
--tumor-bam tumor.bam \
134
+
--ada-r-script ada_model_builder.R \
135
+
--genome-reference human_b37.fasta \
136
+
--cosmic cosmic.b37.v71.vcf \
137
+
--dbsnp dbSNP.b37.v141.vcf \
138
+
--gatk $PATH/TO/GenomeAnalysisTK.jar \
139
+
--exclusion-region ignore.bed \
140
+
--inclusion-region validated.bed \
141
+
--truth-snv truth.snp.vcf \
142
+
--truth-indel truth.indel.vcf \
143
+
--pass-threshold 0.5 \
144
+
--lowqual-threshold 0.1 \
145
+
--output-dir $OUTPUT_DIR
144
146
\end{lstlisting}
145
147
146
148
SomaticSeq.Wrapper.sh supports any combination of the somatic mutation callers we have incorporated into the workflow. SomaticSeq will run based on the output VCFs you have provided. It will train for SNV and/or INDEL if you provide the truth.snp.vcf and/or truth.indel.vcf file(s) as well as the proper R script (ada\_model\_builder.R). Otherwise, it will fall back to the simple caller consensus mode.
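The --pass-threshold and --lowqual-threshold options in the command above set the score cut-offs used to label calls. A minimal sketch of how two such cut-offs could map a classifier score to PASS, LowQual, or REJECT is shown below. The function name label_call and the treatment of scores that fall exactly on a cut-off are assumptions made for illustration; this is not SomaticSeq's own code.

```shell
#!/bin/sh
# Hypothetical helper (not SomaticSeq code): label one call from its score.
# Usage: label_call SCORE [PASS_THRESHOLD] [LOWQUAL_THRESHOLD]
# The defaults 0.5 and 0.1 mirror the values used in the command above.
label_call() {
    awk -v score="$1" -v pass="${2:-0.5}" -v lowqual="${3:-0.1}" 'BEGIN {
        # "+ 0" forces a numeric comparison.
        # Assumption: a score equal to a cut-off gets the higher label.
        if (score + 0 >= pass + 0)         print "PASS"
        else if (score + 0 >= lowqual + 0) print "LowQual"
        else                               print "REJECT"
    }'
}
```

For example, `label_call 0.3` falls between the two default cut-offs, while `label_call 0.6 0.7 0.1` shows how the same score would be labeled under a stricter PASS cut-off of 0.7.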
We use GATK CombineVariants to combine the VCF files from different callers, although it does not matter what tools are used to merge VCF files. We use GATK CombineVariants because it's quite fast. To make them compatible with GATK, the VCF files are modified.
208
+
We use GATK CombineVariants to combine the VCF files from different callers, although it does not matter what tools are used to merge VCF files. We use GATK CombineVariants because it's quite fast. To make them compatible with GATK, the VCF files are modified.
207
209
208
210
A simple alternative method is to use GNU sort and uniq in Linux to list all the unique combinations of the first five columns (i.e., CHROM, POS, ID, REF, and ALT) from all the VCF files, then fill the remaining required VCF columns with any placeholder string, and sort the result according to the reference. That VCF file will do the job just fine.
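As a rough illustration of this sort-and-uniq alternative, the sketch below builds a sites-only VCF from several caller VCFs. It is not part of SomaticSeq. The function name merge_sites, the use of a samtools-style .fai index to define the reference contig order, the "." placeholder, and the file names in the usage line are all assumptions made for the example.

```shell
#!/bin/sh
# Hypothetical helper (not part of SomaticSeq): print a sites-only VCF holding
# the union of the sites found in the given VCF files.
# Usage: merge_sites reference.fasta.fai caller1.vcf caller2.vcf ... > union.vcf
merge_sites() {
    fai=$1
    shift
    tab=$(printf '\t')

    # Minimal header; only the eight fixed VCF columns are emitted.
    printf '##fileformat=VCFv4.1\n'
    printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'

    # 1. Drop header lines and keep CHROM, POS, ID, REF, ALT.
    # 2. sort | uniq removes records reported by more than one caller.
    # 3. Prefix each record with its contig's rank in the .fai file so the
    #    final sort follows the reference order (records on contigs that are
    #    missing from the .fai file are dropped).
    # 4. Sort by contig rank, then by position, and strip the rank again.
    # 5. Fill QUAL, FILTER and INFO with the placeholder ".".
    grep -hv '^#' "$@" |
        cut -f 1-5 |
        sort | uniq |
        awk -v OFS="$tab" 'NR == FNR { rank[$1] = FNR; next }
                           ($1 in rank) { print rank[$1], $0 }' "$fai" - |
        sort -t "$tab" -k 1,1n -k 3,3n |
        cut -f 2- |
        awk -v OFS="$tab" '{ print $0, ".", ".", "." }'
}
```

Because uniqueness is judged on all five columns, the same site reported with different ID values by different callers would appear more than once; dropping or normalizing the ID column first is one way to avoid that.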
209
211
@@ -552,22 +554,35 @@ \subsection{Version 2.2.1}
552
554
InDel\_3bp now stands for indel counts within 3 bps of the variant site, instead of exactly 3 bps from the variant site as it was previously (likewise for InDel\_2bp).
553
555
554
556
\item
555
-
Collapse MQ0 (mapping quality of 0) reads supporting reference/variant reads into a single metric of MQ00 reads (i.e., tBAM\_MQ0 and nBAM\_MQ0). From experience, the number of MQ00 reads is at least equally predictive of false positive calls, rather than distinguishing if those MQ0 reads support reference or variant.
557
+
Collapse the counts of MQ0 (mapping quality of 0) reads supporting the reference and the variant into a single count of MQ0 reads (i.e., tBAM\_MQ0 and nBAM\_MQ0). From experience, the total number of MQ0 reads is at least as predictive of false positive calls as separate counts of MQ0 reads supporting the reference or the variant.
556
558
557
559
\item
558
560
Obtain SOR (Somatic Odds Ratio) from BAM files instead of VarDict's VCF file.
559
561
560
562
\item
561
563
Fixed a typo in the SomaticSeq.Wrapper.sh script that caused the inclusion region to be handled incorrectly.
562
564
565
+
\end{itemize}
566
+
567
+
563
568
569
+
\subsection{Version 2.2.2}
570
+
571
+
\begin{itemize}
572
+
573
+
\item
574
+
Got around an occasional unexplained issue in the ada package where the SOR is sometimes categorized as type, by forcing it to be numeric.
575
+
576
+
\item
577
+
Changed the default PASS score threshold from 0.7 to 0.5, and made the PASS and LowQual thresholds tunable in the SomaticSeq.Wrapper.sh script (--pass-threshold and --lowqual-threshold).
For suggestions, bug reports, or technical support, please post in the \href{https://github.com/bioinform/somaticseq/issues}{github issues} page, or email \href{mailto:[email protected]}{li\_[email protected]}.
591
-
605
+
For suggestions, bug reports, or technical support, please post on the \href{https://github.com/bioinform/somaticseq/issues}{GitHub issues} page. The developers are alerted when issues are created there.