Hello,
I am trying to run SomaticSeq on RNA data (single-end reads), but it is really slow. The run never finished because I had to kill it after 7 days. I have many large samples (~10 GB), and if the smallest one (3.0 GB) takes more than 7 days to finish, I can't use it like this: the full set could run for months. I'm using Kubernetes, so I have enough computing capacity.
First, I ran the SplitNCigarReads tool (https://gatk.broadinstitute.org/hc/en-us/articles/360036858811-SplitNCigarReads) on my mapped RNA reads, and then I ran the variant callers (LoFreq, MuTect2, Strelka, VarDict, VarScan). Finally, I ran SomaticSeq with this command:
somaticseq_parallel.py --threads 20 --output-directory somatic_varcalls/sample1 --genome-reference GRCh38-p10.fa --inclusion-region wgs.bed --minimum-num-callers 0.4 single --bam-file sample1.RNAsplit.bam --mutect2-vcf somatic_varcalls/sample1/MuTect2.vcf --vardict-vcf somatic_varcalls/sample1/VarDict.vcf --lofreq-vcf somatic_varcalls/sample1/Lofreq.vcf --strelka-vcf somatic_varcalls/sample1/variants.vcf.gz --varscan-vcf somatic_varcalls/sample1/VarScan2.vcf
I also tried running SomaticSeq with only one variant caller at a time. With VarDict alone, it took 51 hours to finish. With Strelka alone, it took 32 hours. I also tried a smaller BED file (only a few exome positions), but nothing changed.
My theory is that SomaticSeq has a problem with heavily covered regions: when it splits the BED file for parallelization, some chunks finished quickly, but others took many hours or days. I think that subsampling the reads in those heavily covered areas might help, but I still don't know how to approach this.
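To make the subsampling idea concrete, here is a rough, untested sketch of what I have in mind. It assumes I already have a mean depth per BED region (for example, derived from `samtools bedcov` output divided by region length) and computes, for each region above a cap, the fraction of reads to keep. The function name `plan_downsampling`, the cap of 1000, and the example depths are all made up for illustration; none of this is part of SomaticSeq.

```python
def plan_downsampling(region_depths, max_depth=1000.0):
    """Return {region: keep_fraction} for regions whose mean depth exceeds max_depth.

    region_depths: dict mapping a region label to its mean read depth.
    max_depth: hypothetical cap; regions at or below it are left untouched.
    """
    plan = {}
    for region, depth in region_depths.items():
        if depth > max_depth:
            # Keeping this fraction of reads brings the mean depth down to about max_depth.
            plan[region] = max_depth / depth
    return plan


# Made-up example depths, not measured from my data.
depths = {
    "chr1:1000-2000": 250.0,
    "chr2:5000-6000": 40000.0,
    "chrM:1-16569": 8000.0,
}

plan = plan_downsampling(depths, max_depth=1000.0)
for region, fraction in sorted(plan.items()):
    print(f"{region}\tkeep {fraction:.3f} of reads")
```

The resulting fractions could then be passed to a subsampling step such as `samtools view -s` on each flagged region, but I have not verified whether this actually reduces the SomaticSeq runtime or how it affects the calls.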
Do you have any ideas or advice on what I can do about this? I have used SomaticSeq on DNA data many times before, so I know that it normally runs in a few minutes to a few hours.