Discrepancies in read depth of BAM files #110

gevro · 2024-12-06T21:24:22Z

Hi,
I encountered two related discrepancies:

I'm seeing LOWER read depths for every locus in the 'non-deduplicated' BAM file made by the bammarkduplicatesopt + bamaddreadbundles steps compared to the subsequent 'deduplicated BAM' file made by randomreadinbundle. But the former should have higher read depth than the latter.
Additionally, I'm seeing that the 'duplicate' flag is still set for many reads in the 'non-deduplicated' BAM file, even though bamaddreadbundles should have removed all duplicate reads. I think this might be the cause of the above, but not sure.

I have not been able to debug this.

Example (using the same settings as you use for bam2R in the snv_merge_and_vaf_calc script:
chr1 63265710:

coverage in the 'non-deduplicated BAM' = 2895
coverage in the 'deduplicated BAM' = 4564

Similar results for every other site.

As a sanity check, with bcftools mpileup with similar settings to bam2R shows the same pattern:

coverage in the 'non-deduplicated BAM' = 4036
coverage in the 'deduplicated BAM' = 7128

-> And if I include PCR/optical duplicate reads in the 'non-deduplicated BAM', the the coverage is much higher than 4036, which doesn't make sense, since all PCR/optical duplicate reads should have already been removed.

My best guess is that the PCR/optical duplicate marking is artifactual, and if I want to assess the read depth in the 'non-deduplicated' BAM file, then I should not filter PCR/optical duplicates.

Thanks!

fa8sanger · 2024-12-09T11:06:49Z

The coverage in the non-dedup bam should be equal or higher than the coverage in the dedup bam. You have some problem there, I’ve never seen that. To clarify, bamaddreadbundles doesn’t remove duplicates The result of randomreadinbundle should have no duplicates (check with samtools flagstat) On 6 Dec 2024, at 21:24, gevro ***@***.***> wrote: Hi, I encountered two related discrepancies: 1. I'm seeing LOWER read depths for every locus in the 'non-deduplicated' BAM file made by the bammarkduplicatesopt + bamaddreadbundles steps compared to the subsequent 'deduplicated BAM' file made by randomreadinbundle. But the former should have higher read depth than the latter. 2. Additionally, I'm seeing that the 'duplicate' flag is still set for many reads in the 'non-deduplicated' BAM file, even though bamaddreadbundles should have removed all duplicate reads. I think this might be the cause of the above, but not sure. I have not been able to debug this.

…

________________________________ Example (using the same settings as you use for bam2R in the snv_merge_and_vaf_calc script: chr1 63265710: * coverage in the 'non-deduplicated BAM' = 2895 * coverage in the 'deduplicated BAM' = 4564 Similar results for every other site. As a sanity check, with bcftools mpileup with similar settings to bam2R shows the same pattern: * coverage in the 'non-deduplicated BAM' = 4036 * coverage in the 'deduplicated BAM' = 7128 -> And if I include PCR/optical duplicate reads in the 'non-deduplicated BAM', the the coverage is much higher than 4036, which doesn't make sense, since all PCR/optical duplicate reads should have already been removed. Thanks! — Reply to this email directly, view it on GitHub [github.com]<https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_cancerit_NanoSeq_issues_110&d=DwMCaQ&c=D7ByGjS34AllFgecYw0iC6Zq7qlm8uclZFI0SqQnqBo&r=v9-R7fUmjpv-9Zaqyk1nlnlOC3qPkTEJz5tyYxg2uec&m=HjxuLCFBAVcG-y2ZbS2QFGoBXhZtKcX81c9KQynIpug8DA0wUxDq1Fu5Vs86xa_1&s=Ec9GCeuxOmrMrxfEoNNdq4bc4YV1kadXcfB6ZBMAjEE&e=>, or unsubscribe [github.com]<https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_notifications_unsubscribe-2Dauth_ADNUT3KJFVSY4PMCDVXF5U32EIIZZAVCNFSM6AAAAABTFOECWSVHI2DSMVQWIX3LMV43ASLTON2WKOZSG4ZDGOJVGMZDGMQ&d=DwMCaQ&c=D7ByGjS34AllFgecYw0iC6Zq7qlm8uclZFI0SqQnqBo&r=v9-R7fUmjpv-9Zaqyk1nlnlOC3qPkTEJz5tyYxg2uec&m=HjxuLCFBAVcG-y2ZbS2QFGoBXhZtKcX81c9KQynIpug8DA0wUxDq1Fu5Vs86xa_1&s=7zYJujzdbXkd0PcyMSJmiqYArPfufALqJ2SsHyzfT2I&e=>. You are receiving this because you are subscribed to this thread.Message ID: ***@***.***>

---------------------------------------------------------------------- The Wellcome Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, CB10 1SA.

gevro · 2024-12-09T15:05:47Z

Regarding bamaddreadbundles, I thought it does filter optical duplicates per the documentation: "With bamaddreadbundles, optical duplicates and unpaired mates are filtered and the RB tag is created:" and I think I see it in the code of bamaddreadbundles.cc here:

if (((suppl + qcfail + unmapped + secondary + has_od) == 0) &&
      ((paired  + has_rc + has_mc + has_rb + has_mb) == 5)) {
    return true;
  }
  return false;

So after bamaddreadbundles that makes the non-deduplicated BAM files, there shouldn't be duplicates. But there seem to be.

Here is samtools flagstat for a non-deduplicated BAM file made by bamaddreadbundles, showing that there are still duplicates marked:

249561989 + 0 in total (QC-passed reads + QC-failed reads)
246436630 + 0 primary
0 + 0 secondary
3125359 + 0 supplementary
234311099 + 0 duplicates
234311099 + 0 primary duplicates
249349532 + 0 mapped (99.91% : N/A)
246224173 + 0 primary mapped (99.91% : N/A)
246436630 + 0 paired in sequencing
123218315 + 0 read1
123218315 + 0 read2
234362430 + 0 properly paired (95.10% : N/A)
246058992 + 0 with itself and mate mapped
165181 + 0 singletons (0.07% : N/A)
9763136 + 0 with mate mapped to a different chr
8564464 + 0 with mate mapped to a different chr (mapQ>=5)

The process we're doing to generate the non-deduplicated BAM is:
fastq -> extract_tags.py -> bwa mem -> bamsormadup -> bammarkduplicatesopt -> bamaddreadbundles

Then we run randomreadinbundle to generate the deduplicated BAM.

Indeed, there is much higher coverage in the non-deduplicated BAM compared to the deduplicated BAM when we ignore the samtools 1024 duplicate flag bit. But when we filter the non-deduplicated BAM to remove reads with the 1024 duplicate flag set, then the non-deduplicated BAM has lower coverage than the deduplicated BAM, even though technically the non-deduplicated BAM shouldn't have any reads marked as duplicates because bamaddreadbundles should have filtered out optical duplicates that are the only type of duplicate that has been marked.

I wonder if the issue has something to do with how the NanoSeq pipeline analyzes the optical duplicate marking of bammarkduplicatesopt.

From what I can tell bammarkduplicatesopt marks optical duplicates using the standard sam 1024 flag. But then bamaddreadbundles seems to be filtering out optical duplicates based on the presence of the 'od' tag, which is a different type of mark.

bamaddreadbundles defines the od tag as follows:
https://manpages.debian.org/unstable/biobambam2/bammarkduplicatesopt.1.en.html
"odtag=: aux field tag for storing non dup read pair name for optical duplicates"

I'm not positive how to interpret that od tag, but I'm not sure that filtering based on the presence of that tag necessarily removes all the optical duplicates?

Tracing the od tag code in biobambam2 takes me from here:
https://gitlab.com/german.tischler/biobambam2/-/blob/master/src/programs/bammarkduplicatesopt.cpp?ref_type=heads

to here:
https://gitlab.com/german.tischler/libmaus2/-/blob/master/src/libmaus2/bambam/DupMarkBase.hpp

But reading this code, it is still not clear to me what the od tag is, and how filtering reads that have the od tag set is different from filtering reads that have the 1024 SAM flag set.

Perhaps another explanation is that bammarkduplicatesopt does not correctly set the SAM 1024 flag (not sure why if so), and that filtering on the presence of the 'od' tag is the correct way to filter out optical duplicates.

fa8sanger · 2024-12-09T15:13:07Z

There’s some confusion here... The only duplicates bamaddreadbundles removes are the optical duplicates. It leaves all the good PCR duplicates there, they are the basis of duplex calling. 1024 is for PCR or optical duplicates. We don’t work with that, we work with the “od” tag. Since the bamaddreadbundle bam is >> than the randomreadinbundle bam, there’s no problem here, that’s how it has to be So much tutoring! On 9 Dec 2024, at 15:06, gevro ***@***.***> wrote: Regarding bamaddreadbundles, I thought it does filter optical duplicates per the documentation: "With bamaddreadbundles, optical duplicates and unpaired mates are filtered and the RB tag is created:" and I think I see it in the code of bamaddreadbundles.cc here: if (((suppl + qcfail + unmapped + secondary + has_od) == 0) && ((paired + has_rc + has_mc + has_rb + has_mb) == 5)) { return true; } return false; So after bamaddreadbundles that makes the neat BAM files, there shouldn't be duplicates. But there seem to be. Here is samtools flagstat for a non-deduplicated BAM file made by bamaddreadbundles, showing that there are still duplicates marked: 249561989 + 0 in total (QC-passed reads + QC-failed reads) 246436630 + 0 primary 0 + 0 secondary 3125359 + 0 supplementary 234311099 + 0 duplicates 234311099 + 0 primary duplicates 249349532 + 0 mapped (99.91% : N/A) 246224173 + 0 primary mapped (99.91% : N/A) 246436630 + 0 paired in sequencing 123218315 + 0 read1 123218315 + 0 read2 234362430 + 0 properly paired (95.10% : N/A) 246058992 + 0 with itself and mate mapped 165181 + 0 singletons (0.07% : N/A) 9763136 + 0 with mate mapped to a different chr 8564464 + 0 with mate mapped to a different chr (mapQ>=5) The process we're doing to generate the non-deduplicated BAM is: fastq -> extract_tags.py -> bwa mem -> bamsormadup -> bammarkduplicatesopt -> bamaddreadbundles Then we run randomreadinbundle to generate the deduplicated BAM. Indeed, there is much higher coverage in the non-deduplicated BAM compared to the deduplicated BAM when we ignore the samtools 1024 duplicate flag bit. But when we filter the non-deduplicated BAM to remove reads with the 1024 duplicate flag set, then the non-deduplicated BAM has lower coverage than the deduplicated BAM, even though technically the non-deduplicated BAM shouldn't have any reads marked as duplicates because bamaddreadbundles should have filtered out optical duplicates that are the only type of duplicate that has been marked. I wonder if the issue has something to do with how the NanoSeq pipeline analyzes the optical duplicate marking of bammarkduplicatesopt. From what I can tell bammarkduplicatesopt marks optical duplicates using the standard sam 1024 flag. But then bamaddreadbundles seems to be filtering out optical duplicates based on the presence of the 'od' tag, which is a different type of mark. bamaddreadbundles defines the od tag as follows: https://manpages.debian.org/unstable/biobambam2/bammarkduplicatesopt.1.en.html [manpages.debian.org]<https://urldefense.proofpoint.com/v2/url?u=https-3A__manpages.debian.org_unstable_biobambam2_bammarkduplicatesopt.1.en.html&d=DwMCaQ&c=D7ByGjS34AllFgecYw0iC6Zq7qlm8uclZFI0SqQnqBo&r=v9-R7fUmjpv-9Zaqyk1nlnlOC3qPkTEJz5tyYxg2uec&m=jYKzvIA78BfQvvjySmZOGf8KbfeGFjh7obE6pg_tJDlNkfICMiXCVVa7LwEo70w1&s=-R_zhIAIPgfVXS1yS-hluyT4EQU225G2AYnMciMMteU&e=> "odtag=: aux field tag for storing non dup read pair name for optical duplicates" I'm not positive how to interpret that od tag, but I'm not sure that filtering based on the presence of that tag necessarily removes all the optical duplicates? Tracing the od tag code in biobambam2 takes me from here: https://gitlab.com/german.tischler/biobambam2/-/blob/master/src/programs/bammarkduplicatesopt.cpp?ref_type=heads [gitlab.com]<https://urldefense.proofpoint.com/v2/url?u=https-3A__gitlab.com_german.tischler_biobambam2_-2D_blob_master_src_programs_bammarkduplicatesopt.cpp-3Fref-5Ftype-3Dheads&d=DwMCaQ&c=D7ByGjS34AllFgecYw0iC6Zq7qlm8uclZFI0SqQnqBo&r=v9-R7fUmjpv-9Zaqyk1nlnlOC3qPkTEJz5tyYxg2uec&m=jYKzvIA78BfQvvjySmZOGf8KbfeGFjh7obE6pg_tJDlNkfICMiXCVVa7LwEo70w1&s=aNnZ9Ik7MIAgA6TCCW8WtGshMYtY4e2ioCf8g7bxpNc&e=> to here: https://gitlab.com/german.tischler/libmaus2/-/blob/master/src/libmaus2/bambam/DupMarkBase.hpp [gitlab.com]<https://urldefense.proofpoint.com/v2/url?u=https-3A__gitlab.com_german.tischler_libmaus2_-2D_blob_master_src_libmaus2_bambam_DupMarkBase.hpp&d=DwMCaQ&c=D7ByGjS34AllFgecYw0iC6Zq7qlm8uclZFI0SqQnqBo&r=v9-R7fUmjpv-9Zaqyk1nlnlOC3qPkTEJz5tyYxg2uec&m=jYKzvIA78BfQvvjySmZOGf8KbfeGFjh7obE6pg_tJDlNkfICMiXCVVa7LwEo70w1&s=BQlTo5_w2zFw68qcoQ0D4IVmyvMV1y-_1yVTu2etJrU&e=> But reading this code, it is still not clear to me what the od tag is, and how filtering reads that have the od tag set is different from filtering reads that have the 1024 SAM flag set. — Reply to this email directly, view it on GitHub [github.com]<https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_cancerit_NanoSeq_issues_110-23issuecomment-2D2528266871&d=DwMCaQ&c=D7ByGjS34AllFgecYw0iC6Zq7qlm8uclZFI0SqQnqBo&r=v9-R7fUmjpv-9Zaqyk1nlnlOC3qPkTEJz5tyYxg2uec&m=jYKzvIA78BfQvvjySmZOGf8KbfeGFjh7obE6pg_tJDlNkfICMiXCVVa7LwEo70w1&s=bpzaYXhFpSEiXTvN4EIrAWsbTmJuEYyJcCckel9LkU0&e=>, or unsubscribe [github.com]<https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_notifications_unsubscribe-2Dauth_ADNUT3KUOUMKNDCLZW3Z3U32EWWWBAVCNFSM6AAAAABTFOECWSVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDKMRYGI3DMOBXGE&d=DwMCaQ&c=D7ByGjS34AllFgecYw0iC6Zq7qlm8uclZFI0SqQnqBo&r=v9-R7fUmjpv-9Zaqyk1nlnlOC3qPkTEJz5tyYxg2uec&m=jYKzvIA78BfQvvjySmZOGf8KbfeGFjh7obE6pg_tJDlNkfICMiXCVVa7LwEo70w1&s=mRGhiUNy9eCQjiC_8V_WWWnNRXFstA_2AJrtf-ftnMk&e=>. You are receiving this because you commented.Message ID: ***@***.***>

…

---------------------------------------------------------------------- The Wellcome Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, CB10 1SA.

gevro · 2024-12-09T15:39:18Z

Thanks for your help! I see now the source of my confusion: I thought 'PCR duplicate' marking for duplex analysis was performed later in the pipeline. But what you do is use the duplicate marks made by bamsormadup and then you distinguish optical duplicates from PCR duplicates using the od tag.

Looks like bamsormadup also marks optical duplicates with a default distance of 100, so I'm assuming bammarkduplicatesopt when finding optical duplicates ignores the prior duplicate marking of bamsormadup.

I see now also your nextflow pipeline updated bamsormadup to optminpixeldif=10 from the default of 100 -- wasn't sure the reason for that?

bamsormadup inputformat=sam level=0 blocksortverbose=0 rcsupport=1 threads=$task.cpus fragmergepar=$task.cpus optminpixeldif=10 | \\
            bammarkduplicatesopt verbose=0 level=0 index=0 optminpixeldif=2500

fa8sanger · 2024-12-09T15:44:15Z

The reason to increase optminpixeldif to 2500 is because in current flowcells you can find the same molecule in further cells than with the older HiSeq 1000/2500. These technically are not “optical” duplicates, but in practice it’s the same, and we don’t want them. On 9 Dec 2024, at 15:39, gevro ***@***.***> wrote: Thanks for your help! I see now the source of my confusion: I thought 'PCR duplicate' marking for duplex analysis was performed later in the pipeline. But what you do is use the duplicate marks made by bamsormadup and then you distinguish optical duplicates from PCR duplicates using the od tag. Looks like bamsormadup also marks optical duplicates with a default distance of 100, so I'm assuming bammarkduplicatesopt when finding optical duplicates ignores the prior duplicate marking of bamsormadup. I see now also your nextflow pipeline updated bamsormadup to optminpixeldif=10 from the default of 100 -- wasn't sure the reason for that? bamsormadup inputformat=sam level=0 blocksortverbose=0 rcsupport=1 threads=$task.cpus fragmergepar=$task.cpus optminpixeldif=10 | \\ bammarkduplicatesopt verbose=0 level=0 index=0 optminpixeldif=2500 — Reply to this email directly, view it on GitHub [github.com]<https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_cancerit_NanoSeq_issues_110-23issuecomment-2D2528420048&d=DwMCaQ&c=D7ByGjS34AllFgecYw0iC6Zq7qlm8uclZFI0SqQnqBo&r=v9-R7fUmjpv-9Zaqyk1nlnlOC3qPkTEJz5tyYxg2uec&m=sr6SJopsLGJxWj_T5WMkwVjnKG3vluSylG_0L0GRCm0eJx4kQpGs0wvpDnkXvvFB&s=Gec66bH8OlZuu_IBi_OjnOA9e5svYT6wXTaMU75X9eE&e=>, or unsubscribe [github.com]<https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_notifications_unsubscribe-2Dauth_ADNUT3IDY3SE7L7RT4DFYV32EW2T3AVCNFSM6AAAAABTFOECWSVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDKMRYGQZDAMBUHA&d=DwMCaQ&c=D7ByGjS34AllFgecYw0iC6Zq7qlm8uclZFI0SqQnqBo&r=v9-R7fUmjpv-9Zaqyk1nlnlOC3qPkTEJz5tyYxg2uec&m=sr6SJopsLGJxWj_T5WMkwVjnKG3vluSylG_0L0GRCm0eJx4kQpGs0wvpDnkXvvFB&s=_sna82xRPSJOxXUCIT5hUWhM8aa-BFkYMuCRY9Boocc&e=>. You are receiving this because you commented.Message ID: ***@***.***>

…

---------------------------------------------------------------------- The Wellcome Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, CB10 1SA.

gevro · 2024-12-09T15:52:43Z

Yes, understood that 2500 is best for the latest Novaseq sequencers, but the part I wasn't sure about was why the preceding bamsormadup is run with optminpixeldif=10 instead of the default (optminpixeldif=100).

I would have thought it wouldn't matter what optminpixeldif value is used by bamsormadup since bammarkduplicatesopt is run afterwards.

Thanks!

fa8sanger · 2024-12-09T15:59:02Z

I see why that is confusing The first bamsormadup is run to add rc and mc tags. For some reason the code shows “mapped_od.bam” as the output, but the intended result is not OD marking. I think the first program is run on a name sorted bam while bammarkduplicatesopt is run on a coordinate sorted bam.

…

---------------------------------------------------------------------- The Wellcome Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, CB10 1SA.

gevro · 2024-12-09T17:04:34Z

Got it. So I assume the parameter optminpixeldif=10 is not really important to specify when running bamsormadup, since bammarkduplicatesopt redoes optical duplicate marking anyway? I wasn't sure why you all added optminpixeldif=10 to bamsormadup.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Discrepancies in read depth of BAM files #110

Discrepancies in read depth of BAM files #110

gevro commented Dec 6, 2024 •

edited

Loading

fa8sanger commented Dec 9, 2024 via email

gevro commented Dec 9, 2024 •

edited

Loading

fa8sanger commented Dec 9, 2024 via email

gevro commented Dec 9, 2024 •

edited

Loading

fa8sanger commented Dec 9, 2024 via email

gevro commented Dec 9, 2024

fa8sanger commented Dec 9, 2024 via email

gevro commented Dec 9, 2024

Discrepancies in read depth of BAM files #110

Discrepancies in read depth of BAM files #110

Comments

gevro commented Dec 6, 2024 • edited Loading

fa8sanger commented Dec 9, 2024 via email

gevro commented Dec 9, 2024 • edited Loading

fa8sanger commented Dec 9, 2024 via email

gevro commented Dec 9, 2024 • edited Loading

fa8sanger commented Dec 9, 2024 via email

gevro commented Dec 9, 2024

fa8sanger commented Dec 9, 2024 via email

gevro commented Dec 9, 2024

gevro commented Dec 6, 2024 •

edited

Loading

gevro commented Dec 9, 2024 •

edited

Loading

gevro commented Dec 9, 2024 •

edited

Loading