
Discrepancies in read depth of BAM files #110

Open
gevro opened this issue Dec 6, 2024 · 8 comments
@gevro

gevro commented Dec 6, 2024

Hi,
I encountered two related discrepancies:

  1. I'm seeing LOWER read depths for every locus in the 'non-deduplicated' BAM file made by the bammarkduplicatesopt + bamaddreadbundles steps compared to the subsequent 'deduplicated BAM' file made by randomreadinbundle. But the former should have higher read depth than the latter.

  2. Additionally, I'm seeing that the 'duplicate' flag is still set for many reads in the 'non-deduplicated' BAM file, even though bamaddreadbundles should have removed all duplicate reads. I think this might be the cause of the above, but not sure.

I have not been able to debug this.


Example (using the same settings as you use for bam2R in the snv_merge_and_vaf_calc script):
chr1 63265710:

  • coverage in the 'non-deduplicated BAM' = 2895
  • coverage in the 'deduplicated BAM' = 4564

Similar results for every other site.

As a sanity check, bcftools mpileup with similar settings to bam2R shows the same pattern:

  • coverage in the 'non-deduplicated BAM' = 4036
  • coverage in the 'deduplicated BAM' = 7128

-> And if I include PCR/optical duplicate reads in the 'non-deduplicated BAM', the coverage is much higher than 4036, which doesn't make sense, since all PCR/optical duplicate reads should have already been removed.


My best guess is that the PCR/optical duplicate marking is artifactual, and if I want to assess the read depth in the 'non-deduplicated' BAM file, then I should not filter PCR/optical duplicates.
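For reference, the duplicate filtering being discussed comes down to testing bit 0x400 (decimal 1024, "PCR or optical duplicate") of the SAM FLAG field. A minimal Python sketch of that flag arithmetic (the FLAG values below are made up for illustration, not taken from the actual BAMs):

```python
# The SAM "PCR or optical duplicate" flag is bit 0x400 (decimal 1024).
DUP = 0x400

def is_duplicate(flag: int) -> bool:
    """True if the read is flag-marked as a PCR/optical duplicate."""
    return bool(flag & DUP)

# Hypothetical FLAG values for reads overlapping one locus:
flags = [99, 1123, 147, 1171, 83]  # 1123 = 99 | 1024, 1171 = 147 | 1024

depth_all = len(flags)                                     # duplicates included
depth_filtered = sum(not is_duplicate(f) for f in flags)   # duplicates excluded
print(depth_all, depth_filtered)  # -> 5 3
```

Excluding flag-marked reads is what both bam2R and `bcftools mpileup` do by default, which is why the "non-deduplicated" depth drops below the deduplicated one when most reads still carry the flag.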

Thanks!

@fa8sanger
Collaborator

fa8sanger commented Dec 9, 2024 via email

@gevro
Author

gevro commented Dec 9, 2024

Regarding bamaddreadbundles: I thought it does filter optical duplicates, per the documentation ("With bamaddreadbundles, optical duplicates and unpaired mates are filtered and the RB tag is created"), and I think I see this in the code of bamaddreadbundles.cc here:

if (((suppl + qcfail + unmapped + secondary + has_od) == 0) &&
    ((paired + has_rc + has_mc + has_rb + has_mb) == 5)) {
  return true;
}
return false;

So after bamaddreadbundles that makes the non-deduplicated BAM files, there shouldn't be duplicates. But there seem to be.

Here is samtools flagstat for a non-deduplicated BAM file made by bamaddreadbundles, showing that there are still duplicates marked:

249561989 + 0 in total (QC-passed reads + QC-failed reads)
246436630 + 0 primary
0 + 0 secondary
3125359 + 0 supplementary
234311099 + 0 duplicates
234311099 + 0 primary duplicates
249349532 + 0 mapped (99.91% : N/A)
246224173 + 0 primary mapped (99.91% : N/A)
246436630 + 0 paired in sequencing
123218315 + 0 read1
123218315 + 0 read2
234362430 + 0 properly paired (95.10% : N/A)
246058992 + 0 with itself and mate mapped
165181 + 0 singletons (0.07% : N/A)
9763136 + 0 with mate mapped to a different chr
8564464 + 0 with mate mapped to a different chr (mapQ>=5)
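Plugging the numbers from the flagstat output above into a quick sanity check (this is just arithmetic on the reported counts, not part of the pipeline):

```python
# Counts taken from the samtools flagstat output above.
primary = 246_436_630
primary_dups = 234_311_099  # primary reads with the 0x400 flag still set

dup_fraction = primary_dups / primary
print(f"{dup_fraction:.1%} of primary reads are still flag-marked as duplicates")
```

So roughly 95% of primary reads in the "non-deduplicated" BAM still carry the duplicate flag, which matches the depth drop seen when that flag is filtered.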

The process we're doing to generate the non-deduplicated BAM is:
fastq -> extract_tags.py -> bwa mem -> bamsormadup -> bammarkduplicatesopt -> bamaddreadbundles

Then we run randomreadinbundle to generate the deduplicated BAM.

Indeed, there is much higher coverage in the non-deduplicated BAM than in the deduplicated BAM when we ignore the SAM 1024 duplicate flag bit. But when we filter the non-deduplicated BAM to remove reads with the 1024 flag set, the non-deduplicated BAM has lower coverage than the deduplicated BAM. Technically, the non-deduplicated BAM shouldn't have any reads marked as duplicates at all, because bamaddreadbundles should have filtered out optical duplicates, the only type of duplicate that had been marked.

I wonder if the issue has something to do with how the NanoSeq pipeline analyzes the optical duplicate marking of bammarkduplicatesopt.

From what I can tell, bammarkduplicatesopt marks optical duplicates using the standard SAM 1024 flag. But bamaddreadbundles seems to be filtering out optical duplicates based on the presence of the 'od' tag, which is a different type of mark.

The bammarkduplicatesopt man page defines the od tag as follows:
https://manpages.debian.org/unstable/biobambam2/bammarkduplicatesopt.1.en.html
"odtag=: aux field tag for storing non dup read pair name for optical duplicates"

I'm not positive how to interpret that od tag, and I'm not sure that filtering based on the presence of that tag necessarily removes all the optical duplicates.

Tracing the od tag code in biobambam2 takes me from here:
https://gitlab.com/german.tischler/biobambam2/-/blob/master/src/programs/bammarkduplicatesopt.cpp?ref_type=heads

to here:
https://gitlab.com/german.tischler/libmaus2/-/blob/master/src/libmaus2/bambam/DupMarkBase.hpp

But reading this code, it is still not clear to me what the od tag is, and how filtering reads that have the od tag set is different from filtering reads that have the 1024 SAM flag set.

Perhaps another explanation is that bammarkduplicatesopt does not correctly set the SAM 1024 flag (not sure why if so), and that filtering on the presence of the 'od' tag is the correct way to filter out optical duplicates.
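To make the distinction concrete, the two filtering criteria can be sketched over SAM text records in pure Python (the two records below are hypothetical, invented only to show a read that is flag-marked but has no od tag versus one that carries it):

```python
def has_dup_flag(sam_line: str) -> bool:
    """Read is marked duplicate via the SAM FLAG field (bit 0x400)."""
    flag = int(sam_line.split('\t')[1])
    return bool(flag & 0x400)

def has_od_tag(sam_line: str) -> bool:
    """Read carries the biobambam2 'od' aux tag in its optional fields."""
    return any(field.startswith('od:') for field in sam_line.split('\t')[11:])

# Hypothetical records: r1 is flag-marked duplicate with no od tag,
# r2 additionally carries the od tag.
r1 = "read1\t1123\tchr1\t100\t60\t50M\t=\t200\t150\tACGT\tFFFF"
r2 = "read2\t1123\tchr1\t100\t60\t50M\t=\t200\t150\tACGT\tFFFF\tod:Z:read9"

# Filtering on the FLAG removes both reads; filtering on the od tag removes only r2.
print([has_dup_flag(r1), has_od_tag(r1)])  # [True, False]
print([has_dup_flag(r2), has_od_tag(r2)])  # [True, True]
```

This is only a sketch of the two criteria as I understand them; whether bamaddreadbundles's od-based filter and a 1024-flag filter select the same read sets is exactly the open question.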

@fa8sanger
Collaborator

fa8sanger commented Dec 9, 2024 via email

@gevro
Author

gevro commented Dec 9, 2024

Thanks for your help! I see now the source of my confusion: I thought 'PCR duplicate' marking for duplex analysis was performed later in the pipeline. But what you do is use the duplicate marks made by bamsormadup and then you distinguish optical duplicates from PCR duplicates using the od tag.

Looks like bamsormadup also marks optical duplicates, with a default distance of 100, so I'm assuming that, when finding optical duplicates, bammarkduplicatesopt ignores the prior duplicate marking done by bamsormadup.

I also see that your Nextflow pipeline updated bamsormadup to optminpixeldif=10 from the default of 100 -- I wasn't sure of the reason for that.

bamsormadup inputformat=sam level=0 blocksortverbose=0 rcsupport=1 threads=$task.cpus fragmergepar=$task.cpus optminpixeldif=10 | \\
            bammarkduplicatesopt verbose=0 level=0 index=0 optminpixeldif=2500

@fa8sanger
Collaborator

fa8sanger commented Dec 9, 2024 via email

@gevro
Author

gevro commented Dec 9, 2024

Yes, understood that 2500 is best for the latest NovaSeq sequencers, but the part I wasn't sure about was why the preceding bamsormadup is run with optminpixeldif=10 instead of the default (optminpixeldif=100).

I would have thought it wouldn't matter what optminpixeldif value is used by bamsormadup since bammarkduplicatesopt is run afterwards.

Thanks!

@fa8sanger
Collaborator

fa8sanger commented Dec 9, 2024 via email

@gevro
Author

gevro commented Dec 9, 2024

Got it. So I assume the optminpixeldif=10 parameter is not really important when running bamsormadup, since bammarkduplicatesopt redoes the optical duplicate marking anyway? I wasn't sure why you added optminpixeldif=10 to bamsormadup.
