MMQS filter is overly strict in fp_filter #972
Copying old notes here for reference:

Problem: There are two point mutations in KRAS. p.Q61 (chr12:25227341 T>G) is filtered as DOCM_ONLY, and p.R68 (chr12:25227322 T>G) does not appear in the VCF outputs at all. VAFs are high, and both variants are mostly supported by the same reads. Read depth there is 83,34 (ref,var) for a VAF of 29%, but the site is called by docm only - mutect, strelka, and varscan all filtered it out. There are no more than just those two mutations in those reads if you zoom out. Mutect also dropped them before the FP filter as being "clustered events". These are the bam-readcount values for that site:

Here's a summary of the mismatch quality sum (MMQS) filter that runs as part of the false-positive filter. You can read the definition of MMQS from Travis here: https://www.biostars.org/p/69910/#70336. It is meant to remove mismappings due to paralogous sequence, which are a real problem. They result in mappings that look a lot like this: 2-10 high-quality mismatched bases present in lots of reads. Unfortunately, that's also what real phased events look like, and it's possible that with the advent of longer reads and higher quality scores, the MMQS threshold of 50 is too low. (A small sketch of this check is included after these notes.)

Small sample size, but there are only two other sites in this sample removed solely by this filter, and they both are clearly garbage. The two bad sites in this sample have mmqs differences of 95.47 and 78.4, so not miles above the 60.3 of the good site in KRAS.

Okay, I've reviewed 4 samples now, each of which has only a fairly small number of sites (3-9 variants) that fail only this filter. For these sites, I calculated the mmqs_diff score and manually reviewed each one to see a) whether it was a good site or not and b) whether any of the later filters (mapq, llr) would have caught it. These are the results:
I did find one additional site that appeared real, so that's a 2/24 = 8% false negative rate for this filter, based on this limited sample size. Another 9 sites would have been caught by a subsequent filter, but dropping the MMQS filter would still leave us with 13 false positives across 4 samples. Not awful, but not great either. Two common patterns in the sites:
I feel like maybe #2 could be caught by messing with some of the other read/base quality params, but I don't have a clear idea of what to require to weed out #1 without also removing sites like the KRAS one above. Open to ideas on how to alter this filter to rescue these sites without losing the specificity bump it provides.
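For reference, here is a minimal sketch (not the pipeline's actual fp_filter code) of the check being discussed: compare bam-readcount's avg_sum_mismatch_qualities (MMQS) for the variant-supporting reads against the reference-supporting reads, and fail the site when the difference exceeds the threshold (50, per the notes above). The per-allele field positions are my reading of bam-readcount's output format, so treat them as an assumption and check them against your bam-readcount version.

```python
# Sketch of the MMQS-difference check described above, driven by bam-readcount output.
# Assumption: standard bam-readcount per-allele field order, with
# avg_sum_mismatch_qualities at index 9 of each colon-separated block.

MAX_MMQS_DIFF = 50.0  # the threshold questioned above


def parse_allele_block(block: str) -> dict:
    """Parse one colon-separated per-allele block from a bam-readcount line."""
    fields = block.split(":")
    return {
        "base": fields[0],
        "count": int(fields[1]),
        "avg_mapping_quality": float(fields[2]),
        "avg_sum_mismatch_qualities": float(fields[9]),  # MMQS
    }


def mmqs_diff(line: str, ref_base: str, var_base: str):
    """Return var MMQS - ref MMQS for one bam-readcount line, or None if unavailable."""
    # Line layout: chrom  pos  ref  depth  allele_block  allele_block ...
    blocks = [parse_allele_block(b) for b in line.rstrip().split("\t")[4:]]
    by_base = {b["base"]: b for b in blocks}
    if ref_base not in by_base or var_base not in by_base or by_base[var_base]["count"] == 0:
        return None
    return (by_base[var_base]["avg_sum_mismatch_qualities"]
            - by_base[ref_base]["avg_sum_mismatch_qualities"])


def fails_mmqs_only(line: str, ref_base: str, var_base: str,
                    max_diff: float = MAX_MMQS_DIFF) -> bool:
    """True if the site would be removed by the MMQS check alone."""
    diff = mmqs_diff(line, ref_base, var_base)
    return diff is not None and diff > max_diff
```

With the numbers quoted above, the good KRAS site (mmqs_diff 60.3) fails at the current cutoff of 50, while the two clearly-bad sites (95.47 and 78.4) would still fail even with the cutoff raised into the mid-60s or low 70s; that narrow gap is the trade-off any retuning of max_diff would have to weigh.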
Occasionally SNVs near each other will be removed inappropriately if they cause the MMQS value to exceed the threshold. It does still remove plenty of "junk", but we may want to consider replacing it or tweaking the parameters.
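One purely hypothetical way to tweak it along those lines (not something tested here): rather than failing a site on mmqs_diff alone, require a second quality signal, such as a drop in average mapping quality between reference- and variant-supporting reads, before removing it. The mapq-drop cutoff below is an arbitrary placeholder, and ref_metrics/var_metrics are the dicts produced by the parse_allele_block() sketch above.

```python
# Hypothetical sketch only: fail a site on MMQS only when the variant-supporting
# reads also map noticeably worse than the reference-supporting reads.
# Both cutoffs are placeholders, not tested values.

def fails_tweaked_mmqs(ref_metrics: dict, var_metrics: dict,
                       max_mmqs_diff: float = 50.0,
                       max_mapq_drop: float = 10.0) -> bool:
    """ref_metrics/var_metrics are dicts as returned by parse_allele_block()."""
    mmqs_diff = (var_metrics["avg_sum_mismatch_qualities"]
                 - ref_metrics["avg_sum_mismatch_qualities"])
    mapq_drop = (ref_metrics["avg_mapping_quality"]
                 - var_metrics["avg_mapping_quality"])
    # Paralog-style mismappings tend to show both signals; real phased events
    # whose reads still map well (like the KRAS example) should keep mapq_drop
    # small and survive.
    return mmqs_diff > max_mmqs_diff and mapq_drop > max_mapq_drop
```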