feat: FastqToBam can extract UMI(s) from the comment in the read name #989
Codecov Report
All modified and coverable lines are covered by tests ✅
Additional details and impacted files
@@ Coverage Diff @@
## main #989 +/- ##
=======================================
Coverage 95.62% 95.63%
=======================================
Files 126 126
Lines 7364 7380 +16
Branches 500 498 -2
=======================================
+ Hits 7042 7058 +16
Misses 322 322
| recs(0).apply[String]("RX") shouldBe "ACGT-CGTA-GG-CC" | ||
| recs(1).apply[String]("RX") shouldBe "TTGA-TAAT-TA-AA" |
Why are these suffixed with -GG-CC and -TA-AA?
As per the method docs, the UMIs may be extracted from the read names, the read sequences, or both. In this case, the read structure shows UMI bases in the read sequences themselves, as well as the comment in the read name header, so we get four (!) UMI segments, two from the read sequences, and two from the comment in the read header.
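To make the four-segment result concrete, here is a minimal sketch of how header-derived and read-derived UMIs might be combined into a single `RX` value. The object and method names (`UmiJoinSketch`, `joinUmis`) are hypothetical, and the ordering (header UMIs first) is an assumption based on the `-n` flag doc, which says UMIs from read names are prepended to UMIs from reads. Which of the segments in the test came from the comment is likewise assumed for illustration.

```scala
object UmiJoinSketch {
  /** Joins UMIs taken from the read header with UMIs taken from the read bases.
    *
    * Assumption: header-derived UMIs come first and all segments are hyphen-delimited.
    * This is an illustration, not the FastqToBam implementation.
    */
  def joinUmis(fromHeader: Option[String], fromReads: Seq[String]): Option[String] = {
    val all = fromHeader.toSeq ++ fromReads
    if (all.isEmpty) None else Some(all.mkString("-"))
  }
}
```

Under these assumptions, a header UMI of `ACGT-CGTA` plus read-structure UMIs `GG` and `CC` gives the `ACGT-CGTA-GG-CC` value asserted in the test.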
| @arg(flag='q', doc="Tag in which to store molecular barcode/UMI qualities.") val umiQualTag: Option[String] = None, | ||
| @arg(flag='Q', doc="Store the sample barcode qualities in the QT Tag.") val storeSampleBarcodeQualities: Boolean = false, | ||
| @arg(flag='n', doc="Extract UMI(s) from read names and prepend to UMIs from reads.") val extractUmisFromReadNames: Boolean = false, | ||
| @arg(flag='n', doc="Extract UMI(s) from read names and prepend to UMIs from reads.", mutex=Array("extractUmisFromReadComment")) |
TIL about the mutex parameter to @arg!
How are mutually exclusive args rendered in the CLI help doc? Is it worth noting that the two arguments are mutually exclusive in the usage above, or will that be automatically noted in the help for the respective flags?
| /** | ||
| * Extracts the UMI from an Illumina fastq style read name. Illumina documents their FASTQ read names as: | ||
| * @<instrument>:<run number>:<flowcell ID>:<lane>:<tile>:<x-pos>:<y-pos>:<UMI> <read>:<is filtered>:<control number>:<index> | ||
| * `@<instrument>:<run number>:<flowcell ID>:<lane>:<tile>:<x-pos>:<y-pos>:<UMI> <read>:<is filtered>:<control number>:<index>`` |
Is that an extra backtick at the end?
| * Extracts the UMI from an Illumina fastq style read name. Illumina documents their FASTQ read names as: | ||
| * `@<instrument>:<run number>:<flowcell ID>:<lane>:<tile>:<x-pos>:<y-pos>:<UMI> <read>:<is filtered>:<control number>:<index>`` |
Should this be updated to reflect the expected structure of the read comment?
| * If `strict` is false the last segment is returned so long as it appears to be a valid UMI. | ||
| */ | ||
| def extractUmisFromReadComment(comment: String, delimiter: Char = ':', strict: Boolean): Option[String] = { | ||
| // If strict, check that the read name actually has eight parts, which is expected |
| // If strict, check that the read name actually has eight parts, which is expected | |
| // If strict, check that the read name actually has four parts, which is expected |
| * If `strict` is true the comment _must_ contain either 4 colon-separated segments, | ||
| * with the UMI being the last in the case of 4. |
I'm far enough removed from the original motivation for this PR that I don't recall - is there a convention for the comment to have four segments?
| * field may contain multiple UMIs, in which case they will delimit them with `+` characters. Pluses will be | ||
| * translated to hyphens before returning. | ||
| * | ||
| * If `strict` is true the comment _must_ contain either 4 colon-separated segments, |
| * If `strict` is true the comment _must_ contain either 4 colon-separated segments, | |
| * If `strict` is true the comment _must_ contain 4 colon-separated segments, |
As far as I can tell, there's no other permissible condition when strict is true.
strict also controls the validation - should the details of what constitutes a valid UMI be included here as well?
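For reference while discussing the docs, the behaviour described in the scaladoc could be sketched roughly as below. This is a standalone illustration, not the code in this PR: the object name `ReadCommentUmiSketch` is hypothetical, the rule for a valid UMI (only `A`, `C`, `G`, `T`, `N`, and hyphens) is an assumption, and so is the choice to throw in strict mode rather than return `None`.

```scala
object ReadCommentUmiSketch {
  // Assumption: a UMI is valid if it is non-empty and uses only these characters.
  private val ValidUmiChars: Set[Char] = "ACGTN-".toSet

  /** Extracts a UMI from the last delimited segment of a FASTQ read comment.
    *
    * In strict mode the comment must have exactly 4 segments and the last one must
    * look like a UMI, otherwise an IllegalArgumentException is thrown (assumed behaviour).
    * In non-strict mode the last segment is returned only if it looks like a UMI.
    */
  def extractUmi(comment: String, delimiter: Char = ':', strict: Boolean): Option[String] = {
    val segments = comment.split(delimiter)
    if (strict) require(segments.length == 4, s"Expected 4 segments in comment: $comment")

    // Multiple UMIs are delimited with '+' in the header; translate to hyphens.
    val candidate = segments.lastOption.map(_.replace('+', '-'))
    val valid     = candidate.filter(umi => umi.nonEmpty && umi.forall(ValidUmiChars.contains))

    if (strict) require(valid.isDefined, s"Invalid UMI in comment: $comment")
    valid
  }
}
```

With a comment such as `1:N:0:ACGT+CGTA`, strict extraction under these assumptions yields `ACGT-CGTA`; a comment with fewer than four segments fails in strict mode and is only checked for UMI-like content in non-strict mode.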
| if (extractUmisFromReadNames) Umis.extractUmisFromReadName(fqs.head.name, strict=true) | ||
| else if (extractUmisFromReadComment) fqs.head.comment.flatMap(comment => Umis.extractUmisFromReadComment(comment, strict=true)) |
Since strict is set to true and is not configurable at the CLI, should the CLI usage describe the constraints imposed by strict extraction?
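One way to reason about what the CLI docs would need to say is to isolate the selection logic from the diff. The sketch below uses hypothetical stand-ins (`Fq`, `headerUmi`, and the two extractor parameters) rather than the real fgbio types, and assumes the two flags are mutually exclusive as the `mutex` argument suggests.

```scala
object HeaderUmiSketch {
  // Hypothetical stand-in for a FASTQ record: a read name plus an optional comment.
  final case class Fq(name: String, comment: Option[String])

  /** Picks the header UMI source based on two mutually exclusive flags.
    *
    * The extractor functions are placeholders for the name and comment extraction
    * methods; a record with no comment yields None when extracting from the comment.
    */
  def headerUmi(fq: Fq, fromName: Boolean, fromComment: Boolean)(
    nameExtractor: String => Option[String],
    commentExtractor: String => Option[String]
  ): Option[String] = {
    require(!(fromName && fromComment), "Name and comment extraction are mutually exclusive.")
    if (fromName) nameExtractor(fq.name)
    else if (fromComment) fq.comment.flatMap(commentExtractor)
    else None
  }
}
```

The point of the sketch is that a missing comment silently produces no header UMI, while a malformed comment depends entirely on how strict the extractor is, which is the constraint the usage text may want to spell out.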
No description provided.