feat: FastqToBam can extract UMI(s) from the comment in the read name #989

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

Sign up for GitHub

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Jump to bottom

Open

nh13 wants to merge 3 commits into main from nh13/fastq-to-bam-umi-from-read-comment

Member

nh13 commented May 20, 2024

No description provided.


          feat: FastqToBam can extract UMI(s) from the comment in the read name

a316211

nh13 temporarily deployed to github-actions

May 20, 2024 16:53

— with

GitHub Actions Inactive

nh13 requested review from msto and tfenne

May 20, 2024 16:53

nh13 commented

View reviewed changes

src/main/scala/com/fulcrumgenomics/fastq/FastqToBam.scala Outdated Show resolved Hide resolved

nh13 commented

View reviewed changes

src/test/scala/com/fulcrumgenomics/umi/UmisTest.scala Outdated Show resolved Hide resolved

nh13 commented

View reviewed changes

src/test/scala/com/fulcrumgenomics/umi/UmisTest.scala Outdated Show resolved Hide resolved

nh13 commented

View reviewed changes

src/test/scala/com/fulcrumgenomics/umi/UmisTest.scala Outdated Show resolved Hide resolved


          Apply suggestions from code review

05e573a

nh13 temporarily deployed to github-actions

May 20, 2024 16:55

— with

GitHub Actions Inactive

nh13 temporarily deployed to github-actions

May 20, 2024 16:55

— with

GitHub Actions Inactive

nh13 commented

View reviewed changes

src/test/scala/com/fulcrumgenomics/umi/UmisTest.scala Outdated Show resolved Hide resolved


          Update src/test/scala/com/fulcrumgenomics/umi/UmisTest.scala

9856fe1

nh13 temporarily deployed to github-actions

May 20, 2024 16:56

— with

GitHub Actions Inactive

nh13 temporarily deployed to github-actions

May 20, 2024 16:56

— with

GitHub Actions Inactive

codecov bot commented May 20, 2024 •

edited

Loading

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 95.63%. Comparing base (ba0788e) to head (9856fe1).
Report is 12 commits behind head on main.

Additional details and impacted files

@@           Coverage Diff           @@
##             main     #989   +/-   ##
=======================================
  Coverage   95.62%   95.63%           
=======================================
  Files         126      126           
  Lines        7364     7380   +16     
  Branches      500      498    -2     
=======================================
+ Hits         7042     7058   +16     
  Misses        322      322

Flag	Coverage Δ
unittests	`95.63% <100.00%> (+<0.01%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

msto reviewed

View reviewed changes

src/test/scala/com/fulcrumgenomics/fastq/FastqToBamTest.scala

Comment on lines +412 to +413

    
                  recs(0).apply[String]("RX") shouldBe "ACGT-CGTA-GG-CC"

                  recs(1).apply[String]("RX") shouldBe "TTGA-TAAT-TA-AA"

Contributor

msto May 20, 2024

Why are these suffixed with -GG-CC and -TA-AA?

Member Author

nh13 May 20, 2024

As per the method docs, the UMIs may be extracted from the read names, the read sequences, or both. In this case, the read structure shows UMI bases in the read sequences themselves, as well as the comment in the read name header, so we get four (!) UMI segments, two from the read sequences, and two from the comment in the read header.

nh13 requested a review from msto

May 20, 2024 22:12

nh13 assigned tfenne and unassigned tfenne

nh13 assigned msto

msto reviewed

View reviewed changes

src/main/scala/com/fulcrumgenomics/fastq/FastqToBam.scala

    
                @arg(flag='q', doc="Tag in which to store molecular barcode/UMI qualities.")                 val umiQualTag: Option[String] = None,

                @arg(flag='Q', doc="Store the sample barcode qualities in the QT Tag.")                      val storeSampleBarcodeQualities: Boolean = false,

                @arg(flag='n', doc="Extract UMI(s) from read names and prepend to UMIs from reads.")         val extractUmisFromReadNames: Boolean = false,

                @arg(flag='n', doc="Extract UMI(s) from read names and prepend to UMIs from reads.", mutex=Array("extractUmisFromReadComment"))

Contributor

msto Mar 5, 2025

TIL about the mutex parameter to @arg!

Contributor

msto Mar 12, 2025

How are mutually exclusive args rendered in the CLI help doc? Is it worth noting that the two arguments are mutually exclusive in the usage above, or will that be automatically noted in the help for the respective flags?

src/main/scala/com/fulcrumgenomics/umi/Umis.scala

    
                /**

                  * Extracts the UMI from an Illumina fastq style read name.  Illumina documents their FASTQ read names as:

                  *   @<instrument>:<run number>:<flowcell ID>:<lane>:<tile>:<x-pos>:<y-pos>:<UMI> <read>:<is filtered>:<control number>:<index>

                  *   `@<instrument>:<run number>:<flowcell ID>:<lane>:<tile>:<x-pos>:<y-pos>:<UMI> <read>:<is filtered>:<control number>:<index>``

Contributor

msto Mar 12, 2025

Is that an extra backtick at the end?

src/main/scala/com/fulcrumgenomics/umi/Umis.scala

Comment on lines +92 to +93

    
                  * Extracts the UMI from an Illumina fastq style read name.  Illumina documents their FASTQ read names as:

                  *   `@<instrument>:<run number>:<flowcell ID>:<lane>:<tile>:<x-pos>:<y-pos>:<UMI> <read>:<is filtered>:<control number>:<index>``

Contributor

msto Mar 12, 2025

Should this be updated to reflect the expected structure of the read comment?

src/main/scala/com/fulcrumgenomics/umi/Umis.scala

    
                  * If `strict` is false the last segment is returned so long as it appears to be a valid UMI.

                  */

                def extractUmisFromReadComment(comment: String, delimiter: Char = ':', strict: Boolean): Option[String] = {

                  // If strict, check that the read name actually has eight parts, which is expected

Contributor

msto Mar 12, 2025

Suggested change

      
                // If strict, check that the read name actually has eight parts, which is expected
          
                // If strict, check that the read name actually has four parts, which is expected

src/main/scala/com/fulcrumgenomics/umi/Umis.scala

Comment on lines +102 to +103

    
                  *  If `strict` is true the comment _must_ contain either 4 colon-separated segments,

                  *  with the UMI being the last in the case of 4.

Contributor

msto Mar 12, 2025

I'm far enough removed from the original motivation for this PR that I don't recall - is there a convention for the comment to have four segments?

src/main/scala/com/fulcrumgenomics/umi/Umis.scala

    
                  *  field may contain multiple UMIs, in which case they will delimit them with `+` characters.  Pluses will be

                  *  translated to hyphens before returning.

                  *

                  *  If `strict` is true the comment _must_ contain either 4 colon-separated segments,

Contributor

msto Mar 12, 2025

Suggested change

      
                *  If `strict` is true the comment _must_ contain either 4 colon-separated segments,
          
                *  If `strict` is true the comment _must_ contain 4 colon-separated segments,

As far as I can tell, there's no other permissible condition when strict is true.

strict also controls the validation - should the details of what constitutes a valid UMI be included here as well?

src/main/scala/com/fulcrumgenomics/fastq/FastqToBam.scala

Comment on lines +177 to +178

    
                      if (extractUmisFromReadNames) Umis.extractUmisFromReadName(fqs.head.name, strict=true)

                      else if (extractUmisFromReadComment) fqs.head.comment.flatMap(comment => Umis.extractUmisFromReadComment(comment, strict=true))

Contributor

msto Mar 12, 2025

Since strict is set to true and is not configurable at the CLI, should the CLI usage describe the constraints imposed by strict extraction?

msto assigned nh13 and unassigned msto

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet