Spare not needed groupBy when calling toFragments() on AlignmentDataset #2281

benraha · 2020-11-07T06:05:29Z

Hi!

I'm running a job that pre-processes a bunch of reads before aligning them with Bowtie. Most of them are unpaired, so when I run toFragments() they still go through a groupBy() that isn't actually needed. Is there a way to skip this groupBy?

Looking at the code, I think we could add a flag that records when we know for sure that the input is unpaired. When we're unsure, we'd do the groupBy anyway (or maybe let the user tell us by adding a parameter to loadAlignments).

I'd love to implement it.

WDYT?
Ben

benraha · 2020-11-10T17:34:48Z

@heuermh Would love your thoughts on that before I implement it.

heuermh · 2020-11-10T18:06:37Z

If all you want is a straight 1:1 conversion of Alignment to Fragment, there are the transmute/transmuteDataFrame/transmuteDataset APIs, e.g.

https://javadoc.io/static/org.bdgenomics.adam/adam-core-spark3_2.12/0.32.0/org/bdgenomics/adam/rdd/read/AlignmentDataset.html#transmute[X,Y%3C:Product,Z%3C:org.bdgenomics.adam.rdd.GenomicDataset[X,Y,Z]](tFn:org.apache.spark.api.java.function.Function[org.apache.spark.api.java.JavaRDD[T],org.apache.spark.api.java.JavaRDD[X]],convFn:org.apache.spark.api.java.function.Function2[V,org.apache.spark.rdd.RDD[X],Z]):Z

An example of this can be found in the unit tests
https://github.com/bigdatagenomics/adam/blob/master/adam-core/src/test/scala/org/bdgenomics/adam/rdd/read/AlignmentDatasetSuite.scala#L126
https://github.com/bigdatagenomics/adam/blob/master/adam-core/src/test/scala/org/bdgenomics/adam/rdd/read/AlignmentDatasetSuite.scala#L1543

I think a new method toUnpairedFragments() that leaves out the groupBy might be ok.
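
A minimal sketch of the difference, using plain Scala collections rather than Spark. `Read`, `Frag`, `toFragmentsGrouped`, and this `toUnpairedFragments` are hypothetical stand-ins, not ADAM's actual classes or methods; the point is only that the unpaired path is a 1:1 map with no grouping step.

```scala
object FragmentSketch {
  // Hypothetical, simplified stand-ins for ADAM's Alignment and Fragment records.
  final case class Read(name: String, sequence: String)
  final case class Frag(name: String, reads: Seq[Read])

  // General path: reads sharing a name are collected into one fragment.
  // On an RDD, this grouping is the step that triggers a shuffle.
  def toFragmentsGrouped(reads: Seq[Read]): Seq[Frag] =
    reads.groupBy(_.name).map { case (name, rs) => Frag(name, rs) }.toSeq

  // Unpaired path: every read becomes its own fragment, no grouping.
  // Only valid if no two reads belong to the same fragment.
  def toUnpairedFragments(reads: Seq[Read]): Seq[Frag] =
    reads.map(r => Frag(r.name, Seq(r)))
}
```

For input where every read name is unique, both functions produce the same set of fragments, which is why the grouping is wasted work in that case.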

Then for calling Bowtie, in Cannoli we have bowtie2, a function FragmentDataset → AlignmentDataset, and singleEndBowtie2, a function AlignmentDataset → AlignmentDataset. If starting from a mixed set of reads, you could pull out the unpaired reads and run them separately through singleEndBowtie2, so as not to incur the cost of toFragments, and then union the results together.

There isn't currently a singleEndBowtie in Cannoli but I doubt it would be difficult to add one.
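
The split-and-union idea could look roughly like this. It is a plain-Scala sketch, not Cannoli code: `Read`, `alignSingleEnd`, `alignPaired`, and `alignMixed` are hypothetical placeholders standing in for the real datasets and for the single-end and paired aligner wrappers.

```scala
object SplitSketch {
  // Hypothetical stand-in for an Alignment record; `paired` mirrors its read-paired flag.
  final case class Read(name: String, paired: Boolean, aligned: Boolean = false)

  // Placeholder for a single-end aligner that works directly on reads.
  def alignSingleEnd(reads: Seq[Read]): Seq[Read] =
    reads.map(_.copy(aligned = true))

  // Placeholder for a paired aligner; only this path groups reads by name.
  def alignPaired(reads: Seq[Read]): Seq[Read] =
    reads.groupBy(_.name).values.flatten.map(_.copy(aligned = true)).toSeq

  // Split the mixed input, run each part through its own path, then union.
  def alignMixed(reads: Seq[Read]): Seq[Read] = {
    val (paired, unpaired) = reads.partition(_.paired)
    alignSingleEnd(unpaired) ++ alignPaired(paired)
  }
}
```

Only the paired subset pays for the grouping; the unpaired reads never go through it.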

benraha · 2020-11-10T18:31:37Z

These are good, but I'd like to use the knowledge ADAM already has about the data instead of relying on the user to know it. Or is there some problem with this that I'm not aware of?

Something like this (taken from loadAlignments):

BAM -> unpaired
InterleavedFastQ -> paired
FASTQ -> paired / unpaired like ADAM works today
FASTA -> unpaired?
PARQUET -> can be paired
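
As a rough sketch of how that mapping could be carried around (all names here are made up for illustration and are not ADAM types; treating a second FASTQ file as the signal for paired input is also an assumption):

```scala
object PairingSketch {
  // What the loader can claim to know about pairing.
  sealed trait Pairing
  case object Paired extends Pairing
  case object Unpaired extends Pairing
  case object Unknown extends Pairing // not sure: keep the groupBy

  // Hypothetical input formats, following the list above.
  sealed trait Format
  case object Bam extends Format
  case object InterleavedFastq extends Format
  final case class Fastq(hasSecondFile: Boolean) extends Format
  case object Fasta extends Format
  case object Parquet extends Format

  // Encodes the proposed mapping as written; each case is an assumption.
  def pairingFor(format: Format): Pairing = format match {
    case Bam                  => Unpaired
    case InterleavedFastq     => Paired
    case Fastq(hasSecondFile) => if (hasSecondFile) Paired else Unpaired
    case Fasta                => Unpaired
    case Parquet              => Unknown
  }

  // Only the known-unpaired case can skip the grouping step.
  def needsGroupBy(pairing: Pairing): Boolean = pairing != Unpaired
}
```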

heuermh · 2020-11-10T21:51:37Z

Those assumptions can fall apart, though. From experience, BAM/CRAM/SAM files can contain paired reads, unpaired reads, aligned reads, and unaligned reads. It is common to use unaligned BAM (uBAM) in workflows instead of FASTQ because it compresses better.

We would of course encourage the use of Parquet because it compresses better, doesn't have problems with split guessing, can take advantage of predicate pushdown and column projection, and can be read and written concurrently in a distributed fashion across a cluster. 😉

That said, please feel free to suggest changes!

benraha changed the title ~~Save not needed groupBy when calling toFragments() on AlignmentDataset~~ Spare not needed groupBy when calling toFragments() on AlignmentDataset Nov 7, 2020

benraha mentioned this issue Nov 18, 2020

Added an optimisation to spare not required shuffle of unpaired reads #2282

Open
