Add iterator-based API for interacting with DNA #24

swooster · 2022-11-27T20:44:11Z

This provides a more polished version of the API shown in in #19, hopefully making it easier to work with DNA stored in more kinds of data structures. I've also included a benchmark for using this API to generate windows with fewer allocations, though it doesn't fully satisfy #20 because we still store AAs as &strs, preventing us from using Rust's standard windowing.

swooster · 2022-11-27T21:33:28Z

One thing I should mention that may be undesirable about this API is that it's an extension trait on things that implement (more-or-less) IntoIterator<Item=impl NucleotideLike>. The good news is that it means instead of having to do dna.iter().copied().reverse_complement() you can just do (&dna).reverse_complement(), but that comes with a potential footgun: The most obvious thing, dna.reverse_complement() consumes dna, and perhaps it's not entirely obvious you can work around that with (&dna).reverse_complement() or dna.iter().reverse_complement(). Also, IMHO it's less like standard Rust APIs to make iterator adapters directly available on data structures... Still, if the goal is to provide a DnaSequence<N>-like API for arbitrary data-structures, perhaps that justifies it?

swooster · 2022-11-30T18:30:11Z

Also, I'm not sure whether or not it makes sense to include the benchmarks commit; I just wanted to show that using this API is fast enough in practice.

vgel

A few small inline comments, but also not sure about how this factors into the API as a whole.

Right now, looking at DnaSequence, we have:

impl DnaSequence {
  ...
  fn translate(&self, table) -> ProteinSequence;
  fn translate_self_frames(&self, table) -> SmallVec<[ProteinSequence; 3]>;
  fn translate_all_frames(&self, table) -> SmallVec<[ProteinSequence; 6]>;
  fn reverse_complement(&self) -> Self;
  ...
}

Then we have the new iterator-adaptor methods: codons, complement, reading_frames, and reverse_complement.

Since DnaSequence doesn't implement the appropriate IntoIterator traits (which it probably should regardless of this PR, it's an oversight that it doesn't), none of these methods work on it without .as_slice() or .iter(). But, if it did, suddenly there's a bunch of duplicate ways to do things and nothing is really unified:

dna.translate(table) could also be dna.codons().map(table.to_fn()), but one returns an allocated sequence and one an iterator.

dna.translate_self_frames(table) is also dna.reading_frames().map(|codons| codons.map(table.to_fn()).collect())

There's no simple equivalent to .translate_all_frames, since .reading_frames() only does the forward frames.

.reverse_complement() has a name conflict (I think the inherent impl wins here?)

I know we want to get this merged for the perf improvements, but I'm not super happy about how it changes the API / prevents us from adding the appropriate IntoIterator impl for DnaSequence.

What do you think about making these pure iterator adapters (only on Iterator instead of IntoIterator) instead? Then at least you need to call .iter() so it's slightly more consistent that these methods return iterators. Alternatively, we could name them all iter_codons, iter_reverse_complement, etc., but that seems... messy.

src/iter.rs

Also removed unecessary type bound on Iterator implementation of ForwardOrRcCodons.

swooster · 2022-12-01T19:42:52Z

I'm not necessarily against changing things to apply to Iterators instead of IntoIterators; admittedly that feels a bit more "properly" Rust-y. One of the goals of the initial prototype was to provide an extension trait that made other types as convenient to work with as DnaSequence so there'd be no disincentive to use whichever data structure best suited the performance needs of any particular situation. As a result, the initial extension trait made methods like reverse_complement available directly on the data structure. Though, now that it's implemented for any nucleotide iterator, perhaps having to do dna.iter().reverse_complement() is more idiomatic for Rust code.

As for there being two ways to do things, that's intentional. I really dislike API changes that force a boil-the-ocean-in-one-go switchover. I want existing code to keep working unchanged, but I'd love to be able to submit bitesized PRs to incrementally switch things to a more iteratory approach.

swooster · 2022-12-01T23:19:20Z

Hmm, given that the extension trait is now only implemented for Iterators, iter.reverse_complement() doesn't really provide an advantage over iter.rev().complement() anymore so I've removed the former.

Now that the extension trait is only implemented on on iterators, .reverse_complement() doesn't provide any convenience over .rev().complement() so it's probably better to just rely on the decoupled version.

vgel

Agreed on keeping the old API, at least for now -- my concern was only about the inconsistency of having both types of methods intermixed on the same struct. I imagine we'll iterate (no pun intended) on the iterator-based API a bit before wanting to migrate everything over to it.

Hmm, given that the extension trait is now only implemented for Iterators, iter.reverse_complement() doesn't really provide an advantage over iter.rev().complement() anymore so I've removed the former.

I actually disagree, I think since it's a common operation it makes sense to have as a utility method. Otherwise, LGTM!

vgel · 2022-12-02T21:05:31Z

src/iter.rs

+    // It's much easier for the compiler to reason about chained Nucleotides iterator
+    // adapters if we explicitly state that ToNucleotideLike is idempotent.
+    Self::NucleotideType: ToNucleotideLike<NucleotideType = Self::NucleotideType>,


Q since I'm not familiar with this trick: Does this just give better error messages? Faster compiles? Better codegen (presumably not)?

It's a bug fix that I discovered I needed when implementing the ForwardOrRcCodons enum. When trying to impl Iterator for ForwardOrRcCodons, I ran into the problem that the compiler wanted me to add a large ugly where clause to prove that both Codons and Codons<Complement<Rev> iterated over the same type of item, whereas that feels like it should be obviously true given that RC shouldn't really change what something is iterating over. It took a bit of debugging to figure out why the compiler wasn't automatically inferring that: Technically you could add a bunch of ToNucleotideLike definitions in a big loop that never really converges to a particular value, so the compiler was correct that Codons and Codons<Complement<Rev> weren't guaranteed to iterate over the same type of item. Adding this fixes that and some other situations I discovered during debugging where chained NucleotideIter adapters weren't being recognized as nucleotide iterators.

This reverts commit a0c1c5e.

This is to remove the footgun of iter.reverse_complement() returning something different from iter.rev().complement().

swooster added 2 commits November 27, 2022 12:04

Add iterator-based APIs for nucleotide sequences

3c023d1

Add benchmark for iterator-based window generation

c61f69a

swooster mentioned this pull request Nov 27, 2022

Allow non-owned Sequence types #19

Open

swooster requested review from vgel and lynn November 27, 2022 21:24

vgel reviewed Dec 1, 2022

View reviewed changes

src/iter.rs Outdated Show resolved Hide resolved

src/iter.rs Show resolved Hide resolved

src/iter.rs Show resolved Hide resolved

swooster added 3 commits November 30, 2022 21:27

Fix reversed codon iterators

1a6543b

Add all_reading_frames and rename reading_frames to self_reading_frames

683490e

Add explicit len implementations

47b9188

Also removed unecessary type bound on Iterator implementation of ForwardOrRcCodons.

swooster force-pushed the swooster/iters branch from 5250dbb to 47b9188 Compare December 1, 2022 06:41

Switch extension trait from IntoIterators to Interators

6e79826

swooster force-pushed the swooster/iters branch from dd06691 to 6e79826 Compare December 1, 2022 20:26

Add example to TranslationTable::to_fn

df75fce

swooster requested a review from vgel December 1, 2022 20:39

Fix comment to mention NucleotideIter instead of Nucleotides

0460089

Remove reverse_complement

a0c1c5e

Now that the extension trait is only implemented on on iterators, .reverse_complement() doesn't provide any convenience over .rev().complement() so it's probably better to just rely on the decoupled version.

swooster force-pushed the swooster/iters branch from b250956 to a0c1c5e Compare December 1, 2022 23:20

vgel approved these changes Dec 2, 2022

View reviewed changes

swooster added 2 commits December 2, 2022 13:57

Revert "Remove reverse_complement"

73588db

This reverts commit a0c1c5e.

Swap nesting of from Rev<Complement> to Complement<Rev>

106ebf1

This is to remove the footgun of iter.reverse_complement() returning something different from iter.rev().complement().

swooster merged commit 312eb09 into main Dec 2, 2022

vgel deleted the swooster/iters branch December 15, 2022 20:11

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add iterator-based API for interacting with DNA #24

Add iterator-based API for interacting with DNA #24

swooster commented Nov 27, 2022

swooster commented Nov 27, 2022

swooster commented Nov 30, 2022 •

edited

Loading

vgel left a comment

swooster commented Dec 1, 2022 •

edited

Loading

swooster commented Dec 1, 2022

vgel left a comment

vgel Dec 2, 2022

swooster Dec 2, 2022

Add iterator-based API for interacting with DNA #24

Add iterator-based API for interacting with DNA #24

Conversation

swooster commented Nov 27, 2022

swooster commented Nov 27, 2022

swooster commented Nov 30, 2022 • edited Loading

vgel left a comment

Choose a reason for hiding this comment

swooster commented Dec 1, 2022 • edited Loading

swooster commented Dec 1, 2022

vgel left a comment

Choose a reason for hiding this comment

vgel Dec 2, 2022

Choose a reason for hiding this comment

swooster Dec 2, 2022

Choose a reason for hiding this comment

swooster commented Nov 30, 2022 •

edited

Loading

swooster commented Dec 1, 2022 •

edited

Loading