Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add iterator-based API for interacting with DNA #24

Merged
merged 11 commits into from
Dec 2, 2022
Merged

Conversation

swooster
Copy link
Contributor

This provides a more polished version of the API shown in in #19, hopefully making it easier to work with DNA stored in more kinds of data structures. I've also included a benchmark for using this API to generate windows with fewer allocations, though it doesn't fully satisfy #20 because we still store AAs as &strs, preventing us from using Rust's standard windowing.

@swooster
Copy link
Contributor Author

One thing I should mention that may be undesirable about this API is that it's an extension trait on things that implement (more-or-less) IntoIterator<Item=impl NucleotideLike>. The good news is that it means instead of having to do dna.iter().copied().reverse_complement() you can just do (&dna).reverse_complement(), but that comes with a potential footgun: The most obvious thing, dna.reverse_complement() consumes dna, and perhaps it's not entirely obvious you can work around that with (&dna).reverse_complement() or dna.iter().reverse_complement(). Also, IMHO it's less like standard Rust APIs to make iterator adapters directly available on data structures... Still, if the goal is to provide a DnaSequence<N>-like API for arbitrary data-structures, perhaps that justifies it?

@swooster
Copy link
Contributor Author

swooster commented Nov 30, 2022

Also, I'm not sure whether or not it makes sense to include the benchmarks commit; I just wanted to show that using this API is fast enough in practice.

Copy link
Contributor

@vgel vgel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few small inline comments, but also not sure about how this factors into the API as a whole.

Right now, looking at DnaSequence, we have:

impl DnaSequence {
  ...
  fn translate(&self, table) -> ProteinSequence;
  fn translate_self_frames(&self, table) -> SmallVec<[ProteinSequence; 3]>;
  fn translate_all_frames(&self, table) -> SmallVec<[ProteinSequence; 6]>;
  fn reverse_complement(&self) -> Self;
  ...
}

Then we have the new iterator-adaptor methods: codons, complement, reading_frames, and reverse_complement.

Since DnaSequence doesn't implement the appropriate IntoIterator traits (which it probably should regardless of this PR, it's an oversight that it doesn't), none of these methods work on it without .as_slice() or .iter(). But, if it did, suddenly there's a bunch of duplicate ways to do things and nothing is really unified:

dna.translate(table) could also be dna.codons().map(table.to_fn()), but one returns an allocated sequence and one an iterator.

dna.translate_self_frames(table) is also dna.reading_frames().map(|codons| codons.map(table.to_fn()).collect())

There's no simple equivalent to .translate_all_frames, since .reading_frames() only does the forward frames.

.reverse_complement() has a name conflict (I think the inherent impl wins here?)


I know we want to get this merged for the perf improvements, but I'm not super happy about how it changes the API / prevents us from adding the appropriate IntoIterator impl for DnaSequence.

What do you think about making these pure iterator adapters (only on Iterator instead of IntoIterator) instead? Then at least you need to call .iter() so it's slightly more consistent that these methods return iterators. Alternatively, we could name them all iter_codons, iter_reverse_complement, etc., but that seems... messy.

src/iter.rs Outdated Show resolved Hide resolved
src/iter.rs Show resolved Hide resolved
src/iter.rs Show resolved Hide resolved
@swooster
Copy link
Contributor Author

swooster commented Dec 1, 2022

I'm not necessarily against changing things to apply to Iterators instead of IntoIterators; admittedly that feels a bit more "properly" Rust-y. One of the goals of the initial prototype was to provide an extension trait that made other types as convenient to work with as DnaSequence so there'd be no disincentive to use whichever data structure best suited the performance needs of any particular situation. As a result, the initial extension trait made methods like reverse_complement available directly on the data structure. Though, now that it's implemented for any nucleotide iterator, perhaps having to do dna.iter().reverse_complement() is more idiomatic for Rust code.

As for there being two ways to do things, that's intentional. I really dislike API changes that force a boil-the-ocean-in-one-go switchover. I want existing code to keep working unchanged, but I'd love to be able to submit bitesized PRs to incrementally switch things to a more iteratory approach.

@swooster swooster requested a review from vgel December 1, 2022 20:39
@swooster
Copy link
Contributor Author

swooster commented Dec 1, 2022

Hmm, given that the extension trait is now only implemented for Iterators, iter.reverse_complement() doesn't really provide an advantage over iter.rev().complement() anymore so I've removed the former.

Now that the extension trait is only implemented on on iterators,
.reverse_complement() doesn't provide any convenience over
.rev().complement() so it's probably better to just rely on the
decoupled version.
Copy link
Contributor

@vgel vgel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed on keeping the old API, at least for now -- my concern was only about the inconsistency of having both types of methods intermixed on the same struct. I imagine we'll iterate (no pun intended) on the iterator-based API a bit before wanting to migrate everything over to it.

Hmm, given that the extension trait is now only implemented for Iterators, iter.reverse_complement() doesn't really provide an advantage over iter.rev().complement() anymore so I've removed the former.

I actually disagree, I think since it's a common operation it makes sense to have as a utility method. Otherwise, LGTM!

src/iter.rs Outdated
Comment on lines 8 to 10
// It's much easier for the compiler to reason about chained Nucleotides iterator
// adapters if we explicitly state that ToNucleotideLike is idempotent.
Self::NucleotideType: ToNucleotideLike<NucleotideType = Self::NucleotideType>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Q since I'm not familiar with this trick: Does this just give better error messages? Faster compiles? Better codegen (presumably not)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a bug fix that I discovered I needed when implementing the ForwardOrRcCodons enum. When trying to impl Iterator for ForwardOrRcCodons, I ran into the problem that the compiler wanted me to add a large ugly where clause to prove that both Codons<I> and Codons<Complement<Rev<I>> iterated over the same type of item, whereas that feels like it should be obviously true given that RC shouldn't really change what something is iterating over. It took a bit of debugging to figure out why the compiler wasn't automatically inferring that: Technically you could add a bunch of ToNucleotideLike definitions in a big loop that never really converges to a particular value, so the compiler was correct that Codons<I> and Codons<Complement<Rev<I>> weren't guaranteed to iterate over the same type of item. Adding this fixes that and some other situations I discovered during debugging where chained NucleotideIter adapters weren't being recognized as nucleotide iterators.

This is to remove the footgun of iter.reverse_complement() returning
something different from iter.rev().complement().
@swooster swooster merged commit 312eb09 into main Dec 2, 2022
@vgel vgel deleted the swooster/iters branch December 15, 2022 20:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants