-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add iterator-based API for interacting with DNA #24
Conversation
One thing I should mention that may be undesirable about this API is that it's an extension trait on things that implement (more-or-less) |
Also, I'm not sure whether or not it makes sense to include the benchmarks commit; I just wanted to show that using this API is fast enough in practice. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few small inline comments, but also not sure about how this factors into the API as a whole.
Right now, looking at DnaSequence
, we have:
impl DnaSequence {
...
fn translate(&self, table) -> ProteinSequence;
fn translate_self_frames(&self, table) -> SmallVec<[ProteinSequence; 3]>;
fn translate_all_frames(&self, table) -> SmallVec<[ProteinSequence; 6]>;
fn reverse_complement(&self) -> Self;
...
}
Then we have the new iterator-adaptor methods: codons, complement, reading_frames, and reverse_complement.
Since DnaSequence
doesn't implement the appropriate IntoIterator
traits (which it probably should regardless of this PR, it's an oversight that it doesn't), none of these methods work on it without .as_slice()
or .iter()
. But, if it did, suddenly there's a bunch of duplicate ways to do things and nothing is really unified:
dna.translate(table)
could also be dna.codons().map(table.to_fn())
, but one returns an allocated sequence and one an iterator.
dna.translate_self_frames(table)
is also dna.reading_frames().map(|codons| codons.map(table.to_fn()).collect())
There's no simple equivalent to .translate_all_frames
, since .reading_frames()
only does the forward frames.
.reverse_complement()
has a name conflict (I think the inherent impl wins here?)
I know we want to get this merged for the perf improvements, but I'm not super happy about how it changes the API / prevents us from adding the appropriate IntoIterator impl for DnaSequence.
What do you think about making these pure iterator adapters (only on Iterator
instead of IntoIterator
) instead? Then at least you need to call .iter()
so it's slightly more consistent that these methods return iterators. Alternatively, we could name them all iter_codons
, iter_reverse_complement
, etc., but that seems... messy.
Also removed unecessary type bound on Iterator implementation of ForwardOrRcCodons.
5250dbb
to
47b9188
Compare
I'm not necessarily against changing things to apply to As for there being two ways to do things, that's intentional. I really dislike API changes that force a boil-the-ocean-in-one-go switchover. I want existing code to keep working unchanged, but I'd love to be able to submit bitesized PRs to incrementally switch things to a more iteratory approach. |
dd06691
to
6e79826
Compare
Hmm, given that the extension trait is now only implemented for |
Now that the extension trait is only implemented on on iterators, .reverse_complement() doesn't provide any convenience over .rev().complement() so it's probably better to just rely on the decoupled version.
b250956
to
a0c1c5e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed on keeping the old API, at least for now -- my concern was only about the inconsistency of having both types of methods intermixed on the same struct. I imagine we'll iterate (no pun intended) on the iterator-based API a bit before wanting to migrate everything over to it.
Hmm, given that the extension trait is now only implemented for
Iterator
s,iter.reverse_complement()
doesn't really provide an advantage overiter.rev().complement()
anymore so I've removed the former.
I actually disagree, I think since it's a common operation it makes sense to have as a utility method. Otherwise, LGTM!
src/iter.rs
Outdated
// It's much easier for the compiler to reason about chained Nucleotides iterator | ||
// adapters if we explicitly state that ToNucleotideLike is idempotent. | ||
Self::NucleotideType: ToNucleotideLike<NucleotideType = Self::NucleotideType>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Q since I'm not familiar with this trick: Does this just give better error messages? Faster compiles? Better codegen (presumably not)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a bug fix that I discovered I needed when implementing the ForwardOrRcCodons
enum. When trying to impl Iterator
for ForwardOrRcCodons
, I ran into the problem that the compiler wanted me to add a large ugly where
clause to prove that both Codons<I>
and Codons<Complement<Rev<I>>
iterated over the same type of item, whereas that feels like it should be obviously true given that RC shouldn't really change what something is iterating over. It took a bit of debugging to figure out why the compiler wasn't automatically inferring that: Technically you could add a bunch of ToNucleotideLike
definitions in a big loop that never really converges to a particular value, so the compiler was correct that Codons<I>
and Codons<Complement<Rev<I>>
weren't guaranteed to iterate over the same type of item. Adding this fixes that and some other situations I discovered during debugging where chained NucleotideIter
adapters weren't being recognized as nucleotide iterators.
This reverts commit a0c1c5e.
This is to remove the footgun of iter.reverse_complement() returning something different from iter.rev().complement().
This provides a more polished version of the API shown in in #19, hopefully making it easier to work with DNA stored in more kinds of data structures. I've also included a benchmark for using this API to generate windows with fewer allocations, though it doesn't fully satisfy #20 because we still store AAs as
&str
s, preventing us from using Rust's standard windowing.