Implement `find` as an iterator in `SBT` to provide library users more options #2665

morsecodist · 2023-07-03T16:05:32Z

Hello! I work on Chan Zuckerberg Infectious Disease and we are working on using sourmash for removing redundant sequences. You can take a look at some of our efforts here, though we are still in the exploratory stages so there is not much there in the way of documentation. We are using an approach where if any sequence we already have is closer than the threshold to a given sequence we want to omit that sequence. I am using sourmash as a library for this. I have implemented my own slightly different SBT heavily based on what you already have here for our use case. I think that all or some of this may make a nice addition to the library.

One simple piece of low hanging fruit that I feel would benefit all users of the library is implementing find as an iterator. If we find a similar sequence we want to stop the search instead of continuing to search the whole tree. I was going to implement a method that either checks for the presence of a signature or returns the first one. However, I realized an iterator solves a lot of these use cases simultaneously. Users of the library will be able to stop the search after an arbitrary number of matches as well as do things like filtering and mapping without adding filtered elements to the final vector.

This is a minimal addition. I didn't add an iterator version of find to the Index trait yet but if you are OK with going in this direction we can merge in this change and I would be happy to implement more iterators for more Indexes. We can also use the default implementation which already uses an iterator and converts it to a vector.

I didn't end up writing new tests because I modified the existing find method to depend on the iterator implementation. This way it is covered by the existing find test cases.

ctb · 2023-07-03T17:27:35Z

hi @morsecodist this sounds fantastic! thanks for engaging!

note to @luizirber this is touching the Rust code / implementation of SBT!

(I've approved your PR for GitHub actions, also.)

A few hot takes -

first, no concerns on my side about switching find over to being an iterator in general; I want to remind myself of the find code details before committing to it, but seems fine!
On that front, I note that find is implemented as a generator on the base Python Index class already, which seems as or more general than an iterator, so this should fit in ok with the current Python API?
you might also be interested in explore truncation of search results based on overlap in the Index.find(...) protocol. #1925, where I (conceptually, and with the aid of a test - see [MRG] add some tests for Jaccard output ordering #1926) explore truncating of search results.

thanks again!

codecov · 2023-07-03T17:32:09Z

Codecov Report

Attention: Patch coverage is 0% with 1 lines in your changes are missing coverage. Please review.

Project coverage is 86.78%. Comparing base (fe4790f) to head (9582770).

Files	Patch %	Lines
src/core/src/index/mod.rs	0.00%	1 Missing ⚠️

Additional details and impacted files

@@           Coverage Diff           @@
##           latest    #2665   +/-   ##
=======================================
  Coverage   86.78%   86.78%           
=======================================
  Files         136      136           
  Lines       15524    15524           
  Branches     2626     2626           
=======================================
  Hits        13472    13472           
  Misses       1751     1751           
  Partials      301      301

Flag	Coverage Δ
hypothesis-py	`25.53% <ø> (ø)`
python	`92.86% <ø> (ø)`
rust	`61.21% <0.00%> (ø)`

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

morsecodist · 2023-07-03T18:01:30Z

@ctb Thanks for the quick response and for linking. My tree implementation makes use of the principle from this issue. I would love to discuss folding that in though if there is a better place than this PR let me know. My tree has a few key differences:

It uses containment to avoid searching down certain tree branches. This was a huge deal for our use case because we are only adding sequences to the tree if they are far away from the existing sequences so we are likely to be able to give up fairly early. This works for both jaccard and containment search functions because jaccard will always be less than or equal to containment. However we run into difficulties with the API because we can't use this method given an arbitrary search function. We would need to somehow constrain it or perhaps require two search functions with the properties we need.
It doesn't differentiate between internal and leaf nodes. This lets me add to the tree without shuffling. Each node serves a dual role as an internal node and a leaf node. This seemed to make a difference to performance.
It uses rayon to search levels of the tree in parallel.
It support arbitrary branching factors. I noted that tweaking the branching factor can make a bit of a performance difference as well, especially when combined with per-level parallelism.

I would be happy to go into more detail about it. I was planning on adding some of these elements separately either to the core SBT tree or making a new implementation of it with these properties. This is all compatible with find as an iterator so I think it may make sense to land this less disruptive change first then maybe build some more optimizations on top of it.

ctb · 2023-07-03T18:08:11Z

hey, and look, all your tests passed 🎉

I also forgot to mention this hot take: we have an early-stage but (in my experience) quite functional plug-in system, see #2428 and #2438. It should be easy to add your own experimental Index implementations without contributing to core, if and as you wish to go that way. Happy to chat more about this as you have questions.

I like your incrementalist approach!

luizirber · 2023-07-03T23:31:24Z

Wholeheartedly agree with the find changes! The Index trait design still suffers from being one of the first things I tried out/designed in Rust... So Iterators make a lot of sense where you added them =]

In #2230 I... removed the SBT code. Because I figured no one was using it. Should it be kept around?
(I should also probably split #2230 into the RevIndex/mastiff implementation, and all the other changes to the API, including some concepts available in Python, like Manifests and Picklists...)

morsecodist · 2023-07-05T18:47:21Z

@luizirber for my tree I actually am not using any of the SBT code and it is much more minimal than what you have for SBTs. From what it sounds like if you are open to having this index type there are two ways forward:

I could implement it as an SBT and keep the existing SBT code
I could add it as a new more slimmed down tree index implementation

Would you be OK with either of these and if so which would you prefer.

Also it sounds like you are happy with a full refactor of find and search to iterators rather than just this one change. I am happy to do that as well as a separate PR. So two total changes:

A refactor of the Index Rust code to turn find and search into iterators and updating relevant implementations
Folding in a tree based index with these optimizations

Does that make sense as a breakdown?

ctb · 2023-08-22T14:07:18Z

hi @morsecodist please see #2230 (comment) - @luizirber is splitting out the big PR into many small bits!

In particular, for now, we suggest going this route,

I could add it as a new more slimmed down tree index implementation

and keeping it all in Rust - no need to expose to Python. You can keep your tree index separated and not in core - use sourmash-rust as dep here. In terms of exposing it to Python, you could use plugins for that, if it's desired; see https://github.com/sourmash-bio/pyo3_branchwater for my nascent efforts there on other fronts.

In re:

Also it sounds like you are happy with a full refactor of find and search to iterators rather than just this one change. I am happy to do that as well as a separate PR.

we are definitely on board with this, but @luizirber has some thoughts about how to make your life easier here by splitting off & merging the removal of SBT and BIGSI in #2230 so you don't need to deal with that code.

In brief: please let us know if/when this PR is ready for full review!

@luizirber let me know if I got that all right 😆

(spun off #2230, might help with #2665) BIGSI and SBT are prototype-level code (and some of the first Rust code I wrote...), and mostly makes it harder to change/refactor other parts of the codebase. Can bring it back later in the future if needed, but `mastiff` cover many of the same use cases.

morsecodist added 4 commits July 3, 2023 08:28

find iter lifetimes

0f713c2

docs

fd17e0b

author

bc854a5

initial

857404f

morsecodist changed the title ~~Find iter lifetimes~~ Implement find as an iterator in SBT to provide library users more options Jul 3, 2023

ctb added the rust label Jul 3, 2023

test updates

97030e1

luizirber mentioned this pull request Aug 21, 2023

Feat: on-disk RevIndex based on RocksDB #2230

Merged

luizirber mentioned this pull request Aug 23, 2023

Remove BIGSI and SBT code from Rust layer #2732

Merged

luizirber added 2 commits December 4, 2023 18:50

Merge remote-tracking branch 'origin/latest' into find-iter-lifetimes

71fdee3

Merge remote-tracking branch 'origin/latest' into find-iter-lifetimes

9582770

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Implement `find` as an iterator in `SBT` to provide library users more options #2665

Implement `find` as an iterator in `SBT` to provide library users more options #2665

morsecodist commented Jul 3, 2023

ctb commented Jul 3, 2023

codecov bot commented Jul 3, 2023 •

edited

Loading

morsecodist commented Jul 3, 2023

ctb commented Jul 3, 2023

luizirber commented Jul 3, 2023 •

edited

Loading

morsecodist commented Jul 5, 2023

ctb commented Aug 22, 2023

Implement find as an iterator in SBT to provide library users more options #2665

Are you sure you want to change the base?

Implement find as an iterator in SBT to provide library users more options #2665

Conversation

morsecodist commented Jul 3, 2023

ctb commented Jul 3, 2023

codecov bot commented Jul 3, 2023 • edited Loading

Codecov Report

morsecodist commented Jul 3, 2023

ctb commented Jul 3, 2023

luizirber commented Jul 3, 2023 • edited Loading

morsecodist commented Jul 5, 2023

ctb commented Aug 22, 2023

Implement `find` as an iterator in `SBT` to provide library users more options #2665

Implement `find` as an iterator in `SBT` to provide library users more options #2665

codecov bot commented Jul 3, 2023 •

edited

Loading

luizirber commented Jul 3, 2023 •

edited

Loading