-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Unreachable code reached while inserting vectors #4
Comments
Hey @vadixidav, I created a minimal reproducible example with fairly simple values. I was going to give you the vectors of 384 dimensions, but the bug can be triggered with smaller ones. I tried with different distance functions, but I may be missing something in this /// This is a minimal reproducible example to make
/// the hgg crate trigger an internal crash.
///
/// ```toml
/// [dependencies]
/// hgg = "0.4.1"
/// space = "0.17.0"
/// ```
use hgg::Hgg;
use space::{KnnInsert, Metric};
fn main() {
let mut hgg: Hgg<DotProduct, Vec<f32>, u32> = Hgg::default();
hgg.insert(vec![0.1, 1.0, 0.2], 0);
hgg.insert(vec![0.0, 0.5, 0.2], 1);
hgg.insert(vec![0.2, 1.0, 1.2], 2);
}
#[derive(Default)]
pub struct DotProduct;
impl Metric<Vec<f32>> for DotProduct {
type Unit = u32;
fn distance(&self, a: &Vec<f32>, b: &Vec<f32>) -> Self::Unit {
let dist: f32 = a.iter().zip(b).map(|(a, b)| a * b).sum();
let dist = 1.0 - dist;
debug_assert!(!dist.is_nan());
dist.to_bits()
}
} It looks like this Euclidean function works. There is no issue when I try my previous example with this #[derive(Default)]
pub struct Euclidean;
impl Metric<Vec<f32>> for Euclidean {
type Unit = u32;
fn distance(&self, a: &Vec<f32>, b: &Vec<f32>) -> Self::Unit {
let squared: f32 = a.iter().zip(b).map(|(a, b)| (a - b).powi(2)).sum();
let dist = squared.sqrt();
debug_assert!(!dist.is_nan());
dist.to_bits()
}
} |
It looks like cosine distance is not a true distance metric as per https://towardsdatascience.com/what-is-metric-74b0bf6e862.
That being said, I believe that L2 (euclidean) or L1 (manhattan/hamming) distance is typically preferred. |
Please refer to this section of the wikipedia page: https://en.wikipedia.org/wiki/Cosine_similarity#Angular_distance_and_similarity. To compute the distance for your vectors in this situation, you will need to get the angular distance. It is a bit more expensive to compute though, so I would recommend considering euclidean. |
3825: Accept semantic vectors and allow users to query nearest neighbors r=Kerollmops a=Kerollmops This Pull Request brings a new feature to the current API. The engine accepts a new `_vectors` field akin to the `_geo` one. This vector is stored in Meilisearch and can be retrieved via search. This work is the first step toward hybrid search, bringing the best of both worlds: keyword and semantic search ❤️🔥 ## ToDo - [x] Make it possible to get the `limit` nearest neighbors from a user-generated vector by using the `vector` field of search route. - [x] Delete the documents and vectors from the HNSW-related data structures. - [x] Do it the slow and ugly way (we need to be able to iterate over all the values). - [ ] Do it the efficient way (Wait for a new method or implement it myself). - [ ] ~~Move from the `hnsw` crate to the hgg one~~ The hgg crate is too slow. Meilisearch takes approximately 88s to answer a query. It is related to the time it takes to deserialize the `Hgg` data structure or search in it. I didn't take the time to measure precisely. We moved back to the hnsw crate which takes approximately 40ms to answer. - [ ] ~~Wait for a fix for rust-cv/hgg#4 - [x] Fix the current dot product function. - [x] Fill in the other `SearchResult` fields. - [x] Remove the `hnsw` dependency of the meilisearch crate. - [x] Fix the pages by taking the offset into account. - [x] Release a first prototype https://github.com/meilisearch/product/discussions/621#discussioncomment-6183647 - [x] Make the pagination and filtering faster and more correct. - [x] Return the original vector in the output search results (like `query`). - [x] Return an `_semanticSimilarity` field in the documents (it's a dot product) - [x] Return this score even if the `_vectors` field is not displayed - [x] Rename the field `_semanticScore`. - [ ] Return the `_geoDistance` value even if the `_geo` field is not displayed - [x] Store the HNSW on possibly multiple LMDB values. - [ ] Measure it and make it faster if needed - [ ] Export the `ReadableSlices` type into a small external crate - [x] Accept an `_vectors` field instead of the `_vector` one. - [x] Normalize all vectors. - [ ] Remove the `_vectors` field from the default searchable attributes (as we do with `_geo`?). - [ ] Correctly compute the candidates by remembering the documents having a valid `_vectors` field. - [ ] Return the right errors: - [ ] Return an error when the query vector is not the same length as the vectors in the HNSW. - [ ] We must return the user document id that triggered the vector dimension issue. - [x] If an indexation error occurs. - [ ] Fix the error codes when using the search route. - [ ] ~~Introduce some settings:~~ We currently ensure that the vector length is consistent over the whole set of documents and return an error for when a vector dimension doesn't follow the current number of dimensions. - [ ] The length of the vector the user will provide. - [ ] The distance function (we only support dot as of now). - [ ] Introduce other distance functions - [ ] Euclidean - [ ] Dot Product - [ ] Cosine - [ ] Make them SIMD optimized - [ ] Give credit to qdrant - [ ] Add tests. - [ ] Write a mini spec. - [ ] Release it in v1.3 as an experimental feature. Co-authored-by: Clément Renault <[email protected]> Co-authored-by: Kerollmops <[email protected]>
Hey!
When I reached this unreachable panic, I was trying to insert points into an
Hgg<DotProduct, Vec<f32>, u32>
data structure. TheDotProduct
data structure is pretty easy but I am not sure about its correctness. Does it validate the triangle inequality? My vectors have 384 dimensions. Should they be normalized (I think they are)?Could there be a way to move the hgg crate to the latest version of the space crate? As the latest v0.18 supports raw f32 as distance values.
EDIT: I tried with the latest version on main (6d1eacd) and the bug is always present.
The text was updated successfully, but these errors were encountered: