Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unreachable code reached while inserting vectors #4

Open
Kerollmops opened this issue Jun 13, 2023 · 3 comments
Open

Unreachable code reached while inserting vectors #4

Kerollmops opened this issue Jun 13, 2023 · 3 comments

Comments

@Kerollmops
Copy link

Kerollmops commented Jun 13, 2023

Hey!

When I reached this unreachable panic, I was trying to insert points into an Hgg<DotProduct, Vec<f32>, u32> data structure. The DotProduct data structure is pretty easy but I am not sure about its correctness. Does it validate the triangle inequality? My vectors have 384 dimensions. Should they be normalized (I think they are)?

pub struct DotProduct;

impl Metric<Vec<f32>> for DotProduct {
    type Unit = u32;

    // TODO explain me this function, I don't understand why f32.to_bits is ordered.
    // I tried to do this and it wasn't OK <https://stackoverflow.com/a/43305015/1941280>
    //
    // Following <https://docs.rs/space/0.17.0/space/trait.Metric.html>.
    fn distance(&self, a: &Vec<f32>, b: &Vec<f32>) -> Self::Unit {
        let dist: f32 = a.iter().zip(b).map(|(a, b)| a * b).sum();
        let dist = 1.0 - dist;
        debug_assert!(!dist.is_nan());
        dist.to_bits()
    }
}

Could there be a way to move the hgg crate to the latest version of the space crate? As the latest v0.18 supports raw f32 as distance values.

EDIT: I tried with the latest version on main (6d1eacd) and the bug is always present.

@Kerollmops
Copy link
Author

Kerollmops commented Jun 14, 2023

Hey @vadixidav,

I created a minimal reproducible example with fairly simple values. I was going to give you the vectors of 384 dimensions, but the bug can be triggered with smaller ones. I tried with different distance functions, but I may be missing something in this DotProduct one.

/// This is a minimal reproducible example to make
/// the hgg crate trigger an internal crash.
///
/// ```toml
/// [dependencies]
/// hgg = "0.4.1"
/// space = "0.17.0"
/// ```
use hgg::Hgg;
use space::{KnnInsert, Metric};

fn main() {
    let mut hgg: Hgg<DotProduct, Vec<f32>, u32> = Hgg::default();

    hgg.insert(vec![0.1, 1.0, 0.2], 0);
    hgg.insert(vec![0.0, 0.5, 0.2], 1);
    hgg.insert(vec![0.2, 1.0, 1.2], 2);
}

#[derive(Default)]
pub struct DotProduct;

impl Metric<Vec<f32>> for DotProduct {
    type Unit = u32;

    fn distance(&self, a: &Vec<f32>, b: &Vec<f32>) -> Self::Unit {
        let dist: f32 = a.iter().zip(b).map(|(a, b)| a * b).sum();
        let dist = 1.0 - dist;
        debug_assert!(!dist.is_nan());
        dist.to_bits()
    }
}

It looks like this Euclidean function works. There is no issue when I try my previous example with this Euclidean metric struct. So it seems like the DotProduct distance function broke the triangle inequality law or something?

#[derive(Default)]
pub struct Euclidean;

impl Metric<Vec<f32>> for Euclidean {
    type Unit = u32;

    fn distance(&self, a: &Vec<f32>, b: &Vec<f32>) -> Self::Unit {
        let squared: f32 = a.iter().zip(b).map(|(a, b)| (a - b).powi(2)).sum();
        let dist = squared.sqrt();
        debug_assert!(!dist.is_nan());
        dist.to_bits()
    }
}

@vadixidav
Copy link
Member

vadixidav commented Jun 14, 2023

It looks like cosine distance is not a true distance metric as per https://towardsdatascience.com/what-is-metric-74b0bf6e862.

However, IIRC (though I am having trouble finding an answer online), if you normalize your vectors first before taking the cosine distance then they will satisfy the triangle inequality. Basically, just normalize them before taking the dot product and see if that improves things. I would recommend pre-normalizing all the vectors to avoid computationally expensive normalizing during search. Please make sure that no vector is zero when inserted, which would produce a NaN or some other non normal float.

That being said, I believe that L2 (euclidean) or L1 (manhattan/hamming) distance is typically preferred.

@vadixidav
Copy link
Member

Please refer to this section of the wikipedia page: https://en.wikipedia.org/wiki/Cosine_similarity#Angular_distance_and_similarity. To compute the distance for your vectors in this situation, you will need to get the angular distance. It is a bit more expensive to compute though, so I would recommend considering euclidean.

meili-bors bot added a commit to meilisearch/meilisearch that referenced this issue Jun 27, 2023
3825: Accept semantic vectors and allow users to query nearest neighbors r=Kerollmops a=Kerollmops

This Pull Request brings a new feature to the current API. The engine accepts a new `_vectors` field akin to the `_geo` one. This vector is stored in Meilisearch and can be retrieved via search. This work is the first step toward hybrid search, bringing the best of both worlds: keyword and semantic search ❤️‍🔥

## ToDo
 - [x] Make it possible to get the `limit` nearest neighbors from a user-generated vector by using the `vector` field of search route.
 - [x] Delete the documents and vectors from the HNSW-related data structures.
     - [x] Do it the slow and ugly way (we need to be able to iterate over all the values).
     - [ ] Do it the efficient way (Wait for a new method or implement it myself).
 - [ ] ~~Move from the `hnsw` crate to the hgg one~~ The hgg crate is too slow.
   Meilisearch takes approximately 88s to answer a query. It is related to the time it takes to deserialize the `Hgg` data structure or search in it. I didn't take the time to measure precisely. We moved back to the hnsw crate which takes approximately 40ms to answer.
   - [ ] ~~Wait for a fix for rust-cv/hgg#4
 - [x] Fix the current dot product function.
 - [x] Fill in the other `SearchResult` fields.
 - [x] Remove the `hnsw` dependency of the meilisearch crate.
 - [x] Fix the pages by taking the offset into account.
 - [x] Release a first prototype https://github.com/meilisearch/product/discussions/621#discussioncomment-6183647
 - [x] Make the pagination and filtering faster and more correct.
 - [x] Return the original vector in the output search results (like `query`).
 - [x] Return an `_semanticSimilarity` field in the documents (it's a dot product)
   - [x] Return this score even if the `_vectors` field is not displayed
   - [x] Rename the field `_semanticScore`.
   - [ ] Return the `_geoDistance` value even if the `_geo` field is not displayed
 - [x] Store the HNSW on possibly multiple LMDB values.
   - [ ] Measure it and make it faster if needed
   - [ ] Export the `ReadableSlices` type into a small external crate
 - [x] Accept an `_vectors` field instead of the `_vector` one.
 - [x] Normalize all vectors.
 - [ ] Remove the `_vectors` field from the default searchable attributes (as we do with `_geo`?).
 - [ ] Correctly compute the candidates by remembering the documents having a valid `_vectors` field.
 - [ ] Return the right errors:
     - [ ] Return an error when the query vector is not the same length as the vectors in the HNSW.
     - [ ] We must return the user document id that triggered the vector dimension issue.
     - [x] If an indexation error occurs.
     - [ ] Fix the error codes when using the search route.
 - [ ] ~~Introduce some settings:~~
    We currently ensure that the vector length is consistent over the whole set of documents and return an error for when a vector dimension doesn't follow the current number of dimensions.
     - [ ] The length of the vector the user will provide.
     - [ ] The distance function (we only support dot as of now).
 - [ ] Introduce other distance functions
    - [ ] Euclidean
    - [ ] Dot Product
    - [ ] Cosine
    - [ ] Make them SIMD optimized
    - [ ] Give credit to qdrant
 - [ ] Add tests.
 - [ ] Write a mini spec.
 - [ ] Release it in v1.3 as an experimental feature.

Co-authored-by: Clément Renault <[email protected]>
Co-authored-by: Kerollmops <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants