Generalize quantization state to store statistical profile of vectors in the segment #2243

Open
jmazanec15 opened this issue Oct 31, 2024 · 0 comments
Labels
Features Introduces a new unit of functionality that satisfies a requirement

Comments

@jmazanec15
Member

Description

As part of the quantization framework, we added functionality to sample the data going into a segment, run some statistical profiling on it, and then serialize the result to the quantization state file: https://github.com/opensearch-project/k-NN/blob/main/src/main/java/org/opensearch/knn/index/codec/KNN990Codec/KNN990QuantizationStateWriter.java.

I think it'd be pretty interesting to generalize this framework to get insights into the vector data going into the segments. This could then be used either to debug recall issues at the segment level (i.e. why quantization is not working as well as expected) or to make decisions about index configuration. For a fairly trivial example, by looking at the data range, we could determine whether going from fp32 to fp16 would cost any recall.
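To make the fp32-to-fp16 example concrete, here is a minimal sketch in plain Python (not the plugin's Java code). The function names `fp16_round_trip` and `fp16_profile` are hypothetical and exist only for illustration. It checks whether every sampled value fits in the finite fp16 range and reports the largest round-trip error. A small error is only a proxy: it suggests recall is unlikely to suffer, but it does not prove it.

```python
import struct

# Largest finite IEEE 754 half-precision (fp16) value.
FP16_MAX = 65504.0


def fp16_round_trip(x: float) -> float:
    """Round x to the nearest fp16 value and convert it back to a float."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def fp16_profile(vectors):
    """Hypothetical helper: return (fits_in_range, max_abs_error).

    fits_in_range is False if any value falls outside the finite fp16 range.
    max_abs_error is the largest |x - fp16(x)| over the in-range values.
    """
    fits = True
    max_err = 0.0
    for vec in vectors:
        for x in vec:
            if abs(x) > FP16_MAX:
                # Out-of-range values cannot be stored in fp16 at all.
                fits = False
                continue
            max_err = max(max_err, abs(x - fp16_round_trip(x)))
    return fits, max_err
```

For instance, `fp16_profile([[0.5, -1.25], [2.0, 0.0]])` reports that everything fits with zero error, because those values are exactly representable in fp16, while a single value of `70000.0` makes the range check fail.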

Some statistics could be:

  1. Per-dimension mean
  2. Per-dimension quantiles
  3. Per-dimension variance
  4. Sparsity metric
  5. Intrinsic dimensionality

Would need to do a more thorough brainstorm on this.
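As a rough illustration of the first four statistics, the sketch below computes them over a sample of vectors using only Python's standard `statistics` module. Everything here is an assumption for discussion: `profile_vectors` and the dictionary keys are hypothetical names, "sparsity" is taken to mean the fraction of exactly-zero components, and quartiles stand in for whatever quantiles we would actually want. Intrinsic dimensionality is left out because it needs a real estimator.

```python
from statistics import fmean, pvariance, quantiles


def profile_vectors(vectors):
    """Hypothetical helper: summarize a sample of equal-length vectors.

    Returns per-dimension mean, population variance, and quartiles,
    plus an overall sparsity value (fraction of exactly-zero components).
    """
    if len(vectors) < 2:
        raise ValueError("need at least two sampled vectors")
    # Transpose so each tuple holds every sampled value for one dimension.
    # Assumes all vectors have the same length.
    dims = list(zip(*vectors))
    if not dims:
        raise ValueError("vectors must have at least one dimension")

    total = len(vectors) * len(dims)
    zeros = sum(1 for vec in vectors for x in vec if x == 0.0)

    return {
        "mean": [fmean(d) for d in dims],
        "variance": [pvariance(d) for d in dims],
        # Three cut points per dimension: Q1, median, Q3.
        "quartiles": [quantiles(d, n=4) for d in dims],
        "sparsity": zeros / total,
    }
```

A real implementation in the segment writer would more likely use streaming or sampled estimators so the full set of vectors never has to be held in memory; the batch form here is only meant to pin down what each statistic means.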
@jmazanec15 jmazanec15 added the Features Introduces a new unit of functionality that satisfies a requirement label Oct 31, 2024