-
Notifications
You must be signed in to change notification settings - Fork 1.8k
[ENH] Update sparse vector similarity metric #5406
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[ENH] Update sparse vector similarity metric #5406
Conversation
Reviewer ChecklistPlease leverage this checklist to ensure your code review is thorough before approving Testing, Bugs, Errors, Logs, Documentation
System Compatibility
Quality
|
Unify Sparse Vector Similarity Metric and Operator Logic (Inner Product to 1 - Inner Product, Remove Obsolete Merge Code) This PR standardizes the similarity metric for sparse vectors by changing it from 'inner product' to '1 - inner product' throughout the codebase, aligning with the metric used for dense vectors. It refactors and updates impacted operators, orchestration code, and tests, removes the now-unnecessary SparseKnnMerge implementation, and ensures that ranking and merging logic returns results in ascending order (where smaller values represent higher similarity). The refactor also updates all associated function signatures, test logic, comments, and documentation to match the new convention. Key Changes• Similarity metric for sparse vector search changed from inner product to 1 - inner product for consistency with dense vector logic. Affected Areas• Sparse vector search operators: This summary was automatically generated by @propel-code-bot |
f5acb7f
to
cf9044d
Compare
@@ -78,7 +78,7 @@ impl Operator<SparseLogKnnInput, SparseLogKnnOutput> for SparseLogKnn { | |||
|
|||
let logs = materialize_logs(&record_segment_reader, input.logs.clone(), None).await?; | |||
|
|||
let mut min_heap = BinaryHeap::with_capacity(self.limit as usize); | |||
let mut max_heap = BinaryHeap::with_capacity(self.limit as usize); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[BestPractice]
The current heap-based top-k selection logic doesn't handle the edge case where limit
is 0. If self.limit
is 0, the implementation will incorrectly return one result instead of an empty vector.
To ensure correctness, please add a check at the beginning of the run
function to handle this case explicitly:
if self.limit == 0 {
return Ok(SparseLogKnnOutput {
records: Vec::new(),
});
}
Context for Agents
[**BestPractice**]
The current heap-based top-k selection logic doesn't handle the edge case where `limit` is 0. If `self.limit` is 0, the implementation will incorrectly return one result instead of an empty vector.
To ensure correctness, please add a check at the beginning of the `run` function to handle this case explicitly:
```rust
if self.limit == 0 {
return Ok(SparseLogKnnOutput {
records: Vec::new(),
});
}
```
File: rust/worker/src/execution/operators/sparse_log_knn.rs
Line: 81
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice catch, but we shouldn't really allow limit=0
cf9044d
to
0f73a79
Compare
9e70931
to
459fc7e
Compare
if (max_heap.len() as u32) < self.limit { | ||
max_heap.push(RecordMeasure { | ||
offset_id: log.get_offset_id(), | ||
measure: score, | ||
})); | ||
} else if min_heap | ||
.peek() | ||
.map(|Reverse(record)| record.measure) | ||
.unwrap_or(f32::MIN) | ||
< score | ||
}); | ||
} else if score | ||
< max_heap | ||
.peek() | ||
.map(|record| record.measure) | ||
.unwrap_or(f32::MAX) | ||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[BestPractice]
Logic error in heap management: You changed from a min_heap
with Reverse
wrapper to a max_heap
without Reverse
, but the comparison logic is incorrect for the new similarity metric.
With the new metric 1.0 - dot_product
, smaller values are better (more similar). You should use a max heap to track the worst (largest) scores and evict them when better (smaller) scores are found.
Current logic:
} else if score < max_heap.peek().map(|record| record.measure).unwrap_or(f32::MAX) {
This will replace elements when the new score is smaller than the heap's maximum, which is correct. However, using BinaryHeap
(max heap) means the largest scores stay at the top and get evicted first, which is what we want.
The logic appears correct, but consider adding a comment to clarify that we're using a max heap to track the worst scores for easier understanding.
Context for Agents
[**BestPractice**]
Logic error in heap management: You changed from a `min_heap` with `Reverse` wrapper to a `max_heap` without `Reverse`, but the comparison logic is incorrect for the new similarity metric.
With the new metric `1.0 - dot_product`, **smaller values are better** (more similar). You should use a **max heap** to track the **worst** (largest) scores and evict them when better (smaller) scores are found.
Current logic:
```rust
} else if score < max_heap.peek().map(|record| record.measure).unwrap_or(f32::MAX) {
```
This will replace elements when the new score is smaller than the heap's maximum, which is correct. However, using `BinaryHeap` (max heap) means the largest scores stay at the top and get evicted first, which is what we want.
The logic appears correct, but consider adding a comment to clarify that we're using a max heap to track the worst scores for easier understanding.
File: rust/worker/src/execution/operators/sparse_log_knn.rs
Line: 111
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the description seems inconsistent here. seems to be a suggestion for better comment.
459fc7e
to
c156092
Compare
0f73a79
to
bd18a40
Compare
Summary: 1 successful workflow, 1 pending workflow
Last updated: 2025-09-09 16:52:12 UTC |
c156092
to
8c99f63
Compare
bd18a40
to
3671353
Compare
3671353
to
af93d8c
Compare
8c99f63
to
17c2358
Compare
af93d8c
to
f978112
Compare
17c2358
to
f2dbbfd
Compare
Merge activity
|
Description of changes
Summarize the changes made by this PR.
Rank
operator so that the results are returned in increasing score. Smaller score means higher similarity.SparseKnnMerge
implementation as it is unnecessary nowTest plan
How are these changes tested?
pytest
for python,yarn test
for js,cargo test
for rustMigration plan
Are there any migrations, or any forwards/backwards compatibility changes needed in order to make sure this change deploys reliably?
Observability plan
What is the plan to instrument and monitor this change?
Documentation Changes
Are all docstrings for user-facing APIs updated if required? Do we need to make documentation changes in the docs section?