
Conversation

Sicheng-Pan
Contributor

@Sicheng-Pan Sicheng-Pan commented Sep 3, 2025

Description of changes

Summarize the changes made by this PR.

  • Improvements & Bug fixes
    • Change the sparse vector similarity metric from inner product to one minus inner product, consistent with our dense vector similarity metric (a minimal sketch of the new convention follows this list).
    • Update the Rank operator so that results are returned in increasing order of score; a smaller score means higher similarity.
    • Remove the SparseKnnMerge implementation, as it is now unnecessary.
  • New functionality
    • N/A
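
As a rough illustration of the new convention (not the actual operator code; the sparse-vector representation and function names here are hypothetical), the measure for a sparse query/document pair and the resulting ordering look roughly like this:

```rust
use std::collections::HashMap;

/// Hypothetical sketch: sparse vectors as dimension-id -> weight maps.
fn sparse_dot(query: &HashMap<u32, f32>, doc: &HashMap<u32, f32>) -> f32 {
    query
        .iter()
        .filter_map(|(id, qw)| doc.get(id).map(|dw| qw * dw))
        .sum()
}

/// New measure: one minus inner product, so a smaller value means more similar,
/// matching the convention used for dense vectors.
fn sparse_measure(query: &HashMap<u32, f32>, doc: &HashMap<u32, f32>) -> f32 {
    1.0 - sparse_dot(query, doc)
}

fn main() {
    let query = HashMap::from([(1, 0.5_f32), (7, 0.8)]);
    let docs = [
        HashMap::from([(1, 0.9_f32), (7, 0.1)]),
        HashMap::from([(7, 1.0_f32)]),
    ];
    // Rank-style output: ascending measure, so the best match comes first.
    let mut measures: Vec<f32> = docs.iter().map(|d| sparse_measure(&query, d)).collect();
    measures.sort_by(|a, b| a.total_cmp(b));
    println!("{measures:?}"); // approximately [0.2, 0.47]: smallest measure first
}
```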

Test plan

How are these changes tested?

  • Tests pass locally with `pytest` for Python, `yarn test` for JS, and `cargo test` for Rust.

Migration plan

Are there any migrations, or any forwards/backwards compatibility changes needed in order to make sure this change deploys reliably?

Observability plan

What is the plan to instrument and monitor this change?

Documentation Changes

Are all docstrings for user-facing APIs updated if required? Do we need to make documentation changes in the docs section?


github-actions bot commented Sep 3, 2025

Reviewer Checklist

Please leverage this checklist to ensure your code review is thorough before approving

Testing, Bugs, Errors, Logs, Documentation

  • Can you think of any use case in which the code does not behave as intended? Have they been tested?
  • Can you think of any inputs or external events that could break the code? Is user input validated and safe? Have they been tested?
  • If appropriate, are there adequate property based tests?
  • If appropriate, are there adequate unit tests?
  • Should any logging, debugging, tracing information be added or removed?
  • Are error messages user-friendly?
  • Have all documentation changes needed been made?
  • Have all non-obvious changes been commented?

System Compatibility

  • Are there any potential impacts on other parts of the system or backward compatibility?
  • Does this change intersect with any items on our roadmap, and if so, is there a plan for fitting them together?

Quality

  • Is this code of unexpectedly high quality? (Readability, Modularity, Intuitiveness)

@Sicheng-Pan Sicheng-Pan marked this pull request as ready for review September 3, 2025 22:11
Contributor

propel-code-bot bot commented Sep 3, 2025

Unify Sparse Vector Similarity Metric and Operator Logic (Inner Product to 1 - Inner Product, Remove Obsolete Merge Code)

This PR standardizes the similarity metric for sparse vectors by changing it from 'inner product' to '1 - inner product' throughout the codebase, aligning with the metric used for dense vectors. It refactors and updates impacted operators, orchestration code, and tests, removes the now-unnecessary SparseKnnMerge implementation, and ensures that ranking and merging logic returns results in ascending order (where smaller values represent higher similarity). The refactor also updates all associated function signatures, test logic, comments, and documentation to match the new convention.

Key Changes

• Similarity metric for sparse vector search changed from inner product to 1 - inner product for consistency with dense vector logic.
• Rank operator updated: output is now sorted in increasing score (smaller values indicate higher similarity).
• Obsolete SparseKnnMerge operator and related types/tests removed; the generic KnnMerge is used instead for both dense and sparse cases.
• All operator, handler, and orchestration code (including Knn, Spann, and SparseKnn orchestrators) updated to use new naming and measure semantics.
• All affected unit and integration tests updated for new similarity metric and result ordering.
• Documentation and comments revised to clarify meaning of similarity values under the new metric.
• Naming across struct and function signatures unified: e.g., distances renamed to measures, and batch_distances to batch_measures.
• Refactored heap-based selection logic in sparse search to correctly identify the top-k in the new similarity space (see the sketch after this list).
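
As a hedged sketch of the heap-based selection idea under the new measure (a simplified stand-in, not the operator's exact code), keeping the k smallest measures with a max-heap looks roughly like this: the heap keeps the current worst (largest) measure at the top, so evicting it when a better (smaller) measure arrives is cheap.

```rust
use std::cmp::Ordering;
use std::collections::BinaryHeap;

/// Simplified stand-in for the operator's record type, ordered by measure.
struct RecordMeasure {
    offset_id: u32,
    measure: f32,
}

impl PartialEq for RecordMeasure {
    fn eq(&self, other: &Self) -> bool {
        self.measure.total_cmp(&other.measure).is_eq()
    }
}

impl Eq for RecordMeasure {}

impl PartialOrd for RecordMeasure {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}

impl Ord for RecordMeasure {
    fn cmp(&self, other: &Self) -> Ordering {
        self.measure.total_cmp(&other.measure)
    }
}

/// Keep the k records with the smallest measure (most similar under 1 - inner product).
fn top_k(records: impl IntoIterator<Item = RecordMeasure>, k: usize) -> Vec<RecordMeasure> {
    // Max-heap: the worst candidate kept so far sits at the top.
    let mut worst_on_top = BinaryHeap::with_capacity(k);
    for record in records {
        if worst_on_top.len() < k {
            worst_on_top.push(record);
        } else if worst_on_top
            .peek()
            .map(|worst| record.measure < worst.measure)
            .unwrap_or(false)
        {
            worst_on_top.pop();
            worst_on_top.push(record);
        }
    }
    // into_sorted_vec returns ascending order: smallest measure (best match) first.
    worst_on_top.into_sorted_vec()
}
```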

Affected Areas

• Sparse vector search operators: sparse_log_knn.rs, sparse_index_knn.rs
• Rank operator (rank.rs)
• Merge logic: knn_merge.rs (now used for sparse and dense search)
• SparseKnn and SpannKnn orchestrators
• Test and benchmark code involving sparse similarity
• Operator module registry
• General documentation and tests related to similarity computation

This summary was automatically generated by @propel-code-bot

@Sicheng-Pan Sicheng-Pan force-pushed the 09-03-_enh_update_sparse_vector_similarity_metric branch from f5acb7f to cf9044d on September 3, 2025 22:32
@@ -78,7 +78,7 @@ impl Operator<SparseLogKnnInput, SparseLogKnnOutput> for SparseLogKnn {

        let logs = materialize_logs(&record_segment_reader, input.logs.clone(), None).await?;

-        let mut min_heap = BinaryHeap::with_capacity(self.limit as usize);
+        let mut max_heap = BinaryHeap::with_capacity(self.limit as usize);
Contributor


[BestPractice]

The current heap-based top-k selection logic doesn't handle the edge case where limit is 0. If self.limit is 0, the implementation will incorrectly return one result instead of an empty vector.

To ensure correctness, please add a check at the beginning of the run function to handle this case explicitly:

if self.limit == 0 {
    return Ok(SparseLogKnnOutput {
        records: Vec::new(),
    });
}
(File: rust/worker/src/execution/operators/sparse_log_knn.rs, line 81)

Contributor Author


Nice catch, but we shouldn't really allow limit=0
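
A minimal sketch of that stance, assuming a hypothetical validation step at query-construction time (none of these names come from the codebase):

```rust
/// Hypothetical guard: reject a zero limit up front instead of special-casing
/// it inside each knn operator.
fn validate_knn_limit(limit: u32) -> Result<u32, String> {
    if limit == 0 {
        Err("knn limit must be greater than zero".to_string())
    } else {
        Ok(limit)
    }
}
```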

@HammadB HammadB self-requested a review September 4, 2025 00:53
@Sicheng-Pan Sicheng-Pan force-pushed the 09-03-_enh_update_sparse_vector_similarity_metric branch from cf9044d to 0f73a79 on September 4, 2025 21:02
@Sicheng-Pan Sicheng-Pan force-pushed the 09-02-_enh_unimplement_hash_eq_for_knnquery branch from 9e70931 to 459fc7e on September 4, 2025 21:02
Comment on lines +101 to 113
            if (max_heap.len() as u32) < self.limit {
                max_heap.push(RecordMeasure {
                    offset_id: log.get_offset_id(),
                    measure: score,
-                }));
-            } else if min_heap
-                .peek()
-                .map(|Reverse(record)| record.measure)
-                .unwrap_or(f32::MIN)
-                < score
+                });
+            } else if score
+                < max_heap
+                    .peek()
+                    .map(|record| record.measure)
+                    .unwrap_or(f32::MAX)
            {
Contributor


[BestPractice]

Logic error in heap management: You changed from a min_heap with Reverse wrapper to a max_heap without Reverse, but the comparison logic is incorrect for the new similarity metric.

With the new metric 1.0 - dot_product, smaller values are better (more similar). You should use a max heap to track the worst (largest) scores and evict them when better (smaller) scores are found.

Current logic:

} else if score < max_heap.peek().map(|record| record.measure).unwrap_or(f32::MAX) {

This will replace elements when the new score is smaller than the heap's maximum, which is correct. However, using BinaryHeap (max heap) means the largest scores stay at the top and get evicted first, which is what we want.

The logic appears correct, but consider adding a comment to clarify that we're using a max heap to track the worst scores for easier understanding.

(File: rust/worker/src/execution/operators/sparse_log_knn.rs, line 111)

Contributor Author


The description seems inconsistent here; it seems to be a suggestion for a better comment.

@Sicheng-Pan Sicheng-Pan force-pushed the 09-02-_enh_unimplement_hash_eq_for_knnquery branch from 459fc7e to c156092 on September 8, 2025 17:23
@Sicheng-Pan Sicheng-Pan force-pushed the 09-03-_enh_update_sparse_vector_similarity_metric branch from 0f73a79 to bd18a40 on September 8, 2025 17:23
Contributor

blacksmith-sh bot commented Sep 8, 2025

Summary: 1 successful workflow, 1 pending workflow

Last updated: 2025-09-09 16:52:12 UTC

@Sicheng-Pan Sicheng-Pan force-pushed the 09-02-_enh_unimplement_hash_eq_for_knnquery branch from c156092 to 8c99f63 on September 8, 2025 17:59
@Sicheng-Pan Sicheng-Pan force-pushed the 09-03-_enh_update_sparse_vector_similarity_metric branch from bd18a40 to 3671353 on September 8, 2025 17:59
@Sicheng-Pan Sicheng-Pan force-pushed the 09-03-_enh_update_sparse_vector_similarity_metric branch from 3671353 to af93d8c on September 9, 2025 16:00
@Sicheng-Pan Sicheng-Pan force-pushed the 09-02-_enh_unimplement_hash_eq_for_knnquery branch from 8c99f63 to 17c2358 on September 9, 2025 16:00
@Sicheng-Pan Sicheng-Pan force-pushed the 09-03-_enh_update_sparse_vector_similarity_metric branch from af93d8c to f978112 on September 9, 2025 16:19
@Sicheng-Pan Sicheng-Pan force-pushed the 09-02-_enh_unimplement_hash_eq_for_knnquery branch from 17c2358 to f2dbbfd on September 9, 2025 16:19
Contributor Author

Sicheng-Pan commented Sep 9, 2025

Merge activity

  • Sep 9, 4:50 PM UTC: A user started a stack merge that includes this pull request via Graphite.
  • Sep 9, 4:52 PM UTC: @Sicheng-Pan merged this pull request with Graphite.

@Sicheng-Pan Sicheng-Pan changed the base branch from 09-02-_enh_unimplement_hash_eq_for_knnquery to graphite-base/5406 September 9, 2025 16:51
@Sicheng-Pan Sicheng-Pan changed the base branch from graphite-base/5406 to main September 9, 2025 16:51
@Sicheng-Pan Sicheng-Pan merged commit fc31e6a into main Sep 9, 2025
134 of 233 checks passed
@Sicheng-Pan Sicheng-Pan deleted the 09-03-_enh_update_sparse_vector_similarity_metric branch September 9, 2025 16:52