
Conversation

Sicheng-Pan
Contributor

@Sicheng-Pan Sicheng-Pan commented Sep 3, 2025

Description of changes

Summarize the changes made by this PR.

  • Improvements & Bug fixes
    • Change the sparse vector similarity metric from inner product to one minus inner product, consistent with our dense vector similarity metric (a minimal sketch of the new convention follows this list).
    • Update the Rank operator so that results are returned in increasing order of score; a smaller score means higher similarity.
    • Remove the SparseKnnMerge implementation, as it is now unnecessary.
  • New functionality
    • N/A
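
As a rough illustration of the new convention (not the actual operator code; the sparse-vector representation and function names here are hypothetical), the measure for a sparse query/document pair and the resulting ordering look roughly like this:

```rust
use std::collections::HashMap;

/// Hypothetical sketch: sparse vectors as dimension-id -> weight maps.
fn sparse_dot(query: &HashMap<u32, f32>, doc: &HashMap<u32, f32>) -> f32 {
    query
        .iter()
        .filter_map(|(id, qw)| doc.get(id).map(|dw| qw * dw))
        .sum()
}

/// New measure: one minus inner product, so a smaller value means more similar,
/// matching the convention used for dense vectors.
fn sparse_measure(query: &HashMap<u32, f32>, doc: &HashMap<u32, f32>) -> f32 {
    1.0 - sparse_dot(query, doc)
}

fn main() {
    let query = HashMap::from([(1, 0.5_f32), (7, 0.8)]);
    let docs = [
        HashMap::from([(1, 0.9_f32), (7, 0.1)]),
        HashMap::from([(7, 1.0_f32)]),
    ];
    // Rank-style output: ascending measure, so the best match comes first.
    let mut measures: Vec<f32> = docs.iter().map(|d| sparse_measure(&query, d)).collect();
    measures.sort_by(|a, b| a.total_cmp(b));
    println!("{measures:?}"); // approximately [0.2, 0.47]: smallest measure first
}
```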

Test plan

How are these changes tested?

  • Tests pass locally with `pytest` for Python, `yarn test` for JS, and `cargo test` for Rust.

Migration plan

Are there any migrations, or any forwards/backwards compatibility changes needed in order to make sure this change deploys reliably?

Observability plan

What is the plan to instrument and monitor this change?

Documentation Changes

Are all docstrings for user-facing APIs updated if required? Do we need to make documentation changes in the docs section?


github-actions bot commented Sep 3, 2025

Reviewer Checklist

Please leverage this checklist to ensure your code review is thorough before approving

Testing, Bugs, Errors, Logs, Documentation

  • Can you think of any use case in which the code does not behave as intended? Have they been tested?
  • Can you think of any inputs or external events that could break the code? Is user input validated and safe? Have they been tested?
  • If appropriate, are there adequate property based tests?
  • If appropriate, are there adequate unit tests?
  • Should any logging, debugging, tracing information be added or removed?
  • Are error messages user-friendly?
  • Have all documentation changes needed been made?
  • Have all non-obvious changes been commented?

System Compatibility

  • Are there any potential impacts on other parts of the system or backward compatibility?
  • Does this change intersect with any items on our roadmap, and if so, is there a plan for fitting them together?

Quality

  • Is this code of unexpectedly high quality? (Readability, Modularity, Intuitiveness)

@Sicheng-Pan Sicheng-Pan marked this pull request as ready for review September 3, 2025 22:11
Contributor

propel-code-bot bot commented Sep 3, 2025

Unify Sparse Vector Similarity Metric and Operator Logic (Inner Product to 1 - Inner Product, Remove Obsolete Merge Code)

This PR standardizes the similarity metric for sparse vectors by changing it from 'inner product' to '1 - inner product' throughout the codebase, aligning with the metric used for dense vectors. It refactors and updates impacted operators, orchestration code, and tests, removes the now-unnecessary SparseKnnMerge implementation, and ensures that ranking and merging logic returns results in ascending order (where smaller values represent higher similarity). The refactor also updates all associated function signatures, test logic, comments, and documentation to match the new convention.

Key Changes

• Similarity metric for sparse vector search changed from inner product to 1 - inner product for consistency with dense vector logic.
• Rank operator updated: output is now sorted in increasing score (smaller values indicate higher similarity).
• Obsolete SparseKnnMerge operator and related types/tests removed; the generic KnnMerge is used instead for both dense and sparse cases.
• All operator, handler, and orchestration code (including Knn, Spann, and SparseKnn orchestrators) updated to use new naming and measure semantics.
• All affected unit and integration tests updated for new similarity metric and result ordering.
• Documentation and comments revised to clarify meaning of similarity values under the new metric.
• Naming across struct and function signatures unified: e.g., distances renamed to measures, and batch_distances to batch_measures.
• Refactored heap-based selection logic in sparse search to correctly identify the top-k in the new similarity space (see the sketch after this list).
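
As a hedged sketch of the heap-based selection idea under the new measure (a simplified stand-in, not the operator's exact code), keeping the k smallest measures with a max-heap looks roughly like this: the heap keeps the current worst (largest) measure at the top, so evicting it when a better (smaller) measure arrives is cheap.

```rust
use std::cmp::Ordering;
use std::collections::BinaryHeap;

/// Simplified stand-in for the operator's record type, ordered by measure.
struct RecordMeasure {
    offset_id: u32,
    measure: f32,
}

impl PartialEq for RecordMeasure {
    fn eq(&self, other: &Self) -> bool {
        self.measure.total_cmp(&other.measure).is_eq()
    }
}

impl Eq for RecordMeasure {}

impl PartialOrd for RecordMeasure {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}

impl Ord for RecordMeasure {
    fn cmp(&self, other: &Self) -> Ordering {
        self.measure.total_cmp(&other.measure)
    }
}

/// Keep the k records with the smallest measure (most similar under 1 - inner product).
fn top_k(records: impl IntoIterator<Item = RecordMeasure>, k: usize) -> Vec<RecordMeasure> {
    // Max-heap: the worst candidate kept so far sits at the top.
    let mut worst_on_top = BinaryHeap::with_capacity(k);
    for record in records {
        if worst_on_top.len() < k {
            worst_on_top.push(record);
        } else if worst_on_top
            .peek()
            .map(|worst| record.measure < worst.measure)
            .unwrap_or(false)
        {
            worst_on_top.pop();
            worst_on_top.push(record);
        }
    }
    // into_sorted_vec returns ascending order: smallest measure (best match) first.
    worst_on_top.into_sorted_vec()
}
```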

Affected Areas

• Sparse vector search operators: sparse_log_knn.rs, sparse_index_knn.rs
• Rank operator (rank.rs)
• Merge logic: knn_merge.rs (now used for sparse and dense search)
• SparseKnn and SpannKnn orchestrators
• Test and benchmark code involving sparse similarity
• Operator module registry
• General documentation and tests related to similarity computation

This summary was automatically generated by @propel-code-bot

@Sicheng-Pan Sicheng-Pan force-pushed the 09-03-_enh_update_sparse_vector_similarity_metric branch from f5acb7f to cf9044d on September 3, 2025 22:32
@@ -78,7 +78,7 @@ impl Operator<SparseLogKnnInput, SparseLogKnnOutput> for SparseLogKnn {

        let logs = materialize_logs(&record_segment_reader, input.logs.clone(), None).await?;

-        let mut min_heap = BinaryHeap::with_capacity(self.limit as usize);
+        let mut max_heap = BinaryHeap::with_capacity(self.limit as usize);
Contributor


[BestPractice]

The current heap-based top-k selection logic doesn't handle the edge case where limit is 0. If self.limit is 0, the implementation will incorrectly return one result instead of an empty vector.

To ensure correctness, please add a check at the beginning of the run function to handle this case explicitly:

if self.limit == 0 {
    return Ok(SparseLogKnnOutput {
        records: Vec::new(),
    });
}
(File: rust/worker/src/execution/operators/sparse_log_knn.rs, line 81)

Contributor Author


Nice catch, but we shouldn't really allow limit=0
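
A minimal sketch of that stance, assuming a hypothetical validation step at query-construction time (none of these names come from the codebase):

```rust
/// Hypothetical guard: reject a zero limit up front instead of special-casing
/// it inside each knn operator.
fn validate_knn_limit(limit: u32) -> Result<u32, String> {
    if limit == 0 {
        Err("knn limit must be greater than zero".to_string())
    } else {
        Ok(limit)
    }
}
```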

@HammadB HammadB self-requested a review September 4, 2025 00:53
@Sicheng-Pan Sicheng-Pan force-pushed the 09-03-_enh_update_sparse_vector_similarity_metric branch from cf9044d to 0f73a79 on September 4, 2025 21:02
@Sicheng-Pan Sicheng-Pan force-pushed the 09-02-_enh_unimplement_hash_eq_for_knnquery branch from 9e70931 to 459fc7e on September 4, 2025 21:02
Comment on lines +101 to 113
            if (max_heap.len() as u32) < self.limit {
                max_heap.push(RecordMeasure {
                    offset_id: log.get_offset_id(),
                    measure: score,
-                }));
-            } else if min_heap
-                .peek()
-                .map(|Reverse(record)| record.measure)
-                .unwrap_or(f32::MIN)
-                < score
+                });
+            } else if score
+                < max_heap
+                    .peek()
+                    .map(|record| record.measure)
+                    .unwrap_or(f32::MAX)
            {
Contributor


[BestPractice]

Logic error in heap management: You changed from a min_heap with Reverse wrapper to a max_heap without Reverse, but the comparison logic is incorrect for the new similarity metric.

With the new metric 1.0 - dot_product, smaller values are better (more similar). You should use a max heap to track the worst (largest) scores and evict them when better (smaller) scores are found.

Current logic:

} else if score < max_heap.peek().map(|record| record.measure).unwrap_or(f32::MAX) {

This will replace elements when the new score is smaller than the heap's maximum, which is correct. However, using BinaryHeap (max heap) means the largest scores stay at the top and get evicted first, which is what we want.

The logic appears correct, but consider adding a comment to clarify that we're using a max heap to track the worst scores for easier understanding.

(File: rust/worker/src/execution/operators/sparse_log_knn.rs, line 111)

Contributor Author


The description seems inconsistent here; it seems to be a suggestion for a better comment.

@Sicheng-Pan Sicheng-Pan force-pushed the 09-02-_enh_unimplement_hash_eq_for_knnquery branch from 459fc7e to c156092 on September 8, 2025 17:23
@Sicheng-Pan Sicheng-Pan force-pushed the 09-03-_enh_update_sparse_vector_similarity_metric branch from 0f73a79 to bd18a40 on September 8, 2025 17:23
Contributor

blacksmith-sh bot commented Sep 8, 2025

Summary: 1 successful workflow, 1 pending workflow

Last updated: 2025-09-09 16:52:12 UTC

@Sicheng-Pan Sicheng-Pan force-pushed the 09-02-_enh_unimplement_hash_eq_for_knnquery branch from c156092 to 8c99f63 on September 8, 2025 17:59
@Sicheng-Pan Sicheng-Pan force-pushed the 09-03-_enh_update_sparse_vector_similarity_metric branch from bd18a40 to 3671353 on September 8, 2025 17:59
@Sicheng-Pan Sicheng-Pan force-pushed the 09-03-_enh_update_sparse_vector_similarity_metric branch from 3671353 to af93d8c on September 9, 2025 16:00
@Sicheng-Pan Sicheng-Pan force-pushed the 09-02-_enh_unimplement_hash_eq_for_knnquery branch from 8c99f63 to 17c2358 on September 9, 2025 16:00
@Sicheng-Pan Sicheng-Pan force-pushed the 09-03-_enh_update_sparse_vector_similarity_metric branch from af93d8c to f978112 on September 9, 2025 16:19
@Sicheng-Pan Sicheng-Pan force-pushed the 09-02-_enh_unimplement_hash_eq_for_knnquery branch from 17c2358 to f2dbbfd on September 9, 2025 16:19
Contributor Author

Sicheng-Pan commented Sep 9, 2025

Merge activity

  • Sep 9, 4:50 PM UTC: A user started a stack merge that includes this pull request via Graphite.
  • Sep 9, 4:52 PM UTC: @Sicheng-Pan merged this pull request with Graphite.

@Sicheng-Pan Sicheng-Pan changed the base branch from 09-02-_enh_unimplement_hash_eq_for_knnquery to graphite-base/5406 September 9, 2025 16:51
@Sicheng-Pan Sicheng-Pan changed the base branch from graphite-base/5406 to main September 9, 2025 16:51
@Sicheng-Pan Sicheng-Pan merged commit fc31e6a into main Sep 9, 2025
134 of 233 checks passed
@Sicheng-Pan Sicheng-Pan deleted the 09-03-_enh_update_sparse_vector_similarity_metric branch September 9, 2025 16:52