HNSW-based vector index #4578

Open
wants to merge 12 commits into
base: master
ray6080 (Contributor) commented Dec 2, 2024

Description

This is the first version of the two-layer HNSW-based vector index. We keep two layers, each of which is a graph stored as a forward-only rel table.

Syntax

The supported syntax is:

CALL CREATE_HNSW_INDEX('index_name', 'table_name', 'column_name', ...);

CALL QUERY_HNSW_INDEX('index_name', 'table_name', vector, k, ...) RETURN nn.id, _distance ORDER BY _distance;

CALL DROP_HNSW_INDEX('index_name', 'table_name');

In this PR, L2, Cosine, L2SQ, and Dot Product are supported as distance functions.
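
For reference, here is a minimal sketch of the four metrics over plain `std::vector<float>`. The function names are hypothetical and not taken from the PR. Treating cosine as `1 - cosine similarity` is an assumption made for this illustration; the PR may define or normalize it differently.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helpers for illustration only; not the PR's actual API.
using Vec = std::vector<float>;

// Dot product of two equally sized vectors.
inline double dotProduct(const Vec& a, const Vec& b) {
    assert(a.size() == b.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        sum += static_cast<double>(a[i]) * b[i];
    }
    return sum;
}

// Squared Euclidean distance (L2SQ): sum of squared differences.
inline double l2sqDistance(const Vec& a, const Vec& b) {
    assert(a.size() == b.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const double d = static_cast<double>(a[i]) - b[i];
        sum += d * d;
    }
    return sum;
}

// Euclidean distance (L2): square root of L2SQ, so both rank neighbors identically.
inline double l2Distance(const Vec& a, const Vec& b) {
    return std::sqrt(l2sqDistance(a, b));
}

// Cosine distance, assumed here to be 1 - cosine similarity.
// Both vectors must be non-zero.
inline double cosineDistance(const Vec& a, const Vec& b) {
    const double denom = std::sqrt(dotProduct(a, a)) * std::sqrt(dotProduct(b, b));
    assert(denom > 0.0);
    return 1.0 - dotProduct(a, b) / denom;
}
```

Since L2 is a monotonic function of L2SQ, a search that only needs the ranking can skip the square root.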

Query Plan

Note that the planning part should be refactored by @andyfengHKU after this PR is merged.

Index creation

Index creation works similarly to COPY REL. The CREATE_HNSW_INDEX call statement is compiled into three pipelines. The first constructs the in-memory index, which consists of the upper and lower layer graphs, and then converts the lower and upper in-memory graphs into flat tuples of src and dst node offsets. These tuples are shared with the second and third pipelines, respectively.
The second and third pipelines reuse 'RelBatchInsert' to organize the tuples from the lower and upper layer graphs into node groups and flush them as rel tables.

Note that the above procedure assumes that we create two rel tables, one each for the upper and lower layers. This is done through the rewriteFunc, which generates a query string that is executed before the actual index creation function.
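
To make the hand-off between the pipelines concrete, the sketch below flattens an in-memory adjacency list into (src, dst) offset tuples. `InMemLayer`, `EdgeTuple`, and `flattenLayer` are hypothetical names for illustration; the PR's actual data structures may look different.

```cpp
#include <cstdint>
#include <vector>

using offset_t = std::uint64_t;

// One flat tuple per directed edge, as consumed by a rel-table batch insert.
struct EdgeTuple {
    offset_t src;
    offset_t dst;
};

// Hypothetical in-memory layer: neighbors[src] lists the dst offsets of src.
using InMemLayer = std::vector<std::vector<offset_t>>;

// Convert one layer into flat (src, dst) tuples, emitted in src order.
std::vector<EdgeTuple> flattenLayer(const InMemLayer& layer) {
    std::vector<EdgeTuple> tuples;
    for (offset_t src = 0; src < layer.size(); ++src) {
        for (offset_t dst : layer[src]) {
            tuples.push_back({src, dst});
        }
    }
    return tuples;
}
```

Emitting tuples in src order is a choice made in this sketch; it keeps the edges of one source node contiguous, which should make grouping them into node groups straightforward.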

Also, in this PR, to simplify the design and improve performance, we cache the whole embedding column from the node table. This can improve performance by avoiding random reads from disk, but it reduces the scalability of the index in the longer term. For example, assuming embeddings are not compressed, a single FLOAT[1024] embedding is 4KB, which is a reasonable size to read from disk, while caching 10M such embeddings in memory costs about 40GB.
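
The arithmetic behind that estimate can be written out as below. `embeddingCacheBytes` is a hypothetical helper, and the sketch assumes a 4-byte `float` and no per-embedding overhead.

```cpp
#include <cstdint>

// Bytes needed to cache `numEmbeddings` uncompressed FLOAT[dim] embeddings.
// Assumes sizeof(float) == 4 and ignores any per-embedding metadata.
constexpr std::uint64_t embeddingCacheBytes(std::uint64_t numEmbeddings, std::uint64_t dim) {
    return numEmbeddings * dim * sizeof(float);
}

// A single FLOAT[1024] embedding occupies 4096 bytes (4KB).
static_assert(embeddingCacheBytes(1, 1024) == 4096, "one FLOAT[1024] embedding is 4KB");
```

For 10M embeddings this gives 10,000,000 × 4,096 = 40,960,000,000 bytes, which matches the rough 40GB figure.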

Index search

The index search (QUERY_HNSW_INDEX) is compiled into a hash join. The build side is the table function, which searches the index and returns k tuples; the probe side is a scan of the base node table. Note that we force a semi mask to be passed from the build side to the probe side to reduce unnecessary scans, assuming k is small. Eventually, the optimizer should decide, based on cardinality, which side is the build side and whether to pass a semi mask.
Also, we may replace the hash join with lookups, but not in this initial PR.
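
A toy model of this plan shape is sketched below: the build side contributes k (node offset, distance) pairs, a semi mask marks those offsets, and the probe-side scan only touches masked rows. All names are hypothetical, and a real scan would skip at a coarser granularity than single rows.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

using offset_t = std::uint64_t;

// Build-side tuple produced by the index search (hypothetical shape).
struct SearchResult {
    offset_t nodeOffset;
    double distance;
};

// Output row: a property of the base node plus the distance.
struct JoinedRow {
    std::string id;
    double distance;
};

std::vector<JoinedRow> joinWithSemiMask(const std::vector<SearchResult>& buildSide,
    const std::vector<std::string>& nodeIds) {
    // Build phase: hash table keyed on node offset, plus a semi mask over the node table.
    std::unordered_map<offset_t, double> hashTable;
    std::vector<bool> semiMask(nodeIds.size(), false);
    for (const SearchResult& r : buildSide) {
        if (r.nodeOffset >= nodeIds.size()) {
            continue; // ignore offsets outside the table in this toy model
        }
        hashTable.emplace(r.nodeOffset, r.distance);
        semiMask[r.nodeOffset] = true;
    }
    // Probe phase: scan the node table, skipping every row the mask rules out.
    std::vector<JoinedRow> out;
    for (offset_t offset = 0; offset < nodeIds.size(); ++offset) {
        if (!semiMask[offset]) {
            continue;
        }
        out.push_back({nodeIds[offset], hashTable.at(offset)});
    }
    return out;
}
```

In this sketch the joined rows come out in scan order rather than distance order, so a caller that wants nearest-first results still has to sort by distance.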


For reviewer:
Pointers to important changes in the PR:

  • src/function/table/call/hnsw/: implementation of create/query/drop index table functions.
  • src/include/function/table/hnsw/hnsw_index_functions.h: definition of all hnsw index related table functions.
  • src/include/storage/index/: core hnsw index data structures and algorithm implementation.

@ray6080 ray6080 force-pushed the vector-index branch 3 times, most recently from 67ac29a to 303a0af Compare December 4, 2024 13:16
@ray6080 ray6080 force-pushed the vector-index branch 2 times, most recently from 14fdae1 to 1d2d1c8 Compare December 11, 2024 14:57
@ray6080 ray6080 changed the title Prototype of vector index HNSW-based vector index Dec 11, 2024
@ray6080 ray6080 force-pushed the vector-index branch 3 times, most recently from 787389e to 6993586 Compare December 13, 2024 06:22
@ray6080 ray6080 marked this pull request as ready for review December 13, 2024 13:23
@ray6080 ray6080 requested review from semihsalihoglu-uw and removed request for benjaminwinger December 17, 2024 04:42
semihsalihoglu-uw (Contributor) left a comment

I have several comments throughout, but most are small or medium-level. My broader comments are these. To organize my thoughts, let me first list several problems in the current implementation. Then I will propose some changes that I think address them.

Problem 1: HNSWGraph class design: There are several problems here.
Problem 1.1: Some of the functions in HNSWGraph are not used in OnDiskHNSWGraph, which only supports searching. So insertion-related functions, e.g., createDirectedEdge, finalize, and shrink, are marked as UNREACHABLE but still inherited by OnDiskHNSWGraph. You have a similar problem in the HNSWIndex class hierarchy.
Problem 1.2: I am not convinced you need OnDiskHNSWGraph at all. We have an OnDiskGraph implementation that seems to support all of the functionality that you need, which is getting the neighbors of nodes from a rel table. OnDiskGraph is able to do this already and provides an iterator interface.

Problem 2: HNSWIndex design: You seem to have a similar (but opposite) problem to Problem 1.1. It looks like HNSWIndex contains search-related functions, such as searchUpperLayer and searchLowerLayer, that are not needed by InMemHNSWIndex. In some sense, InMemHNSWIndex also needs search functions for the upper and lower layers to do insertions, but for InMemHNSWIndex these can be the same function (see my comment in Problem 3 below).

Problem 3: HNSWIndex insert functions: Your InMemHNSWIndex::insertToUpperLayer and InMemHNSWIndex::insertToLowerLayer functions deviate from the design doc. These can actually be the same function that takes in an entry point (the first node).
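
As a rough sketch of what a single entry-point-driven routine could look like (this is not the algorithm from the design doc, which is not reproduced here): one beam search and one insert function serve both layers, and only the layer and the entry point differ. `Layer`, `searchLayer`, and `insertIntoLayer` are hypothetical names, L2SQ is used as the metric, node offsets are assumed to be dense indices, and neighbor pruning is omitted.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

using offset_t = std::uint64_t;
using Vec = std::vector<float>;

// Hypothetical in-memory layer: adjacency list indexed by node offset.
struct Layer {
    std::vector<std::vector<offset_t>> neighbors;
};

inline double l2sq(const Vec& a, const Vec& b) {
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const double d = static_cast<double>(a[i]) - b[i];
        sum += d * d;
    }
    return sum;
}

// Beam search from `entryPoint`: returns up to `ef` (ef >= 1) closest nodes, nearest first.
std::vector<offset_t> searchLayer(const Layer& layer, const std::vector<Vec>& embeddings,
    const Vec& query, offset_t entryPoint, std::size_t ef) {
    using Entry = std::pair<double, offset_t>; // (distance, node)
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> candidates; // min-heap
    std::priority_queue<Entry> results; // max-heap: current worst result on top
    std::vector<bool> visited(layer.neighbors.size(), false);
    const double entryDist = l2sq(embeddings[entryPoint], query);
    candidates.push({entryDist, entryPoint});
    results.push({entryDist, entryPoint});
    visited[entryPoint] = true;
    while (!candidates.empty()) {
        const Entry cur = candidates.top();
        candidates.pop();
        // Stop once the closest open candidate is worse than the worst kept result.
        if (results.size() >= ef && cur.first > results.top().first) {
            break;
        }
        for (offset_t nbr : layer.neighbors[cur.second]) {
            if (visited[nbr]) {
                continue;
            }
            visited[nbr] = true;
            const double d = l2sq(embeddings[nbr], query);
            if (results.size() < ef || d < results.top().first) {
                candidates.push({d, nbr});
                results.push({d, nbr});
                if (results.size() > ef) {
                    results.pop();
                }
            }
        }
    }
    std::vector<offset_t> out;
    while (!results.empty()) {
        out.push_back(results.top().second);
        results.pop();
    }
    std::reverse(out.begin(), out.end());
    return out;
}

// One insertion routine for both layers; callers pass the layer and its entry point.
void insertIntoLayer(Layer& layer, const std::vector<Vec>& embeddings, offset_t node,
    offset_t entryPoint, std::size_t ef, std::size_t maxDegree) {
    std::vector<offset_t> nearest = searchLayer(layer, embeddings, embeddings[node], entryPoint, ef);
    if (nearest.size() > maxDegree) {
        nearest.resize(maxDegree);
    }
    for (offset_t nbr : nearest) {
        layer.neighbors[node].push_back(nbr);
        layer.neighbors[nbr].push_back(node); // reverse edge; pruning ("shrink") is omitted
    }
}
```

With this shape, inserting into the upper layer and into the lower layer are the same call with different arguments, for example using the node found in the upper layer as the entry point for the lower layer.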

Problem 3 is a separate problem. It's more like a bug than a design decision. For Problems 1 and 2, I think we can move to the following design:

  • Remove the HNSWGraph class hierarchy and only have an InMemHNSWGraph class: If you can re-use OnDiskGraph as is, we don't need a class hierarchy here and can have just an InMemHNSWGraph class. You could consider renaming it to HNSWGraphBuilder, because the main point of this class is not that it is an in-memory implementation but that it is used for building the HNSWGraph. In general, I find it better to name classes and fields after their functionality.
  • HNSWIndex class hierarchy: Here we could keep the same class hierarchy of HNSWIndex (or BaseHNSWIndex), InMemHNSWIndex (or HNSWIndexBuilder), and OnDiskHNSWIndex. HNSWIndex (or BaseHNSWIndex) would contain common fields such as the config and the Embedding* field. InMemHNSWIndex would have two InMemHNSWGraph fields and coordinate inserting into both the lower and upper InMemHNSWGraph (though with a single common search function and the correct upper layer insertion algorithm). Finally, OnDiskHNSWIndex would contain two OnDiskGraphs for the upper and lower layers and implement the search functionality.
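
A skeleton of that layout could look roughly like the following. Every type here is a placeholder written for this sketch: `HNSWIndexConfig`, `EmbeddingColumn`, and the empty `OnDiskGraph` stand in for the real classes, and the member names are assumptions.

```cpp
#include <cstdint>
#include <vector>

using offset_t = std::uint64_t;

// Placeholder stand-ins; the real types live elsewhere in the codebase.
struct HNSWIndexConfig {
    std::uint64_t maxDegree = 0;
};
struct EmbeddingColumn {};
struct OnDiskGraph {}; // stand-in for the existing rel-table-backed graph

// Builder-side graph: exposes only what index construction needs.
class InMemHNSWGraph {
public:
    explicit InMemHNSWGraph(offset_t numNodes) : neighbors(numNodes) {}
    void createDirectedEdge(offset_t src, offset_t dst) { neighbors[src].push_back(dst); }
    const std::vector<offset_t>& getNeighbors(offset_t node) const { return neighbors[node]; }

private:
    std::vector<std::vector<offset_t>> neighbors;
};

// Base class holds shared state only; no search or insert functions live here.
class HNSWIndex {
public:
    HNSWIndex(HNSWIndexConfig config, const EmbeddingColumn* embeddings)
        : config(config), embeddings(embeddings) {}
    virtual ~HNSWIndex() = default;

protected:
    HNSWIndexConfig config;
    const EmbeddingColumn* embeddings;
};

// Builder: owns two in-memory graphs and would coordinate insertion into both.
class InMemHNSWIndex final : public HNSWIndex {
public:
    InMemHNSWIndex(HNSWIndexConfig config, const EmbeddingColumn* embeddings, offset_t numNodes)
        : HNSWIndex(config, embeddings), upper(numNodes), lower(numNodes) {}

    InMemHNSWGraph upper;
    InMemHNSWGraph lower;
};

// Search side: wraps two on-disk graphs and would implement the search functions.
class OnDiskHNSWIndex final : public HNSWIndex {
public:
    using HNSWIndex::HNSWIndex;

    OnDiskGraph upper;
    OnDiskGraph lower;
};
```

The point of the split is that neither subclass inherits functions it has to mark as UNREACHABLE: insertion lives only on the builder, and search lives only on the on-disk index.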

Other than that, I have several concerns that call for sitting down with Ziyi and designing a long-term solution for the binding, planning, and mapping phases of table functions. I'm not clear on the roadmap for how we build additional indices in the system as extensions and what interfaces the extension builder implements. I see some rewriting into CREATE REL TABLE statements, but then, because of how we create rel tables, there is some custom logic that gets deep into map_copy_rel. That does not seem like a way to create an extension framework. We need to think about clear interfaces and contracts between the core system and the extensions that will build indices. I think we can sacrifice some efficiency here if we have to.

Update: As per our discussions, let's also check that we can successfully error on a query like this:

MATCH (a:Docs)
CALL CREATE_HNSW_INDEX( ... )
RETURN *

The RETURN statement might let this query run. Without RETURN the parser apparently should be erroring. With the RETURN, we might need some error checking somewhere.

explicit CreateHNSWLocalState(common::offset_t numNodes) : visited{numNodes} {}
};

struct QueryHNSWIndexBindData final : SimpleTableFuncBindData {
Contributor

Many of the fields you keep here do not seem to belong to the binding phase. They look like fields you should construct after mapping phase. Can you propose how to clean this up (and possibly also clean up other table functions)? If possible, I would also do this fix in this PR.

Contributor Author

I temporarily moved the storage-level fields into the shared state, which is a fine solution until I can start to refactor the mapping of table functions. I put a todo item under the "refactoring" category for now.

@ray6080 ray6080 force-pushed the vector-index branch 3 times, most recently from c51317f to 524370e Compare January 7, 2025 08:10
github-actions bot commented Jan 8, 2025

Benchmark Result

Master commit hash: decc6d5823880dbb9ae7194d12d608decb7ec5a8
Branch commit hash: 341381579a5f124fdd173828f4868d9135985531

Query Group Query Name Mean Time - Commit (ms) Mean Time - Master (ms) Diff
aggregation q24 647.77 643.95 3.82 (0.59%)
aggregation q28 11403.33 11660.75 -257.42 (-2.21%)
copy node-Comment 72326.84 68451.44 3875.40 (5.66%)
copy node-Forum 5712.24 5488.48 223.76 (4.08%)
copy node-Organisation 1169.71 1208.26 -38.55 (-3.19%)
copy node-Person 2278.55 2139.74 138.81 (6.49%)
copy node-Place 1189.12 1147.13 41.99 (3.66%)
copy node-Post 35266.63 30123.40 5143.23 (17.07%)
copy node-Tag 1307.28 1226.60 80.68 (6.58%)
copy node-Tagclass 1119.75 1173.21 -53.46 (-4.56%)
copy rel-comment-hasCreator 57707.15 57910.42 -203.27 (-0.35%)
copy rel-comment-hasTag 89729.24 90209.68 -480.44 (-0.53%)
copy rel-comment-isLocatedIn 71032.39 71367.13 -334.74 (-0.47%)
copy rel-containerOf 14119.24 15333.46 -1214.22 (-7.92%)
copy rel-forum-hasTag 4053.46 4014.23 39.23 (0.98%)
copy rel-hasInterest 3100.11 3075.03 25.08 (0.82%)
copy rel-hasMember 125578.14 123809.85 1768.29 (1.43%)
copy rel-hasModerator 1299.84 1345.72 -45.88 (-3.41%)
copy rel-hasType 223.71 284.08 -60.37 (-21.25%)
copy rel-isPartOf 254.97 325.02 -70.05 (-21.55%)
copy rel-isSubclassOf 293.80 271.51 22.29 (8.21%)
copy rel-knows 14147.27 13915.39 231.88 (1.67%)
copy rel-likes-comment 179303.38 181294.75 -1991.37 (-1.10%)
copy rel-likes-post 72670.40 67694.13 4976.27 (7.35%)
copy rel-organisation-isLocatedIn 251.34 244.20 7.14 (2.92%)
copy rel-person-isLocatedIn 455.79 513.90 -58.11 (-11.31%)
copy rel-post-hasCreator 14165.07 14630.62 -465.55 (-3.18%)
copy rel-post-hasTag 22887.12 22502.55 384.57 (1.71%)
copy rel-post-isLocatedIn 18650.37 18303.12 347.25 (1.90%)
copy rel-replyOf-comment 49516.54 47388.20 2128.34 (4.49%)
copy rel-replyOf-post 39201.19 37680.53 1520.66 (4.04%)
copy rel-studyAt 887.89 761.69 126.20 (16.57%)
copy rel-workAt 1596.56 1564.97 31.59 (2.02%)
filter q14 123.61 129.08 -5.47 (-4.24%)
filter q15 118.60 125.65 -7.05 (-5.61%)
filter q16 315.43 309.40 6.03 (1.95%)
filter q17 441.27 446.79 -5.52 (-1.24%)
filter q18 2079.98 1934.79 145.19 (7.50%)
filter zonemap-node 82.30 89.14 -6.84 (-7.67%)
filter zonemap-node-lhs-cast 82.42 89.31 -6.89 (-7.72%)
filter zonemap-node-null 78.27 85.19 -6.92 (-8.13%)
filter zonemap-rel 5972.73 5756.87 215.86 (3.75%)
fixed_size_expr_evaluator q07 571.95 576.03 -4.08 (-0.71%)
fixed_size_expr_evaluator q08 800.94 803.63 -2.68 (-0.33%)
fixed_size_expr_evaluator q09 801.71 807.16 -5.45 (-0.68%)
fixed_size_expr_evaluator q10 238.50 239.15 -0.65 (-0.27%)
fixed_size_expr_evaluator q11 229.08 233.61 -4.53 (-1.94%)
fixed_size_expr_evaluator q12 226.14 226.41 -0.26 (-0.12%)
fixed_size_expr_evaluator q13 1450.67 1459.02 -8.35 (-0.57%)
fixed_size_seq_scan q23 115.75 108.29 7.45 (6.88%)
join q29 627.83 617.79 10.04 (1.62%)
join q30 10836.72 10120.48 716.24 (7.08%)
join q31 7.61 7.32 0.29 (4.01%)
join SelectiveTwoHopJoin 54.90 52.58 2.32 (4.41%)
ldbc_snb_ic q35 2602.47 2506.95 95.52 (3.81%)
ldbc_snb_ic q36 490.54 469.19 21.35 (4.55%)
ldbc_snb_is q32 4.30 3.87 0.44 (11.27%)
ldbc_snb_is q33 15.70 13.72 1.98 (14.40%)
ldbc_snb_is q34 1.22 1.50 -0.28 (-18.42%)
multi-rel multi-rel-large-scan 1410.80 1302.00 108.81 (8.36%)
multi-rel multi-rel-lookup 15.41 24.38 -8.97 (-36.79%)
multi-rel multi-rel-small-scan 64.17 76.69 -12.52 (-16.33%)
order_by q25 126.52 129.16 -2.64 (-2.04%)
order_by q26 451.31 448.84 2.47 (0.55%)
order_by q27 1524.52 1456.60 67.92 (4.66%)
recursive_join recursive-join-bidirection 281.40 281.94 -0.54 (-0.19%)
recursive_join recursive-join-dense 7430.68 7322.47 108.22 (1.48%)
recursive_join recursive-join-path 24011.16 24035.02 -23.86 (-0.10%)
recursive_join recursive-join-sparse 1077.10 1056.00 21.10 (2.00%)
recursive_join recursive-join-trail 7426.91 7294.89 132.02 (1.81%)
scan_after_filter q01 169.41 170.53 -1.12 (-0.66%)
scan_after_filter q02 152.08 155.05 -2.97 (-1.92%)
shortest_path_ldbc100 q37 80.19 90.35 -10.15 (-11.24%)
shortest_path_ldbc100 q38 353.54 371.71 -18.18 (-4.89%)
shortest_path_ldbc100 q39 68.36 61.64 6.72 (10.90%)
shortest_path_ldbc100 q40 393.71 400.90 -7.19 (-1.79%)
var_size_expr_evaluator q03 2064.37 2059.42 4.95 (0.24%)
var_size_expr_evaluator q04 2325.16 2222.65 102.51 (4.61%)
var_size_expr_evaluator q05 2627.63 2606.73 20.89 (0.80%)
var_size_expr_evaluator q06 1337.74 1317.42 20.33 (1.54%)
var_size_seq_scan q19 1472.32 1444.53 27.79 (1.92%)
var_size_seq_scan q20 2700.18 2679.22 20.96 (0.78%)
var_size_seq_scan q21 2297.79 2317.01 -19.22 (-0.83%)
var_size_seq_scan q22 127.12 126.87 0.25 (0.19%)

codecov bot commented Jan 8, 2025

Codecov Report

Attention: Patch coverage is 89.38776% with 104 lines in your changes missing coverage. Please review.

Project coverage is 86.31%. Comparing base (080cbc0) to head (252142c).
Report is 1 commits behind head on master.

Files with missing lines Patch % Lines
...catalog/catalog_entry/hnsw_index_catalog_entry.cpp 45.00% 22 Missing ⚠️
src/storage/index/hnsw_config.cpp 80.00% 21 Missing ⚠️
src/storage/index/index_utils.cpp 74.28% 9 Missing ⚠️
test/test_runner/test_runner.cpp 0.00% 6 Missing ⚠️
src/function/table/call/hnsw/drop_hnsw_index.cpp 85.29% 5 Missing ⚠️
...include/function/table/hnsw/hnsw_index_functions.h 76.19% 5 Missing ⚠️
src/include/storage/index/hnsw_index.h 80.00% 5 Missing ⚠️
src/storage/index/hnsw_index_utils.cpp 83.87% 5 Missing ⚠️
src/catalog/catalog_entry/catalog_entry.cpp 0.00% 3 Missing ⚠️
src/include/storage/index/hnsw_graph.h 86.36% 3 Missing ⚠️
... and 12 more
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #4578      +/-   ##
==========================================
+ Coverage   86.22%   86.31%   +0.08%     
==========================================
  Files        1369     1384      +15     
  Lines       58232    59127     +895     
  Branches     7206     7288      +82     
==========================================
+ Hits        50213    51034     +821     
- Misses       7855     7929      +74     
  Partials      164      164              
