HNSW-based vector index #4578
Conversation
I have several comments throughout, but most are small or medium-level. My broader comments are these. To organize my thoughts, let me first list several problems in the current implementation, and then propose some changes that I think address them.
Problem 1: HNSWGraph class design: There are several problems here.
Problem 1.1: Some of the functions in HNSWGraph are not used in OnDiskHNSWGraph, which only supports searching. So insertion-related functions, e.g., createDirectedEdge, finalize, shrink, etc., are marked as UNREACHABLE but are still inherited by OnDiskHNSWGraph. You have a similar problem in the HNSWIndex class hierarchy.
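To make Problem 1.1 concrete, here is a simplified illustration of how this shape typically manifests; this is not the PR's actual code, and everything except the method names createDirectedEdge, finalize, and shrink is a placeholder:

#include <cstdint>

// Simplified illustration of the inheritance problem described above: the base
// class forces insertion APIs onto a subclass that only ever searches.
class HNSWGraph {
public:
    virtual ~HNSWGraph() = default;
    // Insertion-related APIs declared on the base class...
    virtual void createDirectedEdge(uint64_t srcOffset, uint64_t dstOffset) = 0;
    virtual void finalize() = 0;
    virtual void shrink(uint64_t nodeOffset) = 0;
};

class OnDiskHNSWGraph final : public HNSWGraph {
public:
    // ...so the search-only on-disk graph has to stub every one of them out.
    void createDirectedEdge(uint64_t, uint64_t) override { /* UNREACHABLE */ }
    void finalize() override { /* UNREACHABLE */ }
    void shrink(uint64_t) override { /* UNREACHABLE */ }
};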
Problem 1.2: I am not convinced you need OnDiskHNSWGraph at all. We have an OnDiskGraph implementation that already supports the functionality you need, namely getting the neighbors of nodes from a rel table, and it provides an iterator interface.
Problem 2: HNSWIndex design: You have a similar (but opposite) problem to Problem 1.1. It looks like HNSWIndex contains search-related functions, such as searchUpperLayer and searchLowerLayer, that are not needed by InMemHNSWIndex. In some sense, InMemHNSWIndex also needs search functions for the upper and lower layers to do insertions, but for InMemHNSWIndex these can be the same function (see my comment on Problem 3 below).
Problem 3: HNSWIndex insert functions: Your InMemHNSWIndex::insertToUpperLayer and InMemHNSWIndex::insertToLowerLayer functions deviate from the design doc. These can actually be the same function that takes in an entry point (the first node).
Problem 3 is a separate issue; it's more of a bug than a design decision. For Problems 1 and 2, I think we can move to the following design (a rough skeleton follows the list):
- Remove the HNSWGraph class hierarchy and only have an InMemHNSWGraph class: If you can reuse OnDiskGraph as is, we don't need a class hierarchy here. You could consider renaming InMemHNSWGraph to HNSWGraphBuilder, because the main point of this class is not that it's an in-memory implementation but that it's used for building the HNSW graph. In general, I find it better to name classes and fields based on functionality.
- HNSWIndex class hierarchy: Here we could still have the same class hierarchy of HNSWIndex (or BaseHNSWIndex), InMemHNSWIndex (or HNSWIndexBuilder), and OnDiskHNSWIndex. HNSWIndex (or BaseHNSWIndex) would contain common fields such as the config and the Embedding* field. InMemHNSWIndex would hold two InMemHNSWGraph fields and coordinate inserting into both the lower and upper InMemHNSWGraph (with a single common search function and a correct upper-layer insertion algorithm). Finally, OnDiskHNSWIndex would contain two OnDiskGraphs for the upper and lower layers and implement the search functionality.
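A rough skeleton of what this could look like. All type and member names below are illustrative placeholders rather than the PR's actual declarations, and it also folds in the single shared insert routine from Problem 3:

#include <cstdint>
#include <memory>

// Placeholder types standing in for the real config / embedding / graph classes.
struct HNSWConfig {};
struct EmbeddingColumn {};
class InMemHNSWGraph {};   // a.k.a. HNSWGraphBuilder; build-time, in-memory layer graph
class OnDiskGraph {};      // the existing on-disk graph with an iterator interface

// Common fields shared by the builder and the on-disk index.
class BaseHNSWIndex {
public:
    virtual ~BaseHNSWIndex() = default;
protected:
    HNSWConfig config;
    EmbeddingColumn* embeddings = nullptr;
};

// Build-only: owns the two in-memory layer graphs and coordinates insertion
// into both, using one shared insert routine parameterized by the target
// layer and its entry point (Problem 3).
class HNSWIndexBuilder : public BaseHNSWIndex {
public:
    void insert(uint64_t node) {
        // (in practice the upper layer is only updated for a sampled subset of nodes)
        insertToLayer(upperLayer, node, upperEntryPoint);
        insertToLayer(lowerLayer, node, lowerEntryPoint);
    }
private:
    void insertToLayer(InMemHNSWGraph& layer, uint64_t node, uint64_t entryPoint) {
        // greedy search from entryPoint, then connect `node` to its nearest neighbors
    }
    InMemHNSWGraph upperLayer, lowerLayer;
    uint64_t upperEntryPoint = 0, lowerEntryPoint = 0;
};

// Query-only: reuses OnDiskGraph for both layers and implements k-NN search.
class OnDiskHNSWIndex : public BaseHNSWIndex {
private:
    std::unique_ptr<OnDiskGraph> upperLayer, lowerLayer;
};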
Beyond that, I have broader concerns, and I want to sit down with Ziyi and design a long-term solution for the binding, planning, and mapping phases of table functions. I'm not clear on the roadmap for how we build additional indices in the system as extensions and what interfaces an extension builder implements. I see some rewriting into CREATE REL TABLE statements, but then, because of how we create rel tables, there is some custom logic that reaches deep into map_copy_rel. That does not seem like a basis for an extension framework. We need to think about clear interfaces and contracts between the core system and the extensions that will build indices. I think we can sacrifice some efficiency here if we have to.
Update: As per our discussions, let's also check that we can successfully error on a query like this:
MATCH (a:Docs)
CALL CREATE_HNSW_INDEX( ... )
RETURN *
The RETURN statement might let this query run. Without the RETURN, the parser apparently should already error; with the RETURN, we may need additional error checking somewhere.
explicit CreateHNSWLocalState(common::offset_t numNodes) : visited{numNodes} {}
};

struct QueryHNSWIndexBindData final : SimpleTableFuncBindData {
Many of the fields you keep here do not seem to belong to the binding phase. They look like fields you should construct after the mapping phase. Can you propose how to clean this up (and possibly also clean up other table functions)? If possible, I would also do this fix in this PR.
I temporarily moved the storage-level fields into the shared state, which is a fine solution until I can start refactoring the mapping of table functions. I've put a TODO item under the "refactoring" category for now.
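For illustration, a hedged sketch of the split being discussed. The field names are hypothetical placeholders; only the struct names QueryHNSWIndexBindData and SimpleTableFuncBindData come from the PR:

#include <cstdint>
#include <string>
#include <vector>

// Placeholder stand-ins for the engine's real classes.
struct SimpleTableFuncBindData { virtual ~SimpleTableFuncBindData() = default; };
struct NodeTable {};  // storage-level handle
struct RelTable {};   // storage-level handle

// Binding phase: only logical information resolved from the query itself.
struct QueryHNSWIndexBindData final : SimpleTableFuncBindData {
    std::string indexName;
    uint64_t k = 0;                   // number of nearest neighbors requested
    std::vector<float> queryVector;   // the query embedding
};

// Shared state: storage-level objects that only become available at or after
// the mapping phase live here instead of in the bind data.
struct QueryHNSWIndexSharedState {
    NodeTable* nodeTable = nullptr;
    RelTable* upperLayerRelTable = nullptr;
    RelTable* lowerLayerRelTable = nullptr;
};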
Codecov Report
Attention: Patch coverage is
Additional details and impacted files:
@@            Coverage Diff             @@
##           master    #4578      +/-   ##
==========================================
+ Coverage   86.22%   86.31%    +0.08%
==========================================
  Files        1369     1384       +15
  Lines       58232    59127      +895
  Branches     7206     7288       +82
==========================================
+ Hits        50213    51034      +821
- Misses       7855     7929       +74
  Partials      164      164
Description
This is the first version of the two-layer HNSW-based vector index. We keep two layers, each of which is a graph stored as a forward-only rel table.
Syntax
The supported syntax consists of the CREATE_HNSW_INDEX and QUERY_HNSW_INDEX call functions (see Query Plan below). In this PR, L2, Cosine, L2SQ, and Dot Product are supported as distance functions.
Query Plan
Note that the planning part should be refactored by @andyfengHKU after this PR is merged.
Index creation
The index creation follows a similar path to how COPY REL works. The CREATE_HNSW_INDEX call statement is compiled into three pipelines. The first one constructs the in-memory index, which consists of the upper- and lower-layer graphs, and converts the two in-memory graphs into flat tuples of src and dst node offsets, which are shared with the second and third pipelines. The second and third pipelines reuse RelBatchInsert to organize the tuples from the lower- and upper-layer graphs into node groups and flush them as rel tables.
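As a conceptual model only (the function and struct names below are hypothetical; only RelBatchInsert is an actual component named in this PR), the data flow between the three pipelines looks roughly like this:

#include <cstdint>
#include <vector>

// Flat tuple produced by pipeline 1 for each directed edge of a layer graph.
struct FlatEdge {
    uint64_t srcNodeOffset;
    uint64_t dstNodeOffset;
};

// Pipeline 1: build the in-memory upper/lower layer graphs from the cached
// embedding column, then flatten each graph into (src, dst) offset tuples
// that are shared with the next two pipelines.
struct InMemIndexOutput {
    std::vector<FlatEdge> upperLayerEdges;  // consumed by pipeline 3
    std::vector<FlatEdge> lowerLayerEdges;  // consumed by pipeline 2
};
InMemIndexOutput buildInMemHNSWIndex(/* node table + embedding column */);

// Pipelines 2 and 3: reuse the RelBatchInsert machinery to organize the flat
// tuples into node groups and flush them as the lower/upper rel tables.
void batchInsertAsRelTable(const std::vector<FlatEdge>& edges /*, target rel table */);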
Note that the above procedure assumes we create two rel tables, one for the upper layer and one for the lower layer. This is done through the rewriteFunc, which generates a query string that is executed before the actual index creation function.
Also, in this PR, to simplify the design and improve performance, we cache the whole embedding column from the node table. This can improve performance by avoiding random reads from disk, but it reduces the scalability of the index in the longer term. For example, assuming embeddings are not compressed, a single FLOAT[1024] embedding is 4KB, which is a reasonable size to read from disk, while caching 10M such embeddings in memory costs 40GB.
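A quick back-of-the-envelope check of those numbers:

#include <cstdio>

// Uncompressed FLOAT[1024]: 1024 * 4 bytes = 4096 bytes (~4 KB) per embedding;
// caching 10 million such embeddings costs roughly 40 GB.
int main() {
    const double bytesPerEmbedding = 1024.0 * sizeof(float);  // 4096 bytes
    const double numEmbeddings = 1e7;                         // 10 million rows
    const double totalGB = bytesPerEmbedding * numEmbeddings / 1e9;
    std::printf("per embedding: %.0f bytes, total cache: %.1f GB\n",
                bytesPerEmbedding, totalGB);                  // ~41 GB
    return 0;
}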
Index search
The index search (QUERY_HNSW_INDEX) is compiled into a hash join. The build side is the table function, which searches the index and returns k tuples; the probe side is a scan of the base node table. Note that we force a semi mask to be passed from the build side to the probe side to reduce unnecessary scans, assuming k is small. Eventually, we should let the optimizer decide, based on cardinality, which side is the build side and whether to pass the semi mask.
Also, we may eventually replace the hash join with lookups, but not in this first PR.
For reviewer:
Pointers to important changes in the PR: