HNSW-based vector index #4578
Conversation
I have several comments throughout, but most are small or medium-level. My broader comments are these. To organize my thoughts, let me first list several problems in the current implementation, and then propose some changes that I think address them.
Problem 1: HNSWGraph class design: There are several problems here.
Problem 1.1: Some of the functions in HNSWGraph are not used in OnDiskHNSWGraph, which only supports searching. So insertion-related functions, e.g., createDirectedEdge, finalize, shrink, etc., are marked as UNREACHABLE but are still inherited by OnDiskHNSWGraph. You have a similar problem in the HNSWIndex class hierarchy.
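To make Problem 1.1 concrete, here is a simplified illustration of how this shape typically manifests; this is not the PR's actual code, and everything except the method names createDirectedEdge, finalize, and shrink is a placeholder:

#include <cstdint>

// Simplified illustration of the inheritance problem described above: the base
// class forces insertion APIs onto a subclass that only ever searches.
class HNSWGraph {
public:
    virtual ~HNSWGraph() = default;
    // Insertion-related APIs declared on the base class...
    virtual void createDirectedEdge(uint64_t srcOffset, uint64_t dstOffset) = 0;
    virtual void finalize() = 0;
    virtual void shrink(uint64_t nodeOffset) = 0;
};

class OnDiskHNSWGraph final : public HNSWGraph {
public:
    // ...so the search-only on-disk graph has to stub every one of them out.
    void createDirectedEdge(uint64_t, uint64_t) override { /* UNREACHABLE */ }
    void finalize() override { /* UNREACHABLE */ }
    void shrink(uint64_t) override { /* UNREACHABLE */ }
};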
Problem 1.2: I am not convinced you need OnDiskHNSWGraph at all. We have an OnDiskGraph implementation that already supports the functionality you need, namely getting the neighbors of nodes from a rel table, and it provides an iterator interface.
Problem 2: HNSWIndex design: You have a similar (but opposite) problem to Problem 1.1. It looks like HNSWIndex contains search-related functions, such as searchUpperLayer and searchLowerLayer, that are not needed by InMemHNSWIndex. In some sense, InMemHNSWIndex also needs search functions for the upper and lower layers to do insertions, but for InMemHNSWIndex these can be the same function (see my comment on Problem 3 below).
Problem 3: HNSWIndex insert functions: Your InMemHNSWIndex::insertToUpperLayer and InMemHNSWIndex::insertToLowerLayer functions deviate from the design doc. These can actually be the same function that takes in an entry point (the first node).
Problem 3 is a separate issue; it's more of a bug than a design decision. For Problems 1 and 2, I think we can move to the following design (a rough skeleton follows the list):
- Remove the HNSWGraph class hierarchy and only have an InMemHNSWGraph class: If you can reuse OnDiskGraph as is, we don't need a class hierarchy here. You could consider renaming InMemHNSWGraph to HNSWGraphBuilder, because the main point of this class is not that it's an in-memory implementation but that it's used for building the HNSW graph. In general, I find it better to name classes and fields based on functionality.
- HNSWIndex class hierarchy: Here we could still have the same class hierarchy of HNSWIndex (or BaseHNSWIndex), InMemHNSWIndex (or HNSWIndexBuilder), and OnDiskHNSWIndex. HNSWIndex (or BaseHNSWIndex) would contain common fields such as the config and the Embedding* field. InMemHNSWIndex would hold two InMemHNSWGraph fields and coordinate inserting into both the lower and upper InMemHNSWGraph (with a single common search function and a correct upper-layer insertion algorithm). Finally, OnDiskHNSWIndex would contain two OnDiskGraphs for the upper and lower layers and implement the search functionality.
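A rough skeleton of what this could look like. All type and member names below are illustrative placeholders rather than the PR's actual declarations, and it also folds in the single shared insert routine from Problem 3:

#include <cstdint>
#include <memory>

// Placeholder types standing in for the real config / embedding / graph classes.
struct HNSWConfig {};
struct EmbeddingColumn {};
class InMemHNSWGraph {};   // a.k.a. HNSWGraphBuilder; build-time, in-memory layer graph
class OnDiskGraph {};      // the existing on-disk graph with an iterator interface

// Common fields shared by the builder and the on-disk index.
class BaseHNSWIndex {
public:
    virtual ~BaseHNSWIndex() = default;
protected:
    HNSWConfig config;
    EmbeddingColumn* embeddings = nullptr;
};

// Build-only: owns the two in-memory layer graphs and coordinates insertion
// into both, using one shared insert routine parameterized by the target
// layer and its entry point (Problem 3).
class HNSWIndexBuilder : public BaseHNSWIndex {
public:
    void insert(uint64_t node) {
        // (in practice the upper layer is only updated for a sampled subset of nodes)
        insertToLayer(upperLayer, node, upperEntryPoint);
        insertToLayer(lowerLayer, node, lowerEntryPoint);
    }
private:
    void insertToLayer(InMemHNSWGraph& layer, uint64_t node, uint64_t entryPoint) {
        // greedy search from entryPoint, then connect `node` to its nearest neighbors
    }
    InMemHNSWGraph upperLayer, lowerLayer;
    uint64_t upperEntryPoint = 0, lowerEntryPoint = 0;
};

// Query-only: reuses OnDiskGraph for both layers and implements k-NN search.
class OnDiskHNSWIndex : public BaseHNSWIndex {
private:
    std::unique_ptr<OnDiskGraph> upperLayer, lowerLayer;
};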
Beyond that, I have broader concerns, and I want to sit down with Ziyi and design a long-term solution for the binding, planning, and mapping phases of table functions. I'm not clear on the roadmap for how we build additional indices in the system as extensions and what interfaces an extension builder implements. I see some rewriting into CREATE REL TABLE statements, but then, because of how we create rel tables, there is some custom logic that reaches deep into map_copy_rel. That does not seem like a basis for an extension framework. We need to think about clear interfaces and contracts between the core system and the extensions that will build indices. I think we can sacrifice some efficiency here if we have to.
Update: As per our discussions, let's also check that we can successfully error on a query like this:
MATCH (a:Docs)
CALL CREATE_HNSW_INDEX( ... )
RETURN *
The RETURN statement might let this query run. Without the RETURN, the parser apparently should already error; with the RETURN, we may need additional error checking somewhere.
explicit CreateHNSWLocalState(common::offset_t numNodes) : visited{numNodes} {}
};

struct QueryHNSWIndexBindData final : SimpleTableFuncBindData {
Many of the fields you keep here do not seem to belong to the binding phase. They look like fields you should construct after the mapping phase. Can you propose how to clean this up (and possibly also clean up other table functions)? If possible, I would also do this fix in this PR.
I temporarily moved the storage-level fields into the shared state, which is a fine solution until I can start refactoring the mapping of table functions. I've put a TODO item under the "refactoring" category for now.
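For illustration, a hedged sketch of the split being discussed. The field names are hypothetical placeholders; only the struct names QueryHNSWIndexBindData and SimpleTableFuncBindData come from the PR:

#include <cstdint>
#include <string>
#include <vector>

// Placeholder stand-ins for the engine's real classes.
struct SimpleTableFuncBindData { virtual ~SimpleTableFuncBindData() = default; };
struct NodeTable {};  // storage-level handle
struct RelTable {};   // storage-level handle

// Binding phase: only logical information resolved from the query itself.
struct QueryHNSWIndexBindData final : SimpleTableFuncBindData {
    std::string indexName;
    uint64_t k = 0;                   // number of nearest neighbors requested
    std::vector<float> queryVector;   // the query embedding
};

// Shared state: storage-level objects that only become available at or after
// the mapping phase live here instead of in the bind data.
struct QueryHNSWIndexSharedState {
    NodeTable* nodeTable = nullptr;
    RelTable* upperLayerRelTable = nullptr;
    RelTable* lowerLayerRelTable = nullptr;
};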
Codecov Report
Attention: Patch coverage is
Additional details and impacted files:
@@            Coverage Diff             @@
##           master    #4578      +/-   ##
==========================================
+ Coverage   86.22%   86.31%    +0.08%
==========================================
  Files        1369     1384       +15
  Lines       58232    59127      +895
  Branches     7206     7288       +82
==========================================
+ Hits        50213    51034      +821
- Misses       7855     7929       +74
  Partials      164      164
Description
This is the first version of the two-layer HNSW-based vector index. We keep two layers, each of which is a graph stored as a forward-only rel table.
Syntax
The supported syntax consists of the CREATE_HNSW_INDEX and QUERY_HNSW_INDEX call functions (see Query Plan below). In this PR, L2, Cosine, L2SQ, and Dot Product are supported as distance functions.
Query Plan
Note that the planning part should be refactored by @andyfengHKU after this PR is merged.
Index creation
The index creation follows a similar path to how COPY REL works. The CREATE_HNSW_INDEX call statement is compiled into three pipelines. The first one constructs the in-memory index, which consists of the upper- and lower-layer graphs, and converts the two in-memory graphs into flat tuples of src and dst node offsets, which are shared with the second and third pipelines. The second and third pipelines reuse RelBatchInsert to organize the tuples from the lower- and upper-layer graphs into node groups and flush them as rel tables.
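As a conceptual model only (the function and struct names below are hypothetical; only RelBatchInsert is an actual component named in this PR), the data flow between the three pipelines looks roughly like this:

#include <cstdint>
#include <vector>

// Flat tuple produced by pipeline 1 for each directed edge of a layer graph.
struct FlatEdge {
    uint64_t srcNodeOffset;
    uint64_t dstNodeOffset;
};

// Pipeline 1: build the in-memory upper/lower layer graphs from the cached
// embedding column, then flatten each graph into (src, dst) offset tuples
// that are shared with the next two pipelines.
struct InMemIndexOutput {
    std::vector<FlatEdge> upperLayerEdges;  // consumed by pipeline 3
    std::vector<FlatEdge> lowerLayerEdges;  // consumed by pipeline 2
};
InMemIndexOutput buildInMemHNSWIndex(/* node table + embedding column */);

// Pipelines 2 and 3: reuse the RelBatchInsert machinery to organize the flat
// tuples into node groups and flush them as the lower/upper rel tables.
void batchInsertAsRelTable(const std::vector<FlatEdge>& edges /*, target rel table */);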
Note that the above procedure assumes we create two rel tables, one for the upper layer and one for the lower layer. This is done through the rewriteFunc, which generates a query string that is executed before the actual index creation function.
Also, in this PR, to simplify the design and improve performance, we cache the whole embedding column from the node table. This can improve performance by avoiding random reads from disk, but it reduces the scalability of the index in the longer term. For example, assuming embeddings are not compressed, a single FLOAT[1024] embedding is 4KB, which is a reasonable size to read from disk, while caching 10M such embeddings in memory costs 40GB.
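A quick back-of-the-envelope check of those numbers:

#include <cstdio>

// Uncompressed FLOAT[1024]: 1024 * 4 bytes = 4096 bytes (~4 KB) per embedding;
// caching 10 million such embeddings costs roughly 40 GB.
int main() {
    const double bytesPerEmbedding = 1024.0 * sizeof(float);  // 4096 bytes
    const double numEmbeddings = 1e7;                         // 10 million rows
    const double totalGB = bytesPerEmbedding * numEmbeddings / 1e9;
    std::printf("per embedding: %.0f bytes, total cache: %.1f GB\n",
                bytesPerEmbedding, totalGB);                  // ~41 GB
    return 0;
}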
Index search
The index search (QUERY_HNSW_INDEX) is compiled into a hash join. The build side is the table function, which searches the index and returns k tuples; the probe side is a scan of the base node table. Note that we force a semi mask to be passed from the build side to the probe side to reduce unnecessary scans, assuming k is small. Eventually, we should let the optimizer decide, based on cardinality, which side is the build side and whether to pass the semi mask.
Also, we may eventually replace the hash join with lookups, but not in this first PR.
For reviewer:
Pointers to important changes in the PR: