You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
As mentioned in the readme of this repository, the HSNW index is used for queries involving a constant vector as the second term of a distance function.
Consider the following table:
CREATETABLE "items" (
id BLOB,
my_text VARCHAR,
embedding FLOAT32[3072]
);
CREATEINDEXitems_embedding_indexON'items' USING HNSW (embedding) WITH (metric ='cosine')
In my use case, I need to find the records that are closer than a given threshold to a fixed item.
In my app, I hold a reference to such item via its id, not the embedding itself.
Whenever I need to retrieve similar items I have to:
Retrieve the embedding my_item_embedding with the query SELECT embedding FROM items WHERE id = my_item_id
Run this query to retrieve similar items:
SELECT id, my_text, array_cosine_distance(embedding, <my_item_embedding>::FLOAT[3072]) AS dist
FROM items
ORDER BY dist
LIMIT1000
This is too slow for my purpose, as in point (a) the embedding vector is copied from the db in the Python runtime of my app, and then back to the db in point (b).
Ideally, the embedding should not leave the database, however, if I run the following query, the engine does not use the HNSW index and, instead, use two sequential scans taking much more time.
SELECTi.id, i.my_textFROM items AS i
INNER JOIN items as j ON array_cosine_distance(i.embedding, j.embedding) <0.15WHEREj.id= my_item_id
My questions:
Is it possible to write a single query to achieve the same result as the one obtained using steps (i) and (ii)?
If not, is there a way to avoid the clause LIMIT 1000 in step (ii) and still ensure that the engine use the HSNW index? At the moment I am using a value that is way bigger than the expected maximum size of any cluster, but it's a hacky workaround. Shouldn't the HNSW index be used automatically when the WHRE/ON clause contains a distance function?
Thank you a lot! (note that I am not a db expert, please forgive any nonsense I may have written)
The text was updated successfully, but these errors were encountered:
As mentioned in the readme of this repository, the HSNW index is used for queries involving a constant vector as the second term of a distance function.
Consider the following table:
In my use case, I need to find the records that are closer than a given threshold to a fixed item.
In my app, I hold a reference to such item via its id, not the embedding itself.
Whenever I need to retrieve similar items I have to:
my_item_embedding
with the querySELECT embedding FROM items WHERE id = my_item_id
This is too slow for my purpose, as in point (a) the embedding vector is copied from the db in the Python runtime of my app, and then back to the db in point (b).
Ideally, the embedding should not leave the database, however, if I run the following query, the engine does not use the HNSW index and, instead, use two sequential scans taking much more time.
My questions:
LIMIT 1000
in step (ii) and still ensure that the engine use the HSNW index? At the moment I am using a value that is way bigger than the expected maximum size of any cluster, but it's a hacky workaround. Shouldn't the HNSW index be used automatically when the WHRE/ON clause contains a distance function?Thank you a lot! (note that I am not a db expert, please forgive any nonsense I may have written)
The text was updated successfully, but these errors were encountered: