Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MATCH Query Always Using Sequential Scan, Ignoring Index Scan #2137

Open
pritish-moharir opened this issue Dec 2, 2024 · 8 comments
Open
Labels
question Further information is requested

Comments

@pritish-moharir
Copy link

pritish-moharir commented Dec 2, 2024

Hi everyone,

I'm encountering an issue with Apache AGE where a complex MATCH query always defaults to using a sequential scan, even though indexes exist on the queried columns. Disabling sequential scans via SET enable_seqscan=off has no effect, and the query plan i.e. explain analyze output continues to show a sequential scan.

Query Example:

Here's a simplified version of the query we are using:

SELECT * FROM cypher('graph_name', $$  
MATCH (n1:NodeType1)  
WHERE n1.attribute1 = '<value1>'  
  AND n1.attribute2 IN ('<value2>')  
WITH n1  
OPTIONAL MATCH (n1)-[:RelType1_NodeType1]-(n2:NodeType2)  
WITH n1, n2  
OPTIONAL MATCH (n1)-[:RelType2_NodeType1]-(n3:NodeType3)  
WITH n1, n2, n3  
OPTIONAL MATCH (n1)-[:RelType3_NodeType1]-(n4:NodeType4)  
RETURN DISTINCT n1 AS Node1, n2 AS Node2, n3 AS Node3, n4 AS Node4  
$$) AS (result_column agtype);

Data Setup:

We have populated the graph with data using queries like the following :

SELECT * FROM cypher('graph_name', $$  
MERGE (n:NodeType1 {key1: "value1"})  
SET n.property1 = "value1",  
    n.property2 = "value2",  
    n.property3 = "value3",   
    ....
$$) AS (result_column agtype); 

Problem:

The query plan indicates that a sequential scan is being used on NodeType1 and other nodes, despite indexes being present on attribute1 and attribute2. For performance, we expect the query to utilize the indexes for an index scan.

Observed Behavior:

The query consistently uses sequential scans.
Setting enable_seqscan = off doesn't change the behavior.

Expected Behavior:

The query should leverage the indexes on NodeType1.attribute1 and NodeType1.attribute2 to perform an index scan.

Environment Details:

We are running a containerised apache age docker image on k8s.
Apache AGE version: release_PG16_1.5.0
PostgreSQL version: 16
K8S Version: v1.29.6

What We've Tried:

  • Verified that the relevant indexes exist.

  • Created indexes on individual properties, e.g., attribute1 and attribute2.
    CREATE INDEX idx_attribute1 ON graph_table USING btree ((properties->>'attribute1'));
    CREATE INDEX idx_attribute2 ON graph_table USING btree ((properties->>'attribute2'));

  • Created indexes on the entire properties column for broader coverage.
    CREATE INDEX idx_properties ON graph_table USING gin (properties);

  • Set enable_seqscan = off.

  • Rebuilt the indexes and reanalyzed the table using ANALYZE.

  • Simplified the query to test individual segments but observed the same issue.

  • The Merge queries are working as expected with indexes, as we see as significant difference in write latencies with and without indexes.

  • We have ingested a good amount of data (around 25k rows) so as to warrant an index scan.

Questions:

Why does the MATCH query ignore available indexes and use sequential scans?
Are there any specific configurations or query optimizations required to enable index scans in Apache AGE for graph queries?
Could this be a limitation or a bug in Apache AGE?
Any help or insights from the community would be greatly appreciated!

If additional information, logs, or examples are needed, please let me know.

Thank you!

@pritish-moharir pritish-moharir added the question Further information is requested label Dec 2, 2024
@MuhammadTahaNaveed
Copy link
Member

MuhammadTahaNaveed commented Dec 2, 2024

@pritish-moharir

You have to create index on the expression used by age to access a certain property.

issue_2137=# SELECT * FROM cypher('test', $$EXPLAIN (costs off) MATCH (n1:NodeType1)  
WHERE n1.name = 'node1'  
RETURN n1
$$) AS (result agtype);
                                              QUERY PLAN                                              
------------------------------------------------------------------------------------------------------
 Seq Scan on "NodeType1" n1
   Filter: (agtype_access_operator(VARIADIC ARRAY[properties, '"name"'::agtype]) = '"node1"'::agtype)
(2 rows)
issue_2137=# CREATE INDEX idx_btree_name
ON test."NodeType1"
USING btree (agtype_access_operator(VARIADIC ARRAY[properties, '"name"'::agtype]));
CREATE INDEX
issue_2137=# SELECT * FROM cypher('test', $$EXPLAIN (costs off) MATCH (n1:NodeType1)  
WHERE n1.name = 'node1'  
RETURN n1                                                                          
$$) AS (result agtype);
                                                QUERY PLAN                                                
----------------------------------------------------------------------------------------------------------
 Index Scan using idx_btree_name on "NodeType1" n1
   Index Cond: (agtype_access_operator(VARIADIC ARRAY[properties, '"name"'::agtype]) = '"node1"'::agtype)
(2 rows)

If you want to utilize gin index, you need to use the filter like MATCH (n {att1: 'value1'}), since containment operator in where clause is not supported as of now. Below is an example:

issue_2137=# CREATE INDEX idx_gin                                                        
ON test."NodeType1"
USING gin (properties);
CREATE INDEX
issue_2137=# SELECT * FROM cypher('test', $$EXPLAIN (costs off) MATCH (n1:NodeType1 {name: "Node1"}) 
RETURN n1
$$) AS (result agtype);
                           QUERY PLAN                            
-----------------------------------------------------------------
 Bitmap Heap Scan on "NodeType1" n1
   Recheck Cond: (properties @> '{"name": "Node1"}'::agtype)
   ->  Bitmap Index Scan on idx_gin
         Index Cond: (properties @> '{"name": "Node1"}'::agtype)
(4 rows)

Also, querying undirected paths can really slow down the performance, so consider using directed paths in MATCH clause wherever possible. I hope this helps.

@neerajx86
Copy link

neerajx86 commented Dec 3, 2024

@MuhammadTahaNaveed Thanks for the reply, I believe having this documented would help the community.

Also, is there a documentation for agtype_ functions as well?

@pritish-moharir
Copy link
Author

Hey @MuhammadTahaNaveed thank you for the quick response, we were able to successfully create and use index scan for one of the label properties for ex "NodeType1".name in this case.

A few questions we still have are :

  1. The other labels like "NodeType2", "NodeType3" are still running sequential scans, so we were wondering if there is a way to index on the joins that are happening?

  2. One of the queries we are using is like below:

SELECT * FROM cypher('graph_name', $$  
MATCH (n1:NodeType1)  
WHERE n1.attribute1 = '<value1>'  
  AND n1.attribute2 IN ['<value2>']  
WITH n1  
OPTIONAL MATCH (n1)-[:RelType1_NodeType1*1..1]-(n2:NodeType2)  
WITH n1, n2  
OPTIONAL MATCH (n1)-[:RelType2_NodeType1*1..1]-(n3:NodeType3)  
WITH n1, n2, n3  
OPTIONAL MATCH (n1)-[:RelType3_NodeType1*1..1]-(n4:NodeType4)  
WITH n1, n2, n3, n4  
OPTIONAL MATCH (n2)-[:RelType4_NodeType2*1..1]-(n5:NodeType5)  
WITH n1, n2, n3, n4, n5  
OPTIONAL MATCH (n3)-[:RelType5_NodeType3*1..1]-(n6:NodeType6)  
WITH n1, n2, n3, n4, n5, n6  
OPTIONAL MATCH (n5)-[:RelType6_NodeType5*1..1]-(n7:NodeType7)  
WITH n1, n2, n3, n4, n5, n6, n7  
WHERE n7.attribute3 = '<value3>'  
  AND n6.attribute4 = '<value4>'  
  AND n4.attribute5 = '<value5>'  
WITH DISTINCT(n1) AS n1, n1.attributeKey AS n1_key  
ORDER BY n1.attributeKey SKIP 0 LIMIT 10  
OPTIONAL MATCH (n1)-[:RelType7_NodeType1*1..1]-(n8:NodeType8)  
WITH DISTINCT n8 AS n8, n1, n1.attributeKey AS n1_key, n8.attributeKey AS n8_key  
ORDER BY n1.attributeKey, n8.attributeKey SKIP 0 LIMIT 10  
OPTIONAL MATCH (n1)-[:RelType8_NodeType1*1..1]-(n9:NodeType9)  
WITH DISTINCT n9 AS n9, n1, n8, n1.attributeKey AS n1_key, n8.attributeKey AS n8_key, n9.attributeKey AS n9_key  
ORDER BY n1.attributeKey, n8.attributeKey, n9.attributeKey SKIP 0 LIMIT 10  
OPTIONAL MATCH (n1)-[:RelType9_NodeType1*1..1]-(n10:NodeType10)-[:RelType10_NodeType10*1..1]-(n11:NodeType11)  
WITH DISTINCT n11 AS n11, n1, n8, n9, n1.attributeKey AS n1_key, n8.attributeKey AS n8_key, n9.attributeKey AS n9_key, n11.attributeKey AS n11_key  
ORDER BY n1.attributeKey, n8.attributeKey, n9.attributeKey, n11.attributeKey SKIP 0 LIMIT 10  
RETURN DISTINCT n1 AS NodeType1, n8 AS NodeType8, n9 AS NodeType9, n11 AS NodeType11  
$$) AS (n1 agtype, n8 agtype, n9 agtype, n11 agtype);

Its a bit complex but we were able to run the index scan for NodeType1 attribute1 and attribute2, but even after creating index on the other NodeType attributes which are also queried, they are running sequential scan. Is there something we are missing, do you have any suggestions as to what fields can be indexed here? I apologise if this is a very straightforward question but any inputs will be really helpful!

Please let me know if you need any additional information. Thanks!

@MuhammadTahaNaveed
Copy link
Member

@pritish-moharir You need to create index on id column of node label, and index on start_id and end_id of edge label. PR #2117 does that by default btw.

issue_2137=# SELECT * FROM cypher('test', $$EXPLAIN (costs off) MATCH p=(:NodeType1)-[:RelType1_NodeType1]->(:NodeType2) 
RETURN p
$$) AS (result agtype);
                                  QUERY PLAN                                  
------------------------------------------------------------------------------
 Hash Join
   Hash Cond: (_age_default_alias_1.end_id = _age_default_alias_2.id)
   ->  Hash Join
         Hash Cond: (_age_default_alias_1.start_id = _age_default_alias_0.id)
         ->  Seq Scan on "RelType1_NodeType1" _age_default_alias_1
         ->  Hash
               ->  Seq Scan on "NodeType1" _age_default_alias_0
   ->  Hash
         ->  Seq Scan on "NodeType2" _age_default_alias_2
(9 rows)
issue_2137=# CREATE UNIQUE INDEX idx_n1_id ON test."NodeType1" USING btree (id);
CREATE INDEX
issue_2137=# CREATE UNIQUE INDEX idx_n2_id ON test."NodeType2" USING btree (id);
CREATE INDEX
issue_2137=# CREATE INDEX idx_r1_id ON test."RelType1_NodeType1" USING btree (start_id, end_id);
CREATE INDEX
issue_2137=# SELECT * FROM cypher('test', $$EXPLAIN (costs off) MATCH p=(:NodeType1)-[:RelType1_NodeType1]->(:NodeType2) 
RETURN p
$$) AS (result agtype);
                                     QUERY PLAN                                      
-------------------------------------------------------------------------------------
 Hash Join
   Hash Cond: (_age_default_alias_1.end_id = _age_default_alias_2.id)
   ->  Merge Join
         Merge Cond: (_age_default_alias_0.id = _age_default_alias_1.start_id)
         ->  Index Scan using idx_n1_id on "NodeType1" _age_default_alias_0
         ->  Index Scan using idx_r1_id on "RelType1_NodeType1" _age_default_alias_1
   ->  Hash
         ->  Index Scan using idx_n2_id on "NodeType2" _age_default_alias_2
(8 rows)

Also, I would like to reiterate that querying undirected paths can really slow down the performance, so consider using directed paths in MATCH clauses wherever possible.

@neerajx86 The agtype_ functions are primarily intended for internal use. You can find a list of available functions in the documentation. The documentation is a bit outdated but covers most of the available functions.

@pritish-moharir
Copy link
Author

pritish-moharir commented Dec 4, 2024

Hey @MuhammadTahaNaveed thanks again for the reply! We were able to create these indexes and the explain output is showing them being used.

Also thanks for the suggestion about the directions, but for us, the data is in such a way that results and time taken with and without direction is the same.

One more doubt we had was that even though our query is showing the indexes in explain output the actual query is taking a lot of time to converge. But for the same query if we specify max hop to be 1, i.e., *1..1 the query is using sequential scan and converging a lot faster even though indexes exist. Can you help us understand the scenario here?

Thanks again for all your help !

@xuan2orange
Copy link

@pritish-moharir

You have to create index on the expression used by age to access a certain property.

issue_2137=# SELECT * FROM cypher('test', $$EXPLAIN (costs off) MATCH (n1:NodeType1)  
WHERE n1.name = 'node1'  
RETURN n1
$$) AS (result agtype);
                                              QUERY PLAN                                              
------------------------------------------------------------------------------------------------------
 Seq Scan on "NodeType1" n1
   Filter: (agtype_access_operator(VARIADIC ARRAY[properties, '"name"'::agtype]) = '"node1"'::agtype)
(2 rows)
issue_2137=# CREATE INDEX idx_btree_name
ON test."NodeType1"
USING btree (agtype_access_operator(VARIADIC ARRAY[properties, '"name"'::agtype]));
CREATE INDEX
issue_2137=# SELECT * FROM cypher('test', $$EXPLAIN (costs off) MATCH (n1:NodeType1)  
WHERE n1.name = 'node1'  
RETURN n1                                                                          
$$) AS (result agtype);
                                                QUERY PLAN                                                
----------------------------------------------------------------------------------------------------------
 Index Scan using idx_btree_name on "NodeType1" n1
   Index Cond: (agtype_access_operator(VARIADIC ARRAY[properties, '"name"'::agtype]) = '"node1"'::agtype)
(2 rows)

If you want to utilize gin index, you need to use the filter like MATCH (n {att1: 'value1'}), since containment operator in where clause is not supported as of now. Below is an example:

issue_2137=# CREATE INDEX idx_gin                                                        
ON test."NodeType1"
USING gin (properties);
CREATE INDEX
issue_2137=# SELECT * FROM cypher('test', $$EXPLAIN (costs off) MATCH (n1:NodeType1 {name: "Node1"}) 
RETURN n1
$$) AS (result agtype);
                           QUERY PLAN                            
-----------------------------------------------------------------
 Bitmap Heap Scan on "NodeType1" n1
   Recheck Cond: (properties @> '{"name": "Node1"}'::agtype)
   ->  Bitmap Index Scan on idx_gin
         Index Cond: (properties @> '{"name": "Node1"}'::agtype)
(4 rows)

Also, querying undirected paths can really slow down the performance, so consider using directed paths in MATCH clause wherever possible. I hope this helps.

I have created the btree index and it works.
But when i use 'LIMIT 500' to filter data, it doesn't work.
Here is my query:
create index node_name_idx on graph_name."MyNode" (ag_catalog.agtype_access_operator(properties,'"name"'));

EXPLAIN ANALYZE SELECT * FROM cypher('graph_name', $$ MATCH (v:MyNode) WHERE v.name = 'test' RETURN v SKIP 0 LIMIT 500 $$) as (V agtype);

Limit (cost=0.00..442.63 rows=500 width=32) (actual time=965.468..970.870 rows=500 loops=1) -> Seq Scan on "MyNode" "v" (cost=0.00..50239.95 rows=56751 width=32) (actual time=965.465..970.816 rows=500 loops=1) Filter: (agtype_access_operator(VARIADIC ARRAY[properties, '"name"'::agtype]) = '"test"'::agtype) Rows Removed by Filter: 326964 Planning Time: 0.280 ms Execution Time: 970.948 ms

Thanks!

@tronper123
Copy link

When you did the LIMIT test, did you turn off sequential scans? If you didn't the reason might be this: https://stackoverflow.com/questions/8566931/index-not-used-when-limit-is-used-in-postgres

@Jaczk
Copy link

Jaczk commented Dec 8, 2024

What do u prefer, using gin index or btree on node lable?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
question Further information is requested
Projects
None yet
Development

No branches or pull requests

6 participants