
listdir/readdir SQL may be improved #5417

Open
frostwind opened this issue Dec 23, 2024 · 22 comments
Labels
kind/feature New feature or request

Comments


frostwind commented Dec 23, 2024

What would you like to be added:
When a client calls readdir, or os.listdir in Python, JuiceFS issues the following query against the metadata DB:

SELECT * FROM "jfs_edge" INNER JOIN "jfs_node" ON jfs_edge.inode=jfs_node.inode WHERE "jfs_edge"."parent"=$1

This query can consume significant DB CPU in some scenarios. For example, one of our workloads calls it repeatedly against a directory with 6.9k symlinks, and under high load it accounts for about 90% of the CPU across all JuiceFS queries, with an average cost of 100 ms.
Since both jfs_edge and jfs_node have a "parent" column, it is possible to create a covering index on jfs_node's parent column and rewrite the query, e.g. in PostgreSQL:

create index jfs_node_multi on jfs_node (parent) include (inode,type,flags,mode,uid,gid,atime,mtime,ctime,atimensec,mtimensec,ctimensec,nlink,length,rdev,access_acl_id,default_acl_id);
### original query and without jfs_node_multi index
test=# explain (analyze,buffers,timing) Select * FROM  "jfs_edge" INNER JOIN "jfs_node" ON jfs_edge.inode=jfs_node.inode WHERE "jfs_edge"."parent"=550511;
                                                               QUERY PLAN                                                               
----------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=10.60..2542.91 rows=209 width=152) (actual time=0.234..7.137 rows=6927 loops=1)
   Buffers: shared hit=27868
   ->  Bitmap Heap Scan on jfs_edge  (cost=10.17..789.91 rows=209 width=56) (actual time=0.220..0.704 rows=6927 loops=1)
         Recheck Cond: (parent = 550511)
         Heap Blocks: exact=99
         Buffers: shared hit=160
         ->  Bitmap Index Scan on "UQE_jfs_edge_edge"  (cost=0.00..10.12 rows=209 width=0) (actual time=0.206..0.206 rows=6927 loops=1)
               Index Cond: (parent = 550511)
               Buffers: shared hit=61
   ->  Index Scan using jfs_node_pkey on jfs_node  (cost=0.43..8.39 rows=1 width=96) (actual time=0.001..0.001 rows=1 loops=6927)
         Index Cond: (inode = jfs_edge.inode)
         Buffers: shared hit=27708
 Planning:
   Buffers: shared hit=16
 Planning Time: 0.169 ms
 Execution Time: 7.312 ms
(16 rows)

### rewritten query and with jfs_node_multi index
test=# Explain (analyze,buffers,timing)   
SELECT *  FROM  "jfs_edge" INNER JOIN "jfs_node" ON jfs_edge.inode=jfs_node.inode WHERE "jfs_edge"."parent"=550511 and 
 "jfs_edge"."parent" = "jfs_node"."parent" ;
                                                                   QUERY PLAN                                                                    
-------------------------------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=32.96..813.25 rows=1 width=152) (actual time=1.928..3.478 rows=6927 loops=1)
   Hash Cond: (jfs_edge.inode = jfs_node.inode)
   Buffers: shared hit=274
   ->  Bitmap Heap Scan on jfs_edge  (cost=10.17..789.91 rows=209 width=56) (actual time=0.190..0.607 rows=6927 loops=1)
         Recheck Cond: (parent = 550511)
         Heap Blocks: exact=99
         Buffers: shared hit=160
         ->  Bitmap Index Scan on "UQE_jfs_edge_edge"  (cost=0.00..10.12 rows=209 width=0) (actual time=0.178..0.179 rows=6927 loops=1)
               Index Cond: (parent = 550511)
               Buffers: shared hit=61
   ->  Hash  (cost=20.14..20.14 rows=212 width=96) (actual time=1.733..1.734 rows=6927 loops=1)
         Buckets: 8192 (originally 1024)  Batches: 1 (originally 1)  Memory Usage: 984kB
         Buffers: shared hit=114
         ->  Index Only Scan using jfs_node_multi on jfs_node  (cost=0.43..20.14 rows=212 width=96) (actual time=0.008..0.869 rows=6927 loops=1)
               Index Cond: (parent = 550511)
               Heap Fetches: 0
               Buffers: shared hit=114
 Planning:
   Buffers: shared hit=16
 Planning Time: 0.167 ms
 Execution Time: 3.645 ms
(21 rows)

Running pgbench with the old SQL shows 334 tps and uses 80% of the CPU on a 48-core DB server.

-bash-4.2$ cat old.sql 
Select * FROM  "jfs_edge" INNER JOIN "jfs_node" ON jfs_edge.inode=jfs_node.inode WHERE "jfs_edge"."parent"=550511;
-bash-4.2$ /usr/pgsql-13/bin/pgbench  -f old.sql test -j 100 -c 40 -P 2 -T 30 -U jfs_admin -h jfs_metadb
starting vacuum...end.
progress: 2.0 s, 295.6 tps, lat 130.809 ms stddev 47.563
progress: 4.0 s, 338.1 tps, lat 118.483 ms stddev 16.345
progress: 6.0 s, 336.2 tps, lat 118.398 ms stddev 15.633
progress: 8.0 s, 337.8 tps, lat 118.040 ms stddev 15.358
progress: 10.1 s, 340.4 tps, lat 118.434 ms stddev 15.288
progress: 12.0 s, 335.4 tps, lat 118.459 ms stddev 15.438
progress: 14.0 s, 337.0 tps, lat 119.277 ms stddev 14.964
progress: 16.0 s, 336.0 tps, lat 118.816 ms stddev 14.577
progress: 18.0 s, 334.0 tps, lat 119.072 ms stddev 15.375
progress: 20.0 s, 336.6 tps, lat 119.147 ms stddev 17.556
progress: 22.0 s, 338.4 tps, lat 119.154 ms stddev 16.511
progress: 24.0 s, 334.1 tps, lat 119.312 ms stddev 14.115
progress: 26.1 s, 335.2 tps, lat 119.278 ms stddev 15.041
progress: 28.0 s, 334.2 tps, lat 119.526 ms stddev 14.593
progress: 30.0 s, 334.0 tps, lat 119.341 ms stddev 14.609
transaction type: old.sql
scaling factor: 1
query mode: simple
number of clients: 40
number of threads: 40
duration: 30 s
number of transactions actually processed: 10045
latency average = 119.586 ms
latency stddev = 19.168 ms
tps = 334.007732 (including connections establishing)
tps = 334.113489 (excluding connections establishing)

Running pgbench with the new SQL and the new index shows 522 tps and uses only 10% of the CPU on the same 48-core DB server.

-bash-4.2$ cat new.sql 
Select  *  FROM  "jfs_edge" INNER JOIN "jfs_node" ON jfs_edge.inode=jfs_node.inode WHERE "jfs_edge"."parent"=550511 and 
 "jfs_edge"."parent" = "jfs_node"."parent" ;
-bash-4.2$ /usr/pgsql-13/bin/pgbench  -f new.sql test -j 100 -c 40 -P 2 -T 30 -U jfs_admin -h jfs_metadb
starting vacuum...end.
progress: 2.1 s, 478.0 tps, lat 81.521 ms stddev 49.978
progress: 4.1 s, 526.6 tps, lat 76.203 ms stddev 27.147
progress: 6.2 s, 533.0 tps, lat 74.711 ms stddev 25.622
progress: 8.1 s, 511.2 tps, lat 78.468 ms stddev 26.413
progress: 10.2 s, 539.5 tps, lat 74.332 ms stddev 27.289
progress: 12.2 s, 535.1 tps, lat 74.372 ms stddev 27.332
progress: 14.0 s, 527.0 tps, lat 76.630 ms stddev 26.517
progress: 16.1 s, 535.5 tps, lat 74.233 ms stddev 25.937
progress: 18.1 s, 538.5 tps, lat 74.350 ms stddev 25.623
progress: 20.2 s, 527.6 tps, lat 75.795 ms stddev 28.760
progress: 22.1 s, 530.9 tps, lat 74.661 ms stddev 27.041
progress: 24.1 s, 489.5 tps, lat 81.891 ms stddev 26.591
progress: 26.1 s, 534.0 tps, lat 75.319 ms stddev 27.108
progress: 28.2 s, 504.6 tps, lat 78.851 ms stddev 28.382
progress: 30.1 s, 527.6 tps, lat 76.510 ms stddev 25.344
transaction type: new.sql
scaling factor: 1
query mode: simple
number of clients: 40
number of threads: 40
duration: 30 s
number of transactions actually processed: 15709
latency average = 76.420 ms
latency stddev = 28.942 ms
tps = 522.584279 (including connections establishing)
tps = 522.803386 (excluding connections establishing)

In PostgreSQL, a covering index may degrade when vacuum does not run in a timely manner, because tuple visibility then has to be re-checked against the heap; with vacuum configured properly, the chance of hitting such degradation should be minimal.
MySQL has no INCLUDE syntax, but creating a regular index containing all the necessary columns achieves the same goal.
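For example (a sketch only; the column list should match the actual jfs_node schema):

create index jfs_node_multi on jfs_node (parent, inode, type, flags, mode, uid, gid, atime, mtime, ctime, atimensec, mtimensec, ctimensec, nlink, length, rdev, access_acl_id, default_acl_id);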

Why is this needed:

frostwind added the kind/feature (New feature or request) label on Dec 23, 2024
frostwind (Author) commented Dec 23, 2024

We may also need to consider rows where jfs_edge.parent != jfs_node.parent. Are those hard links?

test=# select count(*) from jfs_node jn,jfs_edge je where jn.inode=je.inode;
  count  
---------
 2283227
(1 row)

test=# select count(*) from jfs_node jn,jfs_edge je where jn.inode=je.inode and jn.parent=je.parent;
  count  
---------
 2283219
(1 row)

test=# select jn.inode , jn.parent,je.parent from jfs_node jn,jfs_edge je where jn.inode=je.inode and jn.parent!=je.parent;
  inode  | parent | parent  
---------+--------+---------
      40 |      0 |      34
      41 |      0 |      34
    1064 |      0 |    1058
 3699420 |      0 | 1938494
    1065 |      0 |    1058
  110140 |      0 |  110134
  110141 |      0 |  110134
 3699421 |      0 | 1938494
(8 rows)

Considering the existence of hard links, we can modify the query to:

test=# Explain (analyze,buffers,timing)  Select  *  FROM  "jfs_edge" INNER JOIN "jfs_node" ON jfs_edge.inode=jfs_node.inode WHERE "jfs_edge"."parent"= 550511 and 
 "jfs_edge"."parent" = "jfs_node"."parent" 
 Union all
 Select  *  FROM  "jfs_edge" INNER JOIN "jfs_node" ON jfs_edge.inode=jfs_node.inode WHERE "jfs_edge"."parent"= 550511 and 
 "jfs_node"."parent"=0 ;
                                                                          QUERY PLAN                                                                           
---------------------------------------------------------------------------------------------------------------------------------------------------------------
 Append  (cost=793.07..2922.58 rows=2 width=152) (actual time=1.108..3.746 rows=6927 loops=1)
   Buffers: shared hit=436
   ->  Hash Join  (cost=793.07..2093.05 rows=1 width=152) (actual time=1.107..2.663 rows=6927 loops=1)
         Hash Cond: (jfs_node.inode = jfs_edge.inode)
         Buffers: shared hit=271
         ->  Index Only Scan using jfs_node_multi on jfs_node  (cost=0.55..1268.76 rows=8469 width=96) (actual time=0.018..0.680 rows=6927 loops=1)
               Index Cond: (parent = 550511)
               Heap Fetches: 0
               Buffers: shared hit=111
         ->  Hash  (cost=789.91..789.91 rows=209 width=56) (actual time=1.079..1.080 rows=6927 loops=1)
               Buckets: 8192 (originally 1024)  Batches: 1 (originally 1)  Memory Usage: 741kB
               Buffers: shared hit=160
               ->  Bitmap Heap Scan on jfs_edge  (cost=10.17..789.91 rows=209 width=56) (actual time=0.197..0.610 rows=6927 loops=1)
                     Recheck Cond: (parent = 550511)
                     Heap Blocks: exact=99
                     Buffers: shared hit=160
                     ->  Bitmap Index Scan on "UQE_jfs_edge_edge"  (cost=0.00..10.12 rows=209 width=0) (actual time=0.184..0.184 rows=6927 loops=1)
                           Index Cond: (parent = 550511)
                           Buffers: shared hit=61
   ->  Hash Join  (cost=49.23..829.52 rows=1 width=152) (actual time=0.811..0.811 rows=0 loops=1)
         Hash Cond: (jfs_edge_1.inode = jfs_node_1.inode)
         Buffers: shared hit=165
         ->  Bitmap Heap Scan on jfs_edge jfs_edge_1  (cost=10.17..789.91 rows=209 width=56) (actual time=0.134..0.457 rows=6927 loops=1)
               Recheck Cond: (parent = 550511)
               Heap Blocks: exact=99
               Buffers: shared hit=160
               ->  Bitmap Index Scan on "UQE_jfs_edge_edge"  (cost=0.00..10.12 rows=209 width=0) (actual time=0.124..0.124 rows=6927 loops=1)
                     Index Cond: (parent = 550511)
                     Buffers: shared hit=61
         ->  Hash  (cost=36.35..36.35 rows=217 width=96) (actual time=0.012..0.012 rows=6 loops=1)
               Buckets: 1024  Batches: 1  Memory Usage: 9kB
               Buffers: shared hit=5
               ->  Index Only Scan using jfs_node_multi on jfs_node jfs_node_1  (cost=0.55..36.35 rows=217 width=96) (actual time=0.009..0.010 rows=6 loops=1)
                     Index Cond: (parent = 0)
                     Heap Fetches: 0
                     Buffers: shared hit=5
 Planning:
   Buffers: shared hit=32
 Planning Time: 0.308 ms
 Execution Time: 3.924 ms
(40 rows)

pgbench shows about 10% CPU usage:

-bash-4.2$ /usr/pgsql-13/bin/pgbench  -f test.sql  test -j 100 -c 40 -P 2 -T 30 -U jfs_admin -h jfs_metadb
starting vacuum...end.
progress: 2.0 s, 413.1 tps, lat 94.024 ms stddev 70.115
progress: 4.3 s, 512.9 tps, lat 78.322 ms stddev 29.543
progress: 6.1 s, 499.0 tps, lat 79.782 ms stddev 29.018
progress: 8.2 s, 509.7 tps, lat 78.812 ms stddev 28.481
progress: 10.0 s, 516.2 tps, lat 76.803 ms stddev 26.899
progress: 12.1 s, 503.9 tps, lat 79.847 ms stddev 28.933
progress: 14.1 s, 493.0 tps, lat 80.769 ms stddev 28.313
progress: 16.2 s, 505.2 tps, lat 79.551 ms stddev 27.584
progress: 18.2 s, 492.8 tps, lat 81.353 ms stddev 28.995
progress: 20.1 s, 491.7 tps, lat 80.742 ms stddev 29.170
progress: 22.2 s, 506.6 tps, lat 79.189 ms stddev 30.572
progress: 24.0 s, 503.0 tps, lat 80.402 ms stddev 27.035
progress: 26.2 s, 517.5 tps, lat 76.720 ms stddev 30.270
progress: 28.3 s, 520.4 tps, lat 76.922 ms stddev 29.414
progress: 30.0 s, 510.1 tps, lat 78.683 ms stddev 27.913
transaction type: test.sql
scaling factor: 1
query mode: simple
number of clients: 40
number of threads: 40
duration: 30 s
number of transactions actually processed: 15022
latency average = 79.901 ms
latency stddev = 32.757 ms
tps = 499.893043 (including connections establishing)
tps = 500.118795 (excluding connections establishing)

We also need to be careful on a file system with many hard links, which would degrade the parent=0 subquery. Let me double-check its performance.

frostwind (Author) commented Dec 23, 2024

Considering hard links, we can use the query below (selecting only jfs_edge.name, since the JuiceFS main branch already seems to select only jfs_edge.name instead of jfs_edge.*).

test=# Explain (analyze,buffers,timing)  Select  "jfs_edge".name,"jfs_node".*  FROM  "jfs_edge" INNER JOIN "jfs_node" ON jfs_edge.inode=jfs_node.inode WHERE "jfs_edge"."parent"= 550511 and 
 "jfs_edge"."parent" = "jfs_node"."parent" 
 Union all
 Select "jfs_edge".name,"jfs_node".* from "jfs_edge" , "jfs_node"  where "jfs_edge".parent= 550511 and "jfs_edge".inode="jfs_node".inode and "jfs_node".parent=0 ;
                                                                           QUERY PLAN                                                                           
----------------------------------------------------------------------------------------------------------------------------------------------------------------
 Append  (cost=0.86..3918.23 rows=5 width=129) (actual time=0.023..3.754 rows=6927 loops=1)
   Buffers: shared hit=1043
   ->  Merge Join  (cost=0.86..1312.01 rows=1 width=129) (actual time=0.022..2.645 rows=6927 loops=1)
         Merge Cond: (jfs_edge.inode = jfs_node.inode)
         Buffers: shared hit=582
         ->  Index Scan using jfs_edge_parent_inode on jfs_edge  (cost=0.43..1290.98 rows=553 width=49) (actual time=0.011..0.727 rows=6927 loops=1)
               Index Cond: (parent = 550511)
               Buffers: shared hit=457
         ->  Index Only Scan using jfs_node_multi on jfs_node  (cost=0.43..19.44 rows=172 width=96) (actual time=0.008..0.828 rows=6927 loops=1)
               Index Cond: (parent = 550511)
               Heap Fetches: 133
               Buffers: shared hit=125
   ->  Merge Join  (cost=1.03..2606.19 rows=4 width=129) (actual time=0.834..0.835 rows=0 loops=1)
         Merge Cond: (jfs_edge_1.inode = jfs_node_1.inode)
         Buffers: shared hit=461
         ->  Index Scan using jfs_edge_parent_inode on jfs_edge jfs_edge_1  (cost=0.43..1290.98 rows=553 width=41) (actual time=0.006..0.543 rows=6927 loops=1)
               Index Cond: (parent = 550511)
               Buffers: shared hit=457
         ->  Index Only Scan using jfs_node_multi on jfs_node jfs_node_1  (cost=0.43..1288.56 rows=15397 width=96) (actual time=0.007..0.008 rows=7 loops=1)
               Index Cond: (parent = 0)
               Heap Fetches: 0
               Buffers: shared hit=4
 Planning:
   Buffers: shared hit=32
 Planning Time: 0.308 ms
 Execution Time: 3.927 ms
(26 rows)



The following indexes are also needed:

create index jfs_node_multi on jfs_node (parent, inode) include (type,flags,mode,uid,gid,atime,mtime,ctime,atimensec,mtimensec,ctimensec,nlink,length,rdev,access_acl_id,default_acl_id);

Create  index  jfs_edge_parent_inode_name on jfs_edge (parent,inode,name) ;

If the database does not support covering indexes, the jfs_node_multi index can be converted to a normal index with all columns, or to one on just (parent, inode) from jfs_node; even that still removes a lot of buffer gets:

create index jfs_node_multi on jfs_node (parent, inode);

In the previous test, I copied a directory containing 14k hard links.

test=# select count(*) from jfs_node where parent=0;
 count 
-------
 14165
(1 row)

Then I used the script below to generate 1 million hard links to the same file. The query plan and buffer gets remain stable.

touch file
for x in {1..1000}; do echo $x; mkdir $x; for y in {1..1000}; do ln file $x/$y; done; done

frostwind (Author) commented Dec 24, 2024

Ran another test: create 10k files, then create 1000 directories, each containing 10k hard links pointing to the 10k files created before.

# cat run.sh
for x in {1..10000};do echo $x;echo "hehe" > file_$x; done
for y in {1..1000}; do echo $y; mkdir -p dir_$y; for x in {1..10000}; do ln file_$x dir_$y/$x; done; done

JuiceFS seems to use prepared statements, so I tested with a prepared statement. The query plan stays stable at a few hundred to 3000 buffer gets. I also set aggressive vacuum parameters in PostgreSQL 17 so that vacuum runs on jfs_node more frequently, to avoid wasting too many buffer gets on "Heap Fetches".

PREPARE readdir(bigint) AS
Select  "jfs_edge".name,"jfs_node".*  FROM  "jfs_edge" INNER JOIN "jfs_node" ON jfs_edge.inode=jfs_node.inode WHERE "jfs_edge"."parent"= $1 and 
 "jfs_edge"."parent" = "jfs_node"."parent" 
 Union all
 Select "jfs_edge".name,"jfs_node".* from "jfs_edge" , "jfs_node"  where "jfs_edge".parent= $2 and "jfs_edge".inode="jfs_node".inode and "jfs_node".parent=0 ; 
Explain (analyze,buffers,timing)  EXECUTE readdir(6605868, 6605868);
                                                                                QUERY PLAN                                                                                 
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Append  (cost=0.99..2501.29 rows=35 width=108) (actual time=0.941..8.182 rows=11002 loops=1)
   Buffers: shared hit=2614
   ->  Merge Join  (cost=0.99..217.61 rows=1 width=108) (actual time=0.940..1.875 rows=1002 loops=1)
         Merge Cond: (jfs_edge.inode = jfs_node.inode)
         Buffers: shared hit=786
         ->  Index Only Scan using jfs_edge_parent_inode_name on jfs_edge  (cost=0.56..188.86 rows=3674 width=28) (actual time=0.012..0.752 rows=11002 loops=1)
               Index Cond: (parent = '6605868'::bigint)
               Heap Fetches: 0
               Buffers: shared hit=84
         ->  Index Only Scan using jfs_node_multi on jfs_node  (cost=0.43..19.37 rows=168 width=96) (actual time=0.008..0.695 rows=1002 loops=1)
               Index Cond: (parent = '6605868'::bigint)
               Heap Fetches: 0
               Buffers: shared hit=702
   ->  Merge Join  (cost=1.09..2283.50 rows=34 width=108) (actual time=1.828..5.868 rows=10000 loops=1)
         Merge Cond: (jfs_edge_1.inode = jfs_node_1.inode)
         Buffers: shared hit=1828
         ->  Index Only Scan using jfs_edge_parent_inode_name on jfs_edge jfs_edge_1  (cost=0.56..188.86 rows=3674 width=20) (actual time=0.007..0.658 rows=10001 loops=1)
               Index Cond: (parent = '6605868'::bigint)
               Heap Fetches: 0
               Buffers: shared hit=66
         ->  Index Only Scan using jfs_node_multi on jfs_node jfs_node_1  (cost=0.43..2051.47 rows=21888 width=96) (actual time=0.009..3.447 rows=24166 loops=1)
               Index Cond: (parent = 0)
               Heap Fetches: 0
               Buffers: shared hit=1762
 Planning:
   Buffers: shared hit=32
 Planning Time: 0.323 ms
 Execution Time: 8.438 ms
(28 rows)


To keep jfs_node_multi from generating too many "heap fetches" in PostgreSQL, I am using the settings below to make autovacuum run more often:

autovacuum_naptime = 10s
autovacuum_vacuum_threshold = 50 # default
autovacuum_vacuum_insert_threshold = 1000 # default
autovacuum_vacuum_scale_factor = 0 
autovacuum_vacuum_insert_scale_factor = 0
autovacuum_analyze_scale_factor = 0.01

I also set up a cron job running vacuum with "(INDEX_CLEANUP on)" so that dead tuples in the index get cleaned up:

* * * * * for x in `psql  -t -c "select datname from pg_database where datname like 'jfs%' "`;do echo $x; psql  $x -c "vacuum (INDEX_CLEANUP on)"  ;done

A simpler way is to enable the setting at the table level; we probably only need vacuum_index_cleanup for jfs_node:

# alter table jfs_node SET (vacuum_index_cleanup = on);
ALTER TABLE

Anyway, the special vacuum settings above are only a pain for PostgreSQL users. If JuiceFS can implement the query rewrite with the proper index setup, and provide the vacuum guidance above to PostgreSQL users, that would be good enough.
E.g. in MySQL we would use a covering index:

create index jfs_node_multi on jfs_node (parent, inode, type,flags,mode,uid,gid,atime,mtime,ctime,atimensec,mtimensec,ctimensec,nlink,length,rdev,access_acl_id,default_acl_id) ;

In PostgreSQL we would use an INCLUDE index:

create index jfs_node_multi on jfs_node (parent, inode) include (type,flags,mode,uid,gid,atime,mtime,ctime,atimensec,mtimensec,ctimensec,nlink,length,rdev,access_acl_id,default_acl_id);

If maintaining different index syntax for different databases is too much trouble, we can simply create the MySQL-style covering index and let PostgreSQL users decide whether to convert it to an INCLUDE index.
For MySQL, the primary key could instead be created on jfs_node (parent, inode), together with a unique index on jfs_node (inode). Since InnoDB tables are index-organized, this lets us use the clustering of the primary key, so a covering index with all jfs_node columns is not needed; the (parent, inode) primary key serves the same purpose, as sketched below.
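A sketch of the DDL this implies (assuming the current primary key is on inode alone):

alter table jfs_node add unique key (inode);
alter table jfs_node drop primary key;
alter table jfs_node add primary key (parent, inode);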

jiefenghuang (Contributor) commented:
Thank you for sharing the profile results. We will delve deeper and improve this.

anysql (Contributor) commented Dec 26, 2024

There is a problem with your query: a hard link may be created in a different directory, so it has a different parent node.

frostwind (Author) commented:
@anysql
In a later comment I provided another version that takes hard links into account. The second branch of the UNION ALL queries the hard links:

Select  "jfs_edge".name,"jfs_node".*  FROM  "jfs_edge" INNER JOIN "jfs_node" ON jfs_edge.inode=jfs_node.inode WHERE "jfs_edge"."parent"= 550511 and 
 "jfs_edge"."parent" = "jfs_node"."parent" 
 Union all
 Select "jfs_edge".name,"jfs_node".* from "jfs_edge" , "jfs_node"  where "jfs_edge".parent= 550511 and "jfs_edge".inode="jfs_node".inode and "jfs_node".parent=0 ;

anysql (Contributor) commented Dec 27, 2024

Inode is the primary key of the jfs_node table, so a nested loop join should be enough.

The high cost is due to the merge join being used; maybe you should analyze your tables?

frostwind (Author) commented Dec 27, 2024

@anysql
Table stats are up to date. The merge join is not the culprit for the slowness. The original query plan shows

Buffers: shared hit=27868

That is costly: about 27k buffer gets * 8 KB block size = 216 MB of data scanned, which consumes about 30 ms of DB CPU when the system is idle and over 100 ms when it is busy. pgbench shows the original query only reaches 300+ TPS at 80% DB CPU usage.
With the additional index and the query rewrite, buffer gets drop from 27k to about 1k, which is a big improvement, and pgbench shows 500+ TPS after tuning at only 10% DB CPU usage.

anysql (Contributor) commented Dec 27, 2024

Can you check the cost of a nested loop join in PG?

anysql (Contributor) commented Dec 27, 2024

ListDir and ReadDir usually read in batches; the batch size is limited by FUSE and may be only a few KB.

frostwind (Author) commented Dec 27, 2024

Actually, the plan with the merge join is the good plan, produced after rewriting the query and adding the indexes. The original slow query does not use a merge join. Let me copy-paste them here to make it clear:

### original slow query
test=# explain (analyze,buffers,timing) Select * FROM  "jfs_edge" INNER JOIN "jfs_node" ON jfs_edge.inode=jfs_node.inode WHERE "jfs_edge"."parent"=550511;
                                                               QUERY PLAN                                                               
----------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=10.60..2542.91 rows=209 width=152) (actual time=0.234..7.137 rows=6927 loops=1)
   Buffers: shared hit=27868
   ->  Bitmap Heap Scan on jfs_edge  (cost=10.17..789.91 rows=209 width=56) (actual time=0.220..0.704 rows=6927 loops=1)
         Recheck Cond: (parent = 550511)
         Heap Blocks: exact=99
         Buffers: shared hit=160
         ->  Bitmap Index Scan on "UQE_jfs_edge_edge"  (cost=0.00..10.12 rows=209 width=0) (actual time=0.206..0.206 rows=6927 loops=1)
               Index Cond: (parent = 550511)
               Buffers: shared hit=61
   ->  Index Scan using jfs_node_pkey on jfs_node  (cost=0.43..8.39 rows=1 width=96) (actual time=0.001..0.001 rows=1 loops=6927)
         Index Cond: (inode = jfs_edge.inode)
         Buffers: shared hit=27708
 Planning:
   Buffers: shared hit=16
 Planning Time: 0.169 ms
 Execution Time: 7.312 ms
(16 rows)


### Create two additional  indexes
## index 1
create index jfs_node_multi on jfs_node (parent, inode) include (type,flags,mode,uid,gid,atime,mtime,ctime,atimensec,mtimensec,ctimensec,nlink,length,rdev,access_acl_id,default_acl_id);
## index 2
Create  index  jfs_edge_parent_inode_name on jfs_edge (parent,inode,name) ;

### Rewritten query using the two additional indexes above
PREPARE readdir(bigint) AS
Select  "jfs_edge".name,"jfs_node".*  FROM  "jfs_edge" INNER JOIN "jfs_node" ON jfs_edge.inode=jfs_node.inode WHERE "jfs_edge"."parent"= $1 and 
 "jfs_edge"."parent" = "jfs_node"."parent" 
 Union all
 Select "jfs_edge".name,"jfs_node".* from "jfs_edge" , "jfs_node"  where "jfs_edge".parent= $2 and "jfs_edge".inode="jfs_node".inode and "jfs_node".parent=0 ; 
Explain (analyze,buffers,timing)  EXECUTE readdir(550511, 550511);

                                                                               QUERY PLAN                                                                                
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Append  (cost=0.99..2990.92 rows=41 width=108) (actual time=0.095..4.179 rows=6927 loops=1)
   Buffers: shared hit=261
   ->  Merge Join  (cost=0.99..327.73 rows=1 width=108) (actual time=0.095..3.235 rows=6927 loops=1)
         Merge Cond: (jfs_edge.inode = jfs_node.inode)
         Buffers: shared hit=184
         ->  Index Only Scan using jfs_edge_parent_inode_name on jfs_edge  (cost=0.56..294.84 rows=3762 width=28) (actual time=0.052..0.915 rows=6927 loops=1)
               Index Cond: (parent = '550511'::bigint)
               Heap Fetches: 0
               Buffers: shared hit=70
         ->  Index Only Scan using jfs_node_multi on jfs_node  (cost=0.43..23.34 rows=166 width=96) (actual time=0.040..1.371 rows=6927 loops=1)
               Index Cond: (parent = '550511'::bigint)
               Heap Fetches: 0
               Buffers: shared hit=114
   ->  Merge Join  (cost=1.09..2662.99 rows=40 width=108) (actual time=0.665..0.665 rows=0 loops=1)
         Merge Cond: (jfs_edge_1.inode = jfs_node_1.inode)
         Buffers: shared hit=77
         ->  Index Only Scan using jfs_edge_parent_inode_name on jfs_edge jfs_edge_1  (cost=0.56..294.84 rows=3762 width=20) (actual time=0.010..0.427 rows=6927 loops=1)
               Index Cond: (parent = '550511'::bigint)
               Heap Fetches: 0
               Buffers: shared hit=70
         ->  Index Only Scan using jfs_node_multi on jfs_node jfs_node_1  (cost=0.43..2320.00 rows=24847 width=96) (actual time=0.036..0.037 rows=7 loops=1)
               Index Cond: (parent = 0)
               Heap Fetches: 0
               Buffers: shared hit=7
 Planning:
   Buffers: shared hit=436
 Planning Time: 1.485 ms
 Execution Time: 4.372 ms
(28 rows)


This is a common problem for RDBMSs; I expect MySQL would show similar benchmark results.

anysql (Contributor) commented Dec 27, 2024

Since both jfs_edge and jfs_node have only a few columns, creating an index with many columns will slow down update operations. For PG it may also run into vacuum performance problems.

frostwind (Author) commented Dec 27, 2024

I understand the concern that a wide index may slow down updates. In practice we use PG 17 and have not seen vacuum cause much trouble in our workload; vacuum speed and performance have improved a lot since PG 11.
The goal of this ticket is to address the read amplification on the jfs_node table. Without further optimization, even when the user has no hard links, readdir still pays about 4 buffer gets per row to read jfs_node. To address this we can use an INCLUDE index, available since PG 11, which takes less effort than a fully covering index. For MySQL, we can change the jfs_node primary key to (parent, inode) for a better clustering factor on this query, plus an additional unique index on (inode).
Basically jfs_edge only sees insert/delete, no updates, while jfs_node sees far more updates than inserts and deletes. Such a DDL change should not add much overhead for MySQL users, since only (parent, inode) goes into the primary key, not all the other columns.
I also noticed that in the JuiceFS main branch, the slow query above has been replaced by two queries:

SELECT "id" FROM "jfs_edge" WHERE (parent = $1) LIMIT 40960
SELECT "jfs_edge"."name", "jfs_node".* FROM "jfs_edge" INNER JOIN "jfs_node" ON jfs_edge.inode=jfs_node.inode WHERE "jfs_edge"."id" IN ($1,$2,$3,$4,$5,$6....,$6925,$6926,$6927) ORDER BY jfs_edge.name

The second query contains a long IN-list with 6927 items (exactly the size of my directory) and costs 24k buffer gets. Its plan is similar to the old query's; the only benefit is selecting just jfs_edge.name, which saves some buffer gets, but jfs_node still contributes most of them. In some PG versions such a long IN-list can cause memory leaks and slow query parsing. From a DBA's point of view, we should be cautious about the potential performance hit of such long IN-list queries.
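For reference, one way to avoid a very long IN-list in PostgreSQL is to bind the edge ids as a single array parameter (just a sketch, not tested here; it only reduces the parsing/placeholder overhead, not the buffer gets on jfs_node):

PREPARE readdir_by_ids(bigint[]) AS
SELECT "jfs_edge"."name", "jfs_node".* FROM "jfs_edge" INNER JOIN "jfs_node" ON jfs_edge.inode=jfs_node.inode WHERE "jfs_edge"."id" = ANY($1) ORDER BY jfs_edge.name;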
I understand such a change needs to be evaluated carefully against the wide range of database installations in the community, but it would be much appreciated if readdir could be optimized further in some way.

Update: MySQL's Batched Key Access feature may reduce the read amplification, but I haven't tested it yet; I only have a Postgres environment.

polyrabbit (Contributor) commented:
+1 for this performance issue; this join shows up a lot in our slow query log.

Another issue is that the SQL engine generates a large number of connections. We have a cluster of ~10 clients, which results in tens of thousands of connections to the database. And most of them are short-lived connections.

frostwind (Author) commented:
The pgbench test of readdir against a directory with 6.9k symlinks was run on a bare-metal DB with 48 CPU cores; it consumes 80% of all 48 cores yet only reaches 300+ TPS.

anysql (Contributor) commented Dec 30, 2024

Please include the inode column in the "UQE_jfs_edge_edge" index, or add a new index on (parent, inode).

The INCLUDE syntax is not supported by all databases, so you would need to handle that yourself.
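For example (a sketch; the index name is illustrative):

create index jfs_edge_parent_inode on jfs_edge (parent, inode);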

Can you benchmark the following query with a merge join?

SELECT * FROM "jfs_edge" INNER JOIN "jfs_node" ON jfs_edge.inode=jfs_node.inode
WHERE "jfs_edge"."parent"=$1
and jfs_node.inode >= (select min(inode) from jfs_edge where parent = $1)
and jfs_node.inode <= (select max(inode) from jfs_edge where parent = $1)

frostwind (Author) commented Dec 30, 2024

With the two indexes below:

Create  index  jfs_edge_parent_inode_name on jfs_edge (parent,inode,name) ;
create index jfs_node_multi on jfs_node (parent,inode) include (type,flags,mode,uid,gid,atime,mtime,ctime,atimensec,mtimensec,ctimensec,nlink,length,rdev,access_acl_id,default_acl_id);

The query performance is still not good if we use

SELECT * FROM "jfs_edge" INNER JOIN "jfs_node" ON jfs_edge.inode=jfs_node.inode
WHERE "jfs_edge"."parent"=$1
and jfs_node.inode >= (select min(inode) from jfs_edge where parent = $1)
and jfs_node.inode <= (select max(inode) from jfs_edge where parent = $1)
                                                                                      QUERY PLAN                                                                                        
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=1102.13..24794.05 rows=18 width=136) (actual time=1.034..10.578 rows=6927 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   Buffers: shared hit=27879
   InitPlan 2
     ->  Result  (cost=0.60..0.61 rows=1 width=8) (actual time=0.040..0.041 rows=1 loops=1)
           Buffers: shared hit=5
           InitPlan 1
             ->  Limit  (cost=0.56..0.60 rows=1 width=8) (actual time=0.039..0.039 rows=1 loops=1)
                   Buffers: shared hit=5
                   ->  Index Only Scan using jfs_edge_parent_inode_name on jfs_edge jfs_edge_1  (cost=0.56..155.59 rows=3602 width=8) (actual time=0.038..0.038 rows=1 loops=1)
                         Index Cond: (parent = '550511'::bigint)
                         Heap Fetches: 0
                         Buffers: shared hit=5
   InitPlan 4
     ->  Result  (cost=0.60..0.61 rows=1 width=8) (actual time=0.018..0.019 rows=1 loops=1)
           Buffers: shared hit=5
           InitPlan 3
             ->  Limit  (cost=0.56..0.60 rows=1 width=8) (actual time=0.018..0.018 rows=1 loops=1)
                   Buffers: shared hit=5
                   ->  Index Only Scan Backward using jfs_edge_parent_inode_name on jfs_edge jfs_edge_2  (cost=0.56..155.59 rows=3602 width=8) (actual time=0.018..0.018 rows=1 loops=1)
                         Index Cond: (parent = '550511'::bigint)
                         Heap Fetches: 0
                         Buffers: shared hit=5
   ->  Nested Loop  (cost=100.91..23791.02 rows=8 width=136) (actual time=0.270..5.169 rows=2309 loops=3)
         Buffers: shared hit=27869
         ->  Parallel Bitmap Heap Scan on jfs_edge  (cost=100.48..12665.66 rows=1501 width=40) (actual time=0.213..0.645 rows=2309 loops=3)
               Recheck Cond: (parent = '550511'::bigint)
               Heap Blocks: exact=49
               Buffers: shared hit=159
               ->  Bitmap Index Scan on "UQE_jfs_edge_edge"  (cost=0.00..99.58 rows=3602 width=0) (actual time=0.552..0.552 rows=6927 loops=1)
                     Index Cond: (parent = '550511'::bigint)
                     Buffers: shared hit=60
         ->  Index Scan using jfs_node_pkey on jfs_node  (cost=0.43..7.41 rows=1 width=96) (actual time=0.002..0.002 rows=1 loops=6927)
               Index Cond: ((inode = jfs_edge.inode) AND (inode >= (InitPlan 2).col1) AND (inode <= (InitPlan 4).col1))
               Buffers: shared hit=27710
 Planning:
   Buffers: shared hit=382
 Planning Time: 1.257 ms
 Execution Time: 10.805 ms
(40 rows)


I also migrated the same data set from PG 17 to MySQL 8.4 and found that the original query costs over 49k buffer gets in MySQL (per show status 'Innodb_buffer_pool_read_requests'). After changing the jfs_node primary key to (parent, inode) with a unique index on (inode), the read amplification on jfs_node is gone; but when jfs_node becomes the driving table, the read amplification can shift to jfs_edge (not always; the test case below does not show it). E.g.:

mysql> show status like 'Innodb_buffer_pool_read_requests';
+----------------------------------+-----------+
| Variable_name                    | Value     |
+----------------------------------+-----------+
| Innodb_buffer_pool_read_requests | 650013786 |
+----------------------------------+-----------+
1 row in set (0.00 sec)

mysql> Explain analyze SELECT * FROM  jfs_edge INNER JOIN jfs_node ON jfs_edge.inode=jfs_node.inode WHERE jfs_edge.parent=550511 ;
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| EXPLAIN                                                                                                                                                                                                                                                                                                                                                                                         |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| -> Nested loop inner join  (cost=11593 rows=16560) (actual time=0.193..19.4 rows=6927 loops=1)
    -> Index lookup on jfs_edge using UQE_jfs_edge_edge (parent=550511)  (cost=5796 rows=16560) (actual time=0.167..8.97 rows=6927 loops=1)
    -> Single-row index lookup on jfs_node using PRIMARY (inode=jfs_edge.inode)  (cost=0.25 rows=1) (actual time=0.0014..0.00142 rows=1 loops=6927)
 |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.02 sec)

mysql> show status like 'Innodb_buffer_pool_read_requests';
+----------------------------------+-----------+
| Variable_name                    | Value     |
+----------------------------------+-----------+
| Innodb_buffer_pool_read_requests | 650062961 |
+----------------------------------+-----------+
1 row in set (0.00 sec)

mysql> select 650062961-650013786;
+---------------------+
| 650062961-650013786 |
+---------------------+
|               49175 |
+---------------------+
1 row in set (0.00 sec)
### adjust primary key of jfs_node and create index on jfs_edge
mysql> alter table jfs_node add unique key (inode);
Query OK, 0 rows affected (5.12 sec)
Records: 0  Duplicates: 0  Warnings: 0

mysql> alter table jfs_node drop primary key ;
Query OK, 2376544 rows affected (8.37 sec)
Records: 2376544  Duplicates: 0  Warnings: 0

mysql> alter table jfs_node add primary key (parent,inode) ;
Query OK, 0 rows affected (14.42 sec)
Records: 0  Duplicates: 0  Warnings: 0

mysql> Create  index  jfs_edge_parent_inode_name on jfs_edge (parent,inode,name) ;
Query OK, 0 rows affected (38.89 sec)
Records: 0  Duplicates: 0  Warnings: 0

### Optimized query with the adjusted primary key on jfs_node still costs 42k buffer gets.
mysql> show status like 'Innodb_buffer_pool_read_requests';
+----------------------------------+-----------+
| Variable_name                    | Value     |
+----------------------------------+-----------+
| Innodb_buffer_pool_read_requests | 661286471 |
+----------------------------------+-----------+
1 row in set (0.00 sec)

mysql> Explain analyze   Select  jfs_edge.name,jfs_node.*  FROM  jfs_edge INNER JOIN jfs_node ON jfs_edge.inode=jfs_node.inode WHERE jfs_edge.parent= 550511 and 
    ->  jfs_node.parent = jfs_edge.parent  
    ->  Union all
    ->  Select jfs_edge.name,jfs_node.* from jfs_edge , jfs_node  where jfs_edge.parent= 550511  and jfs_edge.inode=jfs_node.inode and jfs_node.parent=0 ; 
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| EXPLAIN                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| -> Append  (cost=35904 rows=27096) (actual time=0.0603..23.3 rows=6927 loops=1)
    -> Stream results  (cost=17952 rows=13548) (actual time=0.0595..14 rows=6927 loops=1)
        -> Nested loop inner join  (cost=17952 rows=13548) (actual time=0.0564..11.7 rows=6927 loops=1)
            -> Covering index lookup on jfs_edge using jfs_edge_parent_inode_name (parent=550511)  (cost=3049 rows=13548) (actual time=0.0475..1.2 rows=6927 loops=1)
            -> Single-row index lookup on jfs_node using PRIMARY (parent=550511, inode=jfs_edge.inode)  (cost=1 rows=1) (actual time=0.00142..0.00143 rows=1 loops=6927)
    -> Stream results  (cost=17952 rows=13548) (actual time=8.96..8.96 rows=0 loops=1)
        -> Nested loop inner join  (cost=17952 rows=13548) (actual time=8.96..8.96 rows=0 loops=1)
            -> Covering index lookup on jfs_edge using jfs_edge_parent_inode_name (parent=550511)  (cost=3049 rows=13548) (actual time=0.0258..1.25 rows=6927 loops=1)
            -> Single-row index lookup on jfs_node using PRIMARY (parent=0, inode=jfs_edge.inode)  (cost=1 rows=1) (actual time=0.00106..0.00106 rows=0 loops=6927)
 |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.03 sec)

mysql> show status like 'Innodb_buffer_pool_read_requests';
+----------------------------------+-----------+
| Variable_name                    | Value     |
+----------------------------------+-----------+
| Innodb_buffer_pool_read_requests | 661328473 |
+----------------------------------+-----------+
1 row in set (0.00 sec)

mysql> select 661328473-661286471;
+---------------------+
| 661328473-661286471 |
+---------------------+
|               42002 |
+---------------------+
1 row in set (0.00 sec)


I also tried changing the primary key of jfs_edge to (parent, inode, name); the buffer gets of my optimized query are still about 40-50k.
The approach of using jfs_node.inode (min, max) to limit the index scan range does not work, or at least cannot be guaranteed to always work as expected, since the jfs_node.inode range under a single directory varies a lot; e.g. adding a new file to the directory changes the max(inode) value.
Combining all the information above, I guess it is hard to make one SQL optimization fit all RDBMSs. Probably something at a higher level to cache readdir results could be adopted, e.g. letting the application layer decide whether the end user's workload can tolerate cached readdir results. #5462 may be worth a try.

anysql (Contributor) commented Dec 30, 2024

Yes, it's hard to get a one-size-fits-all solution here.
Maybe we could choose not to support hard links, and combine jfs_edge and jfs_node into one table with all columns.

frostwind (Author) commented:
I think hard link support is still needed. The number of hard links is not large, but they do get used; e.g. in our workload some data vendors implement multi-versioning with incremental data delivery using hard links.

anysql (Contributor) commented Dec 30, 2024

I think hard link support is still needed. The number of hard links is not large, but they do get used; e.g. in our workload some data vendors implement multi-versioning with incremental data delivery using hard links.

Clone is better for your case. Consider the Windows file system (NTFS): hard links are not supported at all there.

anysql (Contributor) commented Dec 30, 2024

We are considering controlling the concurrency of readdir per directory, to reduce the load caused by directory scans.

frostwind (Author) commented:
Can you share a bit more about the readdir concurrency control? If possible, could the new concurrency control be designed with a flag to enable/disable it? Some users may prefer to spend more metadata DB CPU rather than limit overall TPS. Also, I tested the readdir cache from #5462 and it seems to work well. I think we can close this issue; we don't need to go further with the SQL optimization approach.
