listdir/readdir SQL may be improved #5417
We may also need to consider the case where jfs_edge.parent != jfs_node.parent. Is that a hard link?
Considering the existence of hard links, we may modify the query along these lines:
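A sketch, assuming hard-linked inodes are stored with parent = 0 in jfs_node (the exact filter is an assumption):

```sql
-- $1 = inode of the directory being listed
SELECT jfs_edge.*, jfs_node.*
FROM jfs_edge
INNER JOIN jfs_node ON jfs_edge.inode = jfs_node.inode
WHERE jfs_edge.parent = $1
  AND (jfs_node.parent = $1 OR jfs_node.parent = 0);
```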
pgbench shows about 10% CPU usage.
We also need to be cautious on a file system with many hard links, which will degrade the parent=0 subquery. Let me double-check its performance.
Considering hard links, we may use the query below (selecting only jfs_edge.name, since the juicefs main branch already seems to select only jfs_edge.name instead of jfs_edge.*):
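Something like this (a sketch; the selected columns and the hard-link condition are assumptions):

```sql
-- $1 = inode of the directory being listed
SELECT jfs_edge.name, jfs_node.*
FROM jfs_edge
INNER JOIN jfs_node ON jfs_edge.inode = jfs_node.inode
WHERE jfs_edge.parent = $1
  AND (jfs_node.parent = $1 OR jfs_node.parent = 0);
```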
The following index is also needed:
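For example, in PostgreSQL jfs_node_multi could be a covering index (a sketch; the key/INCLUDE split is an assumption):

```sql
-- Lets the jfs_node side of the join be answered from the index alone
CREATE INDEX jfs_node_multi ON jfs_node (parent) INCLUDE (inode);
```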
If the database does not support covering indexes, the jfs_node_multi index can be converted to a normal index with all columns, or with just the (parent, inode) columns from jfs_node; that can also greatly reduce buffer gets.
In a previous test, I copied a directory containing 14k hard links.
Then I used a script to generate 1 million hard links to the same file. The query plan and buffer gets remained stable.
Ran another test: created 10k files, then created 1000 directories, each containing 10k hard links pointing to the 10k files created before.
juicefs seems to use prepared statements, so I tested with prepared statements as well. The query plan still stays stable, with a few hundred to 3000 buffer gets. I also set aggressive vacuum parameters in PostgreSQL 17 to make vacuum on jfs_node happen more frequently, to avoid too many buffer gets being wasted on "Heap Fetches".
To avoid the jfs_node_multi index generating too many "heap fetches" in PostgreSQL, I am using settings that lower the autovacuum thresholds so it runs more frequently:
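A sketch of the per-table tuning meant here (the exact thresholds are assumptions):

```sql
-- Vacuum jfs_node more often so the visibility map stays fresh
-- and index-only scans do not fall back to heap fetches
ALTER TABLE jfs_node SET (
  autovacuum_vacuum_scale_factor = 0.01,
  autovacuum_vacuum_threshold    = 1000
);
```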
Also set up a cron job running VACUUM with "(INDEX_CLEANUP on)" so it cleans up dead tuples in the index.
A simpler way is to enable this setting at the table level; we probably only need vacuum_index_cleanup for jfs_node.
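For example, both options could look like this (a sketch):

```sql
-- Periodic job (e.g. from cron): force index cleanup of dead tuples
VACUUM (INDEX_CLEANUP ON, ANALYZE) jfs_node;

-- Or set it once at the table level via the storage parameter
ALTER TABLE jfs_node SET (vacuum_index_cleanup = on);
```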
Anyway, the special vacuum settings above are just a pain for PostgreSQL users. If juicefs can implement the query rewriting with the proper index setup, and provide the above vacuum guidance to PostgreSQL users, that would be good enough.
In PostgreSQL we use an "include" index.
If it is difficult to create different index syntax for different databases, we can just create the covering index the same way as for MySQL and leave it to PostgreSQL users to decide whether to convert that covering index to an "include" index.
Thank you for sharing the profile results. We will delve deeper and improve this.
There is a problem with your query: a hard link may be created in a different directory, so it has a different parent node.
@anysql
Inode is the primary key of the jfs_node table, so a nested loop join is enough. The high cost is due to the merge join being used; maybe you should analyze your tables?
@anysql
Which is costly: about 27k buffer gets * 8K block size = 216 MB of data scanned. This consumes about 30 ms of DB CPU at idle and over 100 ms of DB CPU under load. pgbench shows the original query can only reach 300+ TPS with 80% DB CPU usage.
Can you check the cost of a nested loop join in pg?
ListDir and ReadDir usually read in batches; the batch size is limited by FUSE, maybe a few KBs.
Actually, the query plan with the "merge join" is the good plan, obtained after rewriting the query and adding the index. The original slow query does not use a "merge join". Let me copy-paste them here to make it clear.
This is a common problem for RDBMSs; I expect MySQL would show similar benchmark results.
As both jfs_edge and jfs_node have just a few columns, creating an index with many columns will slow down update operations. For pg it may also hit vacuum performance problems.
I understand your concern that a wide index may slow down updates. In practice we use PG 17 and do not see vacuum causing much trouble in our workload. In PG 11+ versions, vacuum speed/performance has improved a lot.
The 2nd query contains a long IN-list with 6927 items (exactly the size of my directory) and costs 24k buffer gets. It has a similar query plan to the old query; the only benefit is that selecting only "jfs_edge".name saves some buffer gets, but "jfs_node" still contributes high buffer gets. In some PG versions, such a long IN-list could cause memory leaks and slow query parsing. From a DBA's point of view, we should be cautious about the potential performance hit of such a long IN-list query on the database. Update: MySQL's batched key access feature may improve the read amplification, but I haven't tested it yet; I only have a Postgres environment.
+1 for this performance issue. Another issue is that the SQL engine generates a large number of connections: we have a cluster of ~10 clients, which results in tens of thousands of connections to the database, and most of them are short-lived connections.
The pgbench of readdir against a directory with 6.9k symlinks was run on a bare-metal DB with 48 CPU cores; it consumes 80% CPU across all 48 cores and only reaches 300+ TPS.
Please include the inode column in the "UQE_jfs_edge_edge" index, or add a new index on (parent, inode). The include syntax is not supported by all databases, so you need to do it yourself. Can you bench the following query with a merge join? `SELECT * FROM "jfs_edge" INNER JOIN "jfs_node" ON jfs_edge.inode=jfs_node.inode`
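For example (a sketch, assuming UQE_jfs_edge_edge is the unique index on (parent, name); the new index names are made up):

```sql
-- Option 1: recreate the unique edge index with inode as an included column
-- (drop the existing UQE_jfs_edge_edge first)
CREATE UNIQUE INDEX uqe_jfs_edge_edge_inode ON jfs_edge (parent, name) INCLUDE (inode);

-- Option 2: add a separate composite index
CREATE INDEX ix_jfs_edge_parent_inode ON jfs_edge (parent, inode);
```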
After creating the two suggested indexes, the query performance is still not good when using that join.
And I migrated the same data set from PG 17 to MySQL 8.4 and found that the original query costs over 49k buffer gets in MySQL (per show status like 'Innodb_buffer_pool_read_requests'). After I modified the primary key of jfs_node to (parent, inode) along with a unique index on (inode), the read amplification on jfs_node is gone, but MySQL then uses jfs_node as the driving table and the read amplification moves to jfs_edge (not always; my test case does not reflect it).
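The MySQL schema change described above would look roughly like this (a sketch; the key name is made up, and it assumes inode carries no AUTO_INCREMENT attribute):

```sql
ALTER TABLE jfs_node
  ADD UNIQUE KEY uk_jfs_node_inode (inode),  -- keep fast lookup by inode
  DROP PRIMARY KEY,
  ADD PRIMARY KEY (parent, inode);           -- cluster one directory's rows together
```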
I also tried adjusting the primary key of jfs_edge to (parent, inode, name); the buffer gets of my optimized query are still about 40-50k.
Yes, it's hard to get a one-size-fits-all solution here.
I think hard link support is still needed. The number of hard links is not large, but they are used. E.g., in our workload, some data vendors implement multi-versioning with incremental data delivery using hard links.
Clone is better for your case. Just think of the Windows file system (NTFS), where hard links are not supported at all.
We are considering controlling the concurrency of readdir on a directory to reduce the load from directory scans.
Can you introduce the readdir concurrency control a bit? If possible, can we design the new concurrency control with a flag to enable/disable it? Maybe some users would rather use more of the metadata DB's CPU than affect overall TPS. Also, I tested the readdir cache from #5462 and it seems to work well. I think we can close this issue; we don't need to go further with the SQL optimization approach.
What would you like to be added:
If a client uses readdir or os.listdir in Python, it will issue the query below against the juicefs metadata DB:
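Roughly (paraphrased; the exact SQL generated by the client may differ):

```sql
-- $1 = inode of the directory being listed
SELECT jfs_edge.*, jfs_node.*
FROM jfs_edge
INNER JOIN jfs_node ON jfs_edge.inode = jfs_node.inode
WHERE jfs_edge.parent = $1;
```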
This query can consume significant DB CPU in some scenarios. E.g., one of our workloads calls this query repeatedly against a directory with 6.9k symlinks, and it consumes 90% of the overall CPU among all juicefs queries, with a 100 ms average time cost under high load.
Since both jfs_edge and jfs_node include a "parent" column, it is possible to create a covering index on jfs_node on the parent column and rewrite the query, e.g. in PostgreSQL:
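A sketch of what that could look like (index name, the INCLUDE list, and the exact rewrite are assumptions):

```sql
-- Covering index keyed by parent so the jfs_node side can be an index-only scan
CREATE INDEX jfs_node_parent_inode ON jfs_node (parent) INCLUDE (inode);

-- Rewritten readdir query that also filters jfs_node by parent ($1 = directory inode)
SELECT jfs_edge.name, jfs_node.*
FROM jfs_edge
INNER JOIN jfs_node ON jfs_edge.inode = jfs_node.inode
WHERE jfs_edge.parent = $1
  AND jfs_node.parent = $1;  -- hard links need extra handling, as discussed above
```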
Using pgbench, the old SQL shows 334 TPS and uses 80% of the CPU on a DB server with 48 CPU cores.
Using pgbench, the new SQL with the new index shows 522 TPS and uses only 10% of the CPU on the same 48-core DB server.
In PostgreSQL, a covering index may suffer performance degradation when "vacuum" does not happen in a timely manner, causing tuple visibility issues; if "vacuum" is configured properly, the chance of encountering such degradation should be minimal.
MySQL may not have the "include" syntax, but creating a normal index with all the necessary columns can meet the same goal.
Why is this needed: