

PostgreSQL 14 preview - Hybrid Hash/Nested Loop joins and caching results from subplans - caching the intermediate results of a join's inner table

Author

digoal

Date

2020-11-10

Tags

PostgreSQL, cache result


Background

Hybrid Hash/Nested Loop joins and caching results from subplans

This feature addresses the case where the outer table a has heavily duplicated join values (e.g. 100,000 rows whose values all fall within 1-100) while the inner table b is small (e.g. covering exactly the range 1-100), so a nested loop rescans b 100,000 times.

In other words, b is looped far too many times. Even compared with an index scan on b, returning a cached result is much faster.
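To make the shape concrete, here is a minimal sketch of that scenario (table and column names are illustrative, not taken from the patch or the thread):

```
create table a (v int);                       -- outer: 100,000 rows, values only 1-100
insert into a select x % 100 + 1 from generate_series(1, 100000) x;

create table b (v int primary key);           -- inner: small, exactly the range 1-100
insert into b select generate_series(1, 100);

vacuum analyze a, b;

-- a parameterized nested loop rescans b 100,000 times even though only 100
-- distinct parameter values ever reach it; a result cache can answer the
-- other 99,900 rescans from memory
explain analyze select count(*) from a join b on a.v = b.v;
```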

https://commitfest.postgresql.org/30/2569/

https://www.postgresql.org/message-id/flat/CAApHDvrPcQyQdWERGYWx8J%2B2DLUNgXu%2BfOSbQ1UscxrunyXyrQ%40mail.gmail.com

Hackers,    
    
    
    
Over on [1], Heikki mentioned the usefulness of caching results
from parameterized subplans so that they could be used again for
subsequent scans which have the same parameters as a previous scan.
On [2], I mentioned that parameterized nested loop joins could see
similar gains with such a cache. I suggested there that instead of
adding code that only allows this to work for subplans, we add a new
node type that can handle the caching for us.  We can then just
inject that node type in places where it seems beneficial.
    
    
    
I've attached a patch which implements this.  The new node type is    
called "Result Cache".  I'm not particularly wedded to keeping that    
name, but if I change it, I only want to do it once. I've got a few    
other names in mind, but I don't feel strongly or confident enough in
them to go and do the renaming.    
    
    
    
How the caching works:    
    
    
    
First off, it's only good for plugging in on top of parameterized    
nodes that are rescanned with different parameters. The cache itself    
uses a hash table using the simplehash.h implementation.  The memory    
consumption is limited to work_mem. The code maintains an LRU list and    
when we need to add new entries but don't have enough space to do so,    
we free off older items starting at the top of the LRU list.  When we    
get a cache hit, we move that entry to the end of the LRU list so that    
it'll be the last to be evicted.    
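A hypothetical way to watch the eviction machinery from SQL, assuming the patch is applied: shrink work_mem until the cache cannot hold every distinct key, and check whether "Cache Evictions" in EXPLAIN ANALYZE rises above zero (this reuses the hundredk/lookup tables built in the demo further down):

```
set work_mem = '64kB';
explain (analyze, costs off)
select count(*) from hundredk hk inner join lookup l on hk.thousand = l.a;
-- with 1000 distinct keys of ~100 rows each, the whole working set no longer
-- fits, so older entries are freed from the top of the LRU list and the
-- "Cache Evictions" counter becomes non-zero
reset work_mem;
```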
    
    
    
When should we cache:    
    
    
    
For nested loop joins, the decision is made purely based on cost.  The    
costing model looks at the expected number of calls, the distinct    
value estimate and work_mem size. It then determines how many items    
can be cached and then goes on to estimate an expected cache hit ratio    
and also an eviction ratio. It adjusts the input costs according to    
those ratios and adds some additional charges for caching and cache    
lookups.    
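As a back-of-envelope illustration (this is not the planner's exact formula), when every distinct entry fits in work_mem the expected hit ratio is roughly (calls - ndistinct) / calls. The nested-loop demo below matches this: 100,000 calls over 1,000 distinct keys yield 1,000 misses and 99,000 hits:

```
select 100000 as expected_calls,
       1000   as distinct_values,
       round((100000 - 1000)::numeric / 100000, 2) as est_hit_ratio;  -- 0.99
```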
    
    
    
For subplans, since we plan subplans before we're done planning the    
outer plan, there's very little information to go on about the number    
of times that the cache will be looked up. For now, I've coded things    
so the cache is always used for EXPR_SUBLINK type subplans.  There may    
be other types of subplan that could support caching too, but I've not    
really gone through them all yet to determine which.  I certainly know    
there's some that we can't cache for.    
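For reference, an EXPR_SUBLINK is the plain scalar subquery expression, like the one used in the demo below; under the patch its subplan result is always considered for caching:

```
select relname,
       (select count(*) from pg_class c2
        where c2.relkind = c1.relkind)  -- correlated scalar subquery: EXPR_SUBLINK
from pg_class c1;
```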
    
    
    
Why caching might be good:    
    
    
    
With hash joins, it's sometimes not so great that we have to hash the    
entire inner plan and only probe a very small number of values.  If we    
were able to only fill the hash table with values that are needed,    
then a lot of time and memory could be saved.  Effectively, the
patch does exactly this with the combination of a parameterized nested    
loop join with a Result Cache node above the inner scan.    
    
    
    
For subplans, the gains can be even greater, since subplans are often
much more expensive to execute than what might go on the inside of a
parameterized nested loop join.
    
    
    
Current problems and some ways to make it better:    
    
    
    
The patch does rely heavily on good ndistinct estimates. One    
unfortunate problem is that if the planner has no statistics for    
whatever it's trying to estimate for, it'll default to returning    
DEFAULT_NUM_DISTINCT (200).  That may cause the Result Cache to appear    
much more favourable than it should.  One way I can think to work    
around that would be to have another function similar to    
estimate_num_groups() which accepts a default value which it will    
return if it was unable to find statistics to use.  In this case, such    
a function could just be called passing the number of input rows as    
the default, which would make the costing code think each value is    
unique, which would not be favourable for caching.  I've not done    
anything like that in what I've attached here.  That solution would    
also do nothing if the ndistinct estimate was available, but was just    
incorrect, as it often is.    
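The fallback itself is easy to observe on stock PostgreSQL: before a table has been analyzed there is no pg_stats entry for the column, so selectivity code falls back to DEFAULT_NUM_DISTINCT (200). The table name below is illustrative:

```
create table no_stats (v int);
insert into no_stats select x from generate_series(1, 100000) x;

-- before ANALYZE (and before autovacuum catches up) this returns no row,
-- so ndistinct estimates silently default to 200
select n_distinct from pg_stats where tablename = 'no_stats' and attname = 'v';

analyze no_stats;

-- now the column is known to be all-distinct (n_distinct = -1)
select n_distinct from pg_stats where tablename = 'no_stats' and attname = 'v';
```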
    
    
    
There are currently a few compiler warnings with the patch due to the    
scope of the simplehash.h hash table. Because the scope is static    
rather than extern there's a load of unused function warnings.  Not    
sure yet the best way to deal with this. I don't want to change the    
scope to extern just to keep compilers quiet.    
    
    
    
Also during cache_reduce_memory(), I'm performing a hash table lookup    
followed by a hash table delete. I already have the entry to delete,    
but there's no simplehash.h function that allows deletion by element    
pointer, only by key. This wastes a hash table lookup. I'll likely    
make an adjustment to the simplehash.h code to export the delete code    
as a separate function to fix this.    
    
    
    
Demo:    
    
    
    
```
# explain (analyze, costs off) select relname,(select count(*) from pg_class c2 where c1.relkind = c2.relkind) from pg_class c1;
                                       QUERY PLAN
----------------------------------------------------------------------------------------
 Seq Scan on pg_class c1 (actual time=0.069..0.470 rows=391 loops=1)
   SubPlan 1
     ->  Result Cache (actual time=0.001..0.001 rows=1 loops=391)
           Cache Key: c1.relkind
           Cache Hits: 387  Cache Misses: 4  Cache Evictions: 0  Cache Overflows: 0
           ->  Aggregate (actual time=0.062..0.062 rows=1 loops=4)
                 ->  Seq Scan on pg_class c2 (actual time=0.007..0.056 rows=98 loops=4)
                       Filter: (c1.relkind = relkind)
                       Rows Removed by Filter: 293
 Planning Time: 0.047 ms
 Execution Time: 0.536 ms
(11 rows)
```
    
    
    
```
# set enable_resultcache=0; -- disable result caching
SET
# explain (analyze, costs off) select relname,(select count(*) from pg_class c2 where c1.relkind = c2.relkind) from pg_class c1;
                                     QUERY PLAN
-------------------------------------------------------------------------------------
 Seq Scan on pg_class c1 (actual time=0.070..24.619 rows=391 loops=1)
   SubPlan 1
     ->  Aggregate (actual time=0.062..0.062 rows=1 loops=391)
           ->  Seq Scan on pg_class c2 (actual time=0.009..0.056 rows=120 loops=391)
                 Filter: (c1.relkind = relkind)
                 Rows Removed by Filter: 271
 Planning Time: 0.042 ms
 Execution Time: 24.653 ms
(8 rows)
```
    
    
    
```
-- Demo with parameterized nested loops
create table hundredk (hundredk int, tenk int, thousand int, hundred int, ten int, one int);
insert into hundredk select x%100000,x%10000,x%1000,x%100,x%10,1 from generate_Series(1,100000) x;
create table lookup (a int);
insert into lookup select x from generate_Series(1,100000)x, generate_Series(1,100);
create index on lookup(a);
vacuum analyze lookup, hundredk;
```
    
    
    
```
# explain (analyze, costs off) select count(*) from hundredk hk inner join lookup l on hk.thousand = l.a;
                                                   QUERY PLAN
-----------------------------------------------------------------------------------------------------------------
 Aggregate (actual time=1876.710..1876.710 rows=1 loops=1)
   ->  Nested Loop (actual time=0.013..1371.690 rows=9990000 loops=1)
         ->  Seq Scan on hundredk hk (actual time=0.005..8.451 rows=100000 loops=1)
         ->  Result Cache (actual time=0.000..0.005 rows=100 loops=100000)
               Cache Key: hk.thousand
               Cache Hits: 99000  Cache Misses: 1000  Cache Evictions: 0  Cache Overflows: 0
               ->  Index Only Scan using lookup_a_idx on lookup l (actual time=0.002..0.011 rows=100 loops=1000)
                     Index Cond: (a = hk.thousand)
                     Heap Fetches: 0
 Planning Time: 0.113 ms
 Execution Time: 1876.741 ms
(11 rows)
```
    
    
    
```
# set enable_resultcache=0;
SET
# explain (analyze, costs off) select count(*) from hundredk hk inner join lookup l on hk.thousand = l.a;
                                                QUERY PLAN
-----------------------------------------------------------------------------------------------------------
 Aggregate (actual time=2401.351..2401.352 rows=1 loops=1)
   ->  Merge Join (actual time=28.412..1890.905 rows=9990000 loops=1)
         Merge Cond: (l.a = hk.thousand)
         ->  Index Only Scan using lookup_a_idx on lookup l (actual time=0.005..10.170 rows=99901 loops=1)
               Heap Fetches: 0
         ->  Sort (actual time=28.388..576.783 rows=9990001 loops=1)
               Sort Key: hk.thousand
               Sort Method: quicksort  Memory: 7760kB
               ->  Seq Scan on hundredk hk (actual time=0.005..11.039 rows=100000 loops=1)
 Planning Time: 0.123 ms
 Execution Time: 2401.379 ms
(11 rows)
```
    
    
    
Cache Overflows:    
    
    
    
You might have noticed "Cache Overflow" in the EXPLAIN ANALYZE output.    
This happens if a single scan of the inner node exhausts the cache    
memory. In this case, all the other entries will already have been    
evicted in an attempt to make space for the current scan's tuples.    
However, if we see an overflow then the size of the results from a    
single scan alone must have exceeded work_mem.  There might be some    
tweaking to do here as it seems a shame that a single overly large
scan would flush the entire cache. I doubt it would be too hard to
limit the flushing to some percentage of work_mem, similar to how
large seqscans don't entirely flush shared_buffers.
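A hypothetical setup that should provoke an overflow, assuming the patch is applied: arrange for a single parameter value's result set to exceed work_mem, so one rescan alone exhausts the cache (all object names here are illustrative):

```
create table outer_t (k int);
insert into outer_t select 1 from generate_series(1, 1000);

create table big_inner (k int, pad text);
insert into big_inner select 1, repeat('x', 100) from generate_series(1, 100000);
create index on big_inner (k);
vacuum analyze outer_t, big_inner;

set work_mem = '64kB';
-- a single rescan of big_inner for k = 1 returns ~10 MB of tuples, far more
-- than work_mem, so the entry can never be completed and the scan should be
-- counted under "Cache Overflows" (if the planner picks the cached plan)
explain (analyze, costs off)
select count(*) from outer_t o join big_inner b on o.k = b.k;
reset work_mem;
```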
    
    
    
Current Status:    
    
    
    
I've spent quite a bit of time getting this working. I'd like to take    
a serious go at making this happen for PG14.  For now, it all seems to    
work. I have some concerns about bad statistics causing nested loop    
joins to be favoured more than they were previously due to the result    
cache further lowering the cost of them when the cache hit ratio is    
thought to be high.    
    
    
    
For now, the node type is parallel_safe, but not parallel_aware. I can    
see that a parallel_aware version would be useful, but I've not done    
that here. Anything in that area will not be part of my initial    
effort.  The unfortunate part about that is the actual hit ratio will    
drop with more parallel workers since the caches of each worker are    
separate.    
    
    
    
Some tests show a 10x speedup on TPC-H Q2.    
    
    
    
I'm interested in getting feedback on this before doing much further work on it.    
    
    
    
Does it seem like something we might want for PG14?    
    
    
    
David    
    
    
    
[1] https://www.postgresql.org/message-id/daceb327-9a20-51f4-fe6c-60b898692305%40iki.fi    
[2] https://www.postgresql.org/message-id/CAKJS1f8oNXQ-LqjK%3DBOFDmxLc_7s3uFr_g4qi7Ncrjig0JOCiA%40mail.gmail.com    
