|
| 1 | +## PostgreSQL 与关系代数 (Equi-Join , Semi-Join , Anti-Join , Division) |
| 2 | + |
| 3 | +### 作者 |
| 4 | +digoal |
| 5 | + |
| 6 | +### 日期 |
| 7 | +2018-02-05 |
| 8 | + |
| 9 | +### 标签 |
| 10 | +PostgreSQL , 关系代数 , EquiJoin , SemiJoin , AntiJoin , Division |
| 11 | + |
| 12 | +---- |
| 13 | + |
| 14 | +## 背景 |
| 15 | +关系数据库中很多操作来自关系代数中的一些概念。例如常见的JOIN操作,下面是关系代数中的一些概念。 |
| 16 | + |
| 17 | +https://en.wikipedia.org/wiki/Relational_algebra |
| 18 | + |
| 19 | +JOIN本身也分好多种比如EquiJoin , SemiJoin , AntiJoin , Division。 |
| 20 | + |
| 21 | +## EquiJoin |
| 22 | +这种JOIN最为常见。例如: |
| 23 | + |
| 24 | +``` |
| 25 | +select a.* from a join b on (a.xx = b.xx); |
| 26 | +``` |
| 27 | + |
| 28 | +实际上关系代数中为θ-join,包括(<, ≤, =, >, ≥),当使用=时,对应的就是equijoin. |
| 29 | + |
| 30 | +只要操作符(JOIN条件)返回TRUE,就输出对应的JOIN记录。(也可以理解为笛卡尔乘积中,仅返回JOIN条件为TRUE的那些) |
| 31 | + |
| 32 | +## SemiJoin |
| 33 | +返回在Employee中的记录,同时这条记录与Dept中的所有记录一对多操作时,有一个返回TRUE的操作即可。 |
| 34 | + |
| 35 | +例如 |
| 36 | + |
| 37 | +``` |
| 38 | +select * from Employee where exists |
| 39 | + (select 1 from Dept where Employee.DeptName = Dept.DeptName); -- 现实中操作符可以随意替代,代表不同语义 |
| 40 | +``` |
| 41 | + |
| 42 | + |
| 43 | + |
| 44 | +由于semiJoin的操作在EXISTS中只要有一条符合TRUE即可,所以很大概率下并不需要扫描全量Dept。 |
| 45 | + |
| 46 | +semiJOIN支持hash, merge, nestloop几种JOIN方法。 |
| 47 | + |
| 48 | +Employee很小,并且Dept有索引时,NESTLOOP就会比较快。 |
| 49 | + |
| 50 | +Employee很大时,使用hash就很快。 |
| 51 | + |
| 52 | +PostgreSQL 11在hash操作上有了极大的性能提升: |
| 53 | + |
| 54 | +[《PostgreSQL 11 preview - parallel hash (含hash JOIN , hash agg等) 性能极大提升》](../201802/20180201_01.md) |
| 55 | + |
| 56 | +[《PostgreSQL dblink异步调用实现 并行hash分片JOIN - 含数据交、并、差 提速案例》](../201802/20180201_02.md) |
| 57 | + |
| 58 | +## AntiJoin |
| 59 | +AntiJoin与SemiJoin表达的意思有点相反,要求Employee中的每一条记录,与Dept中所有记录进行操作后,Dept中没有任何一条能满足。返回在Employee中的这样的记录。 |
| 60 | + |
| 61 | +例如 |
| 62 | + |
| 63 | +``` |
| 64 | +select * from Employee where not exists |
| 65 | + (select 1 from Dept where Employee.DeptName = Dept.DeptName); -- 现实中操作符可以随意替代,代表不同语义 |
| 66 | +``` |
| 67 | + |
| 68 | + |
| 69 | + |
| 70 | +AntiJoin要求Employee中每一条记录与Dept所有记录进行操作,并且所有操作都不满足条件,这条算作有效记录,返回该Employee的记录。 |
| 71 | + |
| 72 | +对于JOIN操作符为=号的,不管是semijoin还是antijoin,都可以用HASH join,达到非常好的加速效果。 |
| 73 | + |
| 74 | +## Division |
| 75 | +JOIN中的除法运算,没有对应的SQL,需要写多条SQL或者使用CTE语法写一条SQL来实现。 |
| 76 | + |
| 77 | + |
| 78 | + |
| 79 | + |
| 80 | + |
| 81 | +1、补齐 |
| 82 | + |
| 83 | +tmp1: |
| 84 | + |
| 85 | +``` |
| 86 | +select Student, Task from |
| 87 | +( |
| 88 | + select distinct Student from Completed |
| 89 | +) t1 |
| 90 | +, |
| 91 | +( |
| 92 | + select Task from DBProject |
| 93 | +) t2; |
| 94 | +``` |
| 95 | + |
| 96 | +2、使用AntiJoin计算余数 |
| 97 | + |
| 98 | +tmp2: |
| 99 | + |
| 100 | +``` |
| 101 | +select Student from Completed where not exists |
| 102 | + (select 1 from tmp1 where tmp1.Student=Completed.Student and tmp1.Task=Completed.Task); |
| 103 | +``` |
| 104 | + |
| 105 | +3、去重,并使用except求差,得到最终结果 |
| 106 | + |
| 107 | +``` |
| 108 | +select distinct Student from Completed |
| 109 | +except |
| 110 | +select Student from tmp2; |
| 111 | +``` |
| 112 | + |
| 113 | + |
| 114 | + |
| 115 | +### CTE实现Division |
| 116 | + |
| 117 | +``` |
| 118 | +with |
| 119 | + t1 as (select distinct Student as Student from Completed), |
| 120 | + tmp1 as (select Student, Task from t1, (select Task from DBProject) t2), |
| 121 | + tmp2 as (select Student from Completed where not exists |
| 122 | + (select 1 from tmp1 where tmp1.Student=Completed.Student and tmp1.Task=Completed.Task) |
| 123 | + ) |
| 124 | +select Student from t1 |
| 125 | +except |
| 126 | +select Student from tmp2; |
| 127 | +``` |
| 128 | + |
| 129 | +### 除法求余 |
| 130 | + |
| 131 | +outerjoin不再赘述。 |
| 132 | + |
| 133 | +## Paralle HASH JOIN (equijoin, semijoin, antijoin)性能指标 |
| 134 | +PostgreSQL 11 |
| 135 | + |
| 136 | +64线程机器,使用HASH并行。 |
| 137 | + |
| 138 | +测试数据: |
| 139 | + |
| 140 | +``` |
| 141 | +postgres=# create table a(id int); |
| 142 | +CREATE TABLE |
| 143 | + |
| 144 | +postgres=# create table b(id int); |
| 145 | +CREATE TABLE |
| 146 | + |
| 147 | +postgres=# insert into a select generate_series(1,100000000); |
| 148 | +INSERT 0 100000000 |
| 149 | + |
| 150 | +postgres=# insert into b select generate_series(1,1000000); |
| 151 | +INSERT 0 1000000 |
| 152 | +``` |
| 153 | + |
| 154 | +### 1 Equi-Join |
| 155 | + |
| 156 | +``` |
| 157 | +postgres=# explain analyze select count(*) from a join b using (id); |
| 158 | + QUERY PLAN |
| 159 | +------------------------------------------------------------------------------------------------------------------------------------------------ |
| 160 | + Finalize Aggregate (cost=469787.02..469787.03 rows=1 width=8) (actual time=902.399..902.399 rows=1 loops=1) |
| 161 | + -> Gather (cost=469780.45..469786.86 rows=64 width=8) (actual time=901.209..902.383 rows=65 loops=1) |
| 162 | + Workers Planned: 64 |
| 163 | + Workers Launched: 64 |
| 164 | + -> Partial Aggregate (cost=468780.45..468780.46 rows=1 width=8) (actual time=843.689..843.690 rows=1 loops=65) |
| 165 | + -> Parallel Hash Join (cost=4776.56..468741.38 rows=15625 width=0) (actual time=38.430..842.222 rows=15385 loops=65) |
| 166 | + Hash Cond: (a.id = b.id) |
| 167 | + -> Parallel Seq Scan on a (cost=0.00..458103.01 rows=1562500 width=4) (actual time=0.023..296.974 rows=1538462 loops=65) |
| 168 | + -> Parallel Hash (cost=4581.25..4581.25 rows=15625 width=4) (actual time=36.133..36.133 rows=15385 loops=65) |
| 169 | + Buckets: 1048576 Batches: 1 Memory Usage: 48832kB |
| 170 | + -> Parallel Seq Scan on b (cost=0.00..4581.25 rows=15625 width=4) (actual time=0.022..2.093 rows=15385 loops=65) |
| 171 | + Planning time: 0.117 ms |
| 172 | + Execution time: 990.915 ms |
| 173 | +(13 rows) |
| 174 | +``` |
| 175 | + |
| 176 | +### 2 Semi-join |
| 177 | + |
| 178 | +``` |
| 179 | +postgres=# explain analyze select count(*) from a where exists (select 1 from b where a.id=b.id); |
| 180 | + QUERY PLAN |
| 181 | +------------------------------------------------------------------------------------------------------------------------------------------------ |
| 182 | + Finalize Aggregate (cost=468200.59..468200.60 rows=1 width=8) (actual time=890.449..890.449 rows=1 loops=1) |
| 183 | + -> Gather (cost=468194.02..468200.43 rows=64 width=8) (actual time=889.040..890.434 rows=65 loops=1) |
| 184 | + Workers Planned: 64 |
| 185 | + Workers Launched: 64 |
| 186 | + -> Partial Aggregate (cost=467194.02..467194.03 rows=1 width=8) (actual time=831.249..831.249 rows=1 loops=65) |
| 187 | + -> Parallel Hash Semi Join (cost=4776.56..467154.96 rows=15625 width=0) (actual time=37.204..829.763 rows=15385 loops=65) |
| 188 | + Hash Cond: (a.id = b.id) |
| 189 | + -> Parallel Seq Scan on a (cost=0.00..458103.01 rows=1562500 width=4) (actual time=0.024..289.738 rows=1538462 loops=65) |
| 190 | + -> Parallel Hash (cost=4581.25..4581.25 rows=15625 width=4) (actual time=35.672..35.672 rows=15385 loops=65) |
| 191 | + Buckets: 1048576 Batches: 1 Memory Usage: 48896kB |
| 192 | + -> Parallel Seq Scan on b (cost=0.00..4581.25 rows=15625 width=4) (actual time=0.023..2.090 rows=15385 loops=65) |
| 193 | + Planning time: 0.132 ms |
| 194 | + Execution time: 980.261 ms |
| 195 | +(13 rows) |
| 196 | +``` |
| 197 | + |
| 198 | +### 3 Anti-Join |
| 199 | + |
| 200 | +``` |
| 201 | +postgres=# explain analyze select count(*) from a where not exists (select 1 from b where a.id=b.id); |
| 202 | + QUERY PLAN |
| 203 | +------------------------------------------------------------------------------------------------------------------------------------------------ |
| 204 | + Finalize Aggregate (cost=487341.22..487341.23 rows=1 width=8) (actual time=1171.201..1171.201 rows=1 loops=1) |
| 205 | + -> Gather (cost=487334.65..487341.06 rows=64 width=8) (actual time=1169.676..1171.185 rows=65 loops=1) |
| 206 | + Workers Planned: 64 |
| 207 | + Workers Launched: 64 |
| 208 | + -> Partial Aggregate (cost=486334.65..486334.66 rows=1 width=8) (actual time=1110.487..1110.487 rows=1 loops=65) |
| 209 | + -> Parallel Hash Anti Join (cost=4776.56..482467.46 rows=1546876 width=0) (actual time=53.768..964.692 rows=1523077 loops=65) |
| 210 | + Hash Cond: (a.id = b.id) |
| 211 | + -> Parallel Seq Scan on a (cost=0.00..458103.01 rows=1562500 width=4) (actual time=0.023..288.519 rows=1538462 loops=65) |
| 212 | + -> Parallel Hash (cost=4581.25..4581.25 rows=15625 width=4) (actual time=35.322..35.322 rows=15385 loops=65) |
| 213 | + Buckets: 1048576 Batches: 1 Memory Usage: 48864kB |
| 214 | + -> Parallel Seq Scan on b (cost=0.00..4581.25 rows=15625 width=4) (actual time=0.022..2.010 rows=15385 loops=65) |
| 215 | + Planning time: 0.129 ms |
| 216 | + Execution time: 1259.454 ms |
| 217 | +(13 rows) |
| 218 | +``` |
| 219 | + |
| 220 | +## 小结 |
| 221 | +PostgreSQL的JOIN算法可圈可点,在版本11后,引入了parallel hash join,支持equijoin, semijoin, antijoin等各种关系计算。 |
| 222 | + |
| 223 | +性能杠杠的。 |
| 224 | + |
| 225 | +## 参考 |
| 226 | +https://en.wikipedia.org/wiki/Relational_algebra |
| 227 | + |
| 228 | +https://www.postgresql.org/message-id/flat/CAEepm=270ze2hVxWkJw-5eKzc3AB4C9KpH3L2kih75R5pdSogg@mail.gmail.com#CAEepm=270ze2hVxWkJw-5eKzc3AB4C9KpH3L2kih75R5pdSogg@mail.gmail.com |
| 229 | + |
| 230 | +http://blog.itpub.net/15480802/viewspace-703260/ |
| 231 | + |
| 232 | +[《PostgreSQL 11 preview - parallel hash (含hash JOIN , hash agg等) 性能极大提升》](../201802/20180201_01.md) |
| 233 | + |
| 234 | +[《PostgreSQL dblink异步调用实现 并行hash分片JOIN - 含数据交、并、差 提速案例》](../201802/20180201_02.md) |
| 235 | + |
0 commit comments