Skip to content

Commit fead182

Browse files
committed
new doc
1 parent 7c63714 commit fead182

File tree

8 files changed

+237
-0
lines changed

8 files changed

+237
-0
lines changed

201802/20180205_01.md

Lines changed: 235 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,235 @@
1+
## PostgreSQL 与关系代数 (Equi-Join , Semi-Join , Anti-Join , Division)
2+
3+
### 作者
4+
digoal
5+
6+
### 日期
7+
2018-02-05
8+
9+
### 标签
10+
PostgreSQL , 关系代数 , EquiJoin , SemiJoin , AntiJoin , Division
11+
12+
----
13+
14+
## 背景
15+
关系数据库中很多操作来自关系代数中的一些概念。例如常见的JOIN操作,下面是关系代数中的一些概念。
16+
17+
https://en.wikipedia.org/wiki/Relational_algebra
18+
19+
JOIN本身也分好多种比如EquiJoin , SemiJoin , AntiJoin , Division。
20+
21+
## EquiJoin
22+
这种JOIN最为常见。例如:
23+
24+
```
25+
select a.* from a join b on (a.xx = b.xx);
26+
```
27+
28+
实际上关系代数中为θ-join,包括(<, ≤, =, >, ≥),当使用=时,对应的就是equijoin.
29+
30+
只要操作符(JOIN条件)返回TRUE,就输出对应的JOIN记录。(也可以理解为笛卡尔乘积中,仅返回JOIN条件为TRUE的那些)
31+
32+
## SemiJoin
33+
返回在Employee中的记录,同时这条记录与Dept中的所有记录一对多操作时,有一个返回TRUE的操作即可。
34+
35+
例如
36+
37+
```
38+
select * from Employee where exists
39+
(select 1 from Dept where Employee.DeptName = Dept.DeptName); -- 现实中操作符可以随意替代,代表不同语义
40+
```
41+
42+
![pic](20180205_01_pic_001.jpg)
43+
44+
由于semiJoin的操作在EXISTS中只要有一条符合TRUE即可,所以很大概率下并不需要扫描全量Dept。
45+
46+
semiJOIN支持hash, merge, nestloop几种JOIN方法。
47+
48+
Employee很小,并且Dept有索引时,NESTLOOP就会比较快。
49+
50+
Employee很大时,使用hash就很快。
51+
52+
PostgreSQL 11在hash操作上有了极大的性能提升:
53+
54+
[《PostgreSQL 11 preview - parallel hash (含hash JOIN , hash agg等) 性能极大提升》](../201802/20180201_01.md)
55+
56+
[《PostgreSQL dblink异步调用实现 并行hash分片JOIN - 含数据交、并、差 提速案例》](../201802/20180201_02.md)
57+
58+
## AntiJoin
59+
AntiJoin与SemiJoin表达的意思有点相反,要求Employee中的每一条记录,与Dept中所有记录进行操作后,Dept中没有任何一条能满足。返回在Employee中的这样的记录。
60+
61+
例如
62+
63+
```
64+
select * from Employee where not exists
65+
(select 1 from Dept where Employee.DeptName = Dept.DeptName); -- 现实中操作符可以随意替代,代表不同语义
66+
```
67+
68+
![pic](20180205_01_pic_002.jpg)
69+
70+
AntiJoin要求Employee中每一条记录与Dept所有记录进行操作,并且所有操作都不满足条件,这条算作有效记录,返回该Employee的记录。
71+
72+
对于JOIN操作符为=号的,不管是semijoin还是antijoin,都可以用HASH join,达到非常好的加速效果。
73+
74+
## Division
75+
JOIN中的除法运算,没有对应的SQL,需要写多条SQL或者使用CTE语法写一条SQL来实现。
76+
77+
![pic](20180205_01_pic_003.jpg)
78+
79+
![pic](20180205_01_pic_004.jpg)
80+
81+
1、补齐
82+
83+
tmp1:
84+
85+
```
86+
select Student, Task from
87+
(
88+
select distinct Student from Completed
89+
) t1
90+
,
91+
(
92+
select Task from DBProject
93+
) t2;
94+
```
95+
96+
2、使用AntiJoin计算余数
97+
98+
tmp2:
99+
100+
```
101+
select Student from Completed where not exists
102+
(select 1 from tmp1 where tmp1.Student=Completed.Student and tmp1.Task=Completed.Task);
103+
```
104+
105+
3、去重,并使用except求差,得到最终结果
106+
107+
```
108+
select distinct Student from Completed
109+
except
110+
select Student from tmp2;
111+
```
112+
113+
![pic](20180205_01_pic_005.jpg)
114+
115+
### CTE实现Division
116+
117+
```
118+
with
119+
t1 as (select distinct Student as Student from Completed),
120+
tmp1 as (select Student, Task from t1, (select Task from DBProject) t2),
121+
tmp2 as (select Student from Completed where not exists
122+
(select 1 from tmp1 where tmp1.Student=Completed.Student and tmp1.Task=Completed.Task)
123+
)
124+
select Student from t1
125+
except
126+
select Student from tmp2;
127+
```
128+
129+
### 除法求余
130+
131+
outerjoin不再赘述。
132+
133+
## Paralle HASH JOIN (equijoin, semijoin, antijoin)性能指标
134+
PostgreSQL 11
135+
136+
64线程机器,使用HASH并行。
137+
138+
测试数据:
139+
140+
```
141+
postgres=# create table a(id int);
142+
CREATE TABLE
143+
144+
postgres=# create table b(id int);
145+
CREATE TABLE
146+
147+
postgres=# insert into a select generate_series(1,100000000);
148+
INSERT 0 100000000
149+
150+
postgres=# insert into b select generate_series(1,1000000);
151+
INSERT 0 1000000
152+
```
153+
154+
### 1 Equi-Join
155+
156+
```
157+
postgres=# explain analyze select count(*) from a join b using (id);
158+
QUERY PLAN
159+
------------------------------------------------------------------------------------------------------------------------------------------------
160+
Finalize Aggregate (cost=469787.02..469787.03 rows=1 width=8) (actual time=902.399..902.399 rows=1 loops=1)
161+
-> Gather (cost=469780.45..469786.86 rows=64 width=8) (actual time=901.209..902.383 rows=65 loops=1)
162+
Workers Planned: 64
163+
Workers Launched: 64
164+
-> Partial Aggregate (cost=468780.45..468780.46 rows=1 width=8) (actual time=843.689..843.690 rows=1 loops=65)
165+
-> Parallel Hash Join (cost=4776.56..468741.38 rows=15625 width=0) (actual time=38.430..842.222 rows=15385 loops=65)
166+
Hash Cond: (a.id = b.id)
167+
-> Parallel Seq Scan on a (cost=0.00..458103.01 rows=1562500 width=4) (actual time=0.023..296.974 rows=1538462 loops=65)
168+
-> Parallel Hash (cost=4581.25..4581.25 rows=15625 width=4) (actual time=36.133..36.133 rows=15385 loops=65)
169+
Buckets: 1048576 Batches: 1 Memory Usage: 48832kB
170+
-> Parallel Seq Scan on b (cost=0.00..4581.25 rows=15625 width=4) (actual time=0.022..2.093 rows=15385 loops=65)
171+
Planning time: 0.117 ms
172+
Execution time: 990.915 ms
173+
(13 rows)
174+
```
175+
176+
### 2 Semi-join
177+
178+
```
179+
postgres=# explain analyze select count(*) from a where exists (select 1 from b where a.id=b.id);
180+
QUERY PLAN
181+
------------------------------------------------------------------------------------------------------------------------------------------------
182+
Finalize Aggregate (cost=468200.59..468200.60 rows=1 width=8) (actual time=890.449..890.449 rows=1 loops=1)
183+
-> Gather (cost=468194.02..468200.43 rows=64 width=8) (actual time=889.040..890.434 rows=65 loops=1)
184+
Workers Planned: 64
185+
Workers Launched: 64
186+
-> Partial Aggregate (cost=467194.02..467194.03 rows=1 width=8) (actual time=831.249..831.249 rows=1 loops=65)
187+
-> Parallel Hash Semi Join (cost=4776.56..467154.96 rows=15625 width=0) (actual time=37.204..829.763 rows=15385 loops=65)
188+
Hash Cond: (a.id = b.id)
189+
-> Parallel Seq Scan on a (cost=0.00..458103.01 rows=1562500 width=4) (actual time=0.024..289.738 rows=1538462 loops=65)
190+
-> Parallel Hash (cost=4581.25..4581.25 rows=15625 width=4) (actual time=35.672..35.672 rows=15385 loops=65)
191+
Buckets: 1048576 Batches: 1 Memory Usage: 48896kB
192+
-> Parallel Seq Scan on b (cost=0.00..4581.25 rows=15625 width=4) (actual time=0.023..2.090 rows=15385 loops=65)
193+
Planning time: 0.132 ms
194+
Execution time: 980.261 ms
195+
(13 rows)
196+
```
197+
198+
### 3 Anti-Join
199+
200+
```
201+
postgres=# explain analyze select count(*) from a where not exists (select 1 from b where a.id=b.id);
202+
QUERY PLAN
203+
------------------------------------------------------------------------------------------------------------------------------------------------
204+
Finalize Aggregate (cost=487341.22..487341.23 rows=1 width=8) (actual time=1171.201..1171.201 rows=1 loops=1)
205+
-> Gather (cost=487334.65..487341.06 rows=64 width=8) (actual time=1169.676..1171.185 rows=65 loops=1)
206+
Workers Planned: 64
207+
Workers Launched: 64
208+
-> Partial Aggregate (cost=486334.65..486334.66 rows=1 width=8) (actual time=1110.487..1110.487 rows=1 loops=65)
209+
-> Parallel Hash Anti Join (cost=4776.56..482467.46 rows=1546876 width=0) (actual time=53.768..964.692 rows=1523077 loops=65)
210+
Hash Cond: (a.id = b.id)
211+
-> Parallel Seq Scan on a (cost=0.00..458103.01 rows=1562500 width=4) (actual time=0.023..288.519 rows=1538462 loops=65)
212+
-> Parallel Hash (cost=4581.25..4581.25 rows=15625 width=4) (actual time=35.322..35.322 rows=15385 loops=65)
213+
Buckets: 1048576 Batches: 1 Memory Usage: 48864kB
214+
-> Parallel Seq Scan on b (cost=0.00..4581.25 rows=15625 width=4) (actual time=0.022..2.010 rows=15385 loops=65)
215+
Planning time: 0.129 ms
216+
Execution time: 1259.454 ms
217+
(13 rows)
218+
```
219+
220+
## 小结
221+
PostgreSQL的JOIN算法可圈可点,在版本11后,引入了parallel hash join,支持equijoin, semijoin, antijoin等各种关系计算。
222+
223+
性能杠杠的。
224+
225+
## 参考
226+
https://en.wikipedia.org/wiki/Relational_algebra
227+
228+
https://www.postgresql.org/message-id/flat/CAEepm=270ze2hVxWkJw-5eKzc3AB4C9KpH3L2kih75R5pdSogg@mail.gmail.com#CAEepm=270ze2hVxWkJw-5eKzc3AB4C9KpH3L2kih75R5pdSogg@mail.gmail.com
229+
230+
http://blog.itpub.net/15480802/viewspace-703260/
231+
232+
[《PostgreSQL 11 preview - parallel hash (含hash JOIN , hash agg等) 性能极大提升》](../201802/20180201_01.md)
233+
234+
[《PostgreSQL dblink异步调用实现 并行hash分片JOIN - 含数据交、并、差 提速案例》](../201802/20180201_02.md)
235+

201802/20180205_01_pic_001.jpg

52.7 KB
Loading

201802/20180205_01_pic_002.jpg

51.3 KB
Loading

201802/20180205_01_pic_003.jpg

46.2 KB
Loading

201802/20180205_01_pic_004.jpg

65.1 KB
Loading

201802/20180205_01_pic_005.jpg

24.2 KB
Loading

201802/readme.md

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1,5 +1,6 @@
11
### 文章列表
22
----
3+
##### 20180205_01.md [《PostgreSQL 与关系代数 (Equi-Join , Semi-Join , Anti-Join , Division)》](20180205_01.md)
34
##### 20180204_03.md [《PostgreSQL 11 preview - Parallel Append (多表并行计算) sharding架构并行计算核心功能之一》](20180204_03.md)
45
##### 20180204_02.md [《PostgreSQL 11 preview - 新功能, 分区表全局索引管理》](20180204_02.md)
56
##### 20180204_01.md [《PostgreSQL 11 preview - 并行排序、并行索引 (性能线性暴增) 单实例100亿TOP-K仅40秒》](20180204_01.md)

README.md

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -29,6 +29,7 @@ digoal's|PostgreSQL|文章|归类
2929

3030
### 未归类文档如下
3131
----
32+
##### 201802/20180205_01.md [《PostgreSQL 与关系代数 (Equi-Join , Semi-Join , Anti-Join , Division)》](201802/20180205_01.md)
3233
##### 201802/20180204_03.md [《PostgreSQL 11 preview - Parallel Append (多表并行计算) sharding架构并行计算核心功能之一》](201802/20180204_03.md)
3334
##### 201802/20180204_02.md [《PostgreSQL 11 preview - 新功能, 分区表全局索引管理》](201802/20180204_02.md)
3435
##### 201802/20180204_01.md [《PostgreSQL 11 preview - 并行排序、并行索引 (性能线性暴增) 单实例100亿TOP-K仅40秒》](201802/20180204_01.md)

0 commit comments

Comments
 (0)