Commit c1351f6

digoal zhou authored and committed
new doc
1 parent a345ceb commit c1351f6

5 files changed

+271
-5
lines changed

202208/20220828_01.md

Lines changed: 13 additions & 5 deletions
@@ -149,9 +149,11 @@ D PRAGMA profiling_mode='detailed';
149149
D PRAGMA profile_output='tpch.profile';
150150
D .timer on
151151
152-
Redirect the output of the next command to my_results.txt
153-
D .once my_results.txt
152+
Redirect query output to my_results.txt
153+
D .output my_results.txt
154154
155+
Run the SQL
156+
D .read tpch.sql
155157
Run Time: real 0.020 user 0.083072 sys 0.000901
156158
Run Time: real 0.013 user 0.016175 sys 0.001734
157159
Run Time: real 0.017 user 0.021799 sys 0.004781
@@ -174,9 +176,6 @@ Run Time: real 0.011 user 0.040045 sys 0.000583
174176
Run Time: real 0.017 user 0.047979 sys 0.005695
175177
Run Time: real 0.035 user 0.086615 sys 0.030360
176178
Run Time: real 0.011 user 0.013999 sys 0.003183
177-
178-
Run the SQL
179-
D .read tpch.sql
180179
```
181180

182181
Query results:
@@ -185,6 +184,15 @@ D .read tpch.sql
185184

186185
profile_output:
187186
- This file is overwritten with each query that is issued. If you want to store the profile output for later it should be copied to a different file.
187+
188+
dbgen usage:
189+
```
190+
D select * from duckdb_functions() where function_name='dbgen';
191+
| schema_name | function_name | function_type | description | return_type | parameters | parameter_types | varargs | macro_definition | has_side_effects |
192+
|-------------|---------------|---------------|-------------|-------------|---------------------------------|-------------------------------------|---------|------------------|------------------|
193+
| main | dbgen | table | | | [suffix, schema, overwrite, sf] | [VARCHAR, VARCHAR, BOOLEAN, DOUBLE] | | | |
194+
Run Time (s): real 0.012 user 0.012088 sys 0.000242
195+
```
188196

189197

190198
#### [What features would you like PostgreSQL to add?](https://github.com/digoal/blog/issues/76 "269ac3d1c492e938c0191101c7238216")

202209/20220901_05.md

Lines changed: 256 additions & 0 deletions
@@ -0,0 +1,256 @@
1+
## DuckDB with external Parquet storage - TPC-H test - in_memory vs in_parquet
2+
3+
### Author
4+
digoal
5+
6+
### Date
7+
2022-09-01
8+
9+
### Tags
10+
PostgreSQL , DuckDB , parquet , tpch
11+
12+
----
13+
14+
## Background
15+
A TPC-H test against external storage (Parquet) at `sf = 10`.
16+
17+
For generating the data and the TPC-H SQL, see:
18+
- [DuckDB TPC-H, TPC-DS tests](../202208/20220828_01.md)
19+
20+
For why the Parquet format is worth testing, see:
21+
- [Can a DuckDB database hold more data than fits in memory? And the recommended usage - parquet](../202209/20220901_03.md)
22+
23+
Recommended DuckDB usage:
24+
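- Keep as much data as possible in Parquet files. What is memory for, then? Computation: hash tables, sorts, and so on. This way the amount of data DuckDB can manage is no longer bounded by memory.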
25+
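- DuckDB's own database file can bloat. Because the data lives in Parquet, once the file has bloated you can export the schema definitions, start a fresh database file, import the schema, and delete the old file.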
26+
- Parquet supports filter and projection pushdown, partitioning, and parallel scans, so queries are very fast. The Parquet files can even live on remote storage (s3, https, oss).
27+
28+
Start DuckDB with a persistent database file
29+
30+
```
31+
$ ./duckdb ./digoal.db.new
32+
v0.4.1-dev2371 3825e0ee7
33+
Enter ".help" for usage hints.
34+
```
35+
36+
Load the tpch extension and generate the data (dbgen cannot write external Parquet files directly, so the data has to pass through the database first).
37+
38+
```
39+
D install tpch;
40+
D load tpch;
41+
D copy (select query from tpch_queries()) to 'tpch.sql' with (quote '');
42+
D call dbgen(sf='10');
43+
```
44+
45+
Export to Parquet files
46+
47+
```
48+
D EXPORT DATABASE '/Users/digoal/duckdb/build/release/tpch_20220901' (FORMAT PARQUET);
49+
50+
D .quit
51+
52+
drwxr-xr-x 3 digoal staff 96B Sep 1 17:16 digoal.db.new.tmp
53+
-rw-r--r-- 1 digoal staff 5.1G Sep 1 17:18 digoal.db.new
54+
-rw-r--r-- 1 digoal staff 0B Sep 1 17:18 digoal.db.new.wal
55+
```
56+
57+
Data generation consumed a large amount of swap
58+
59+
```
60+
IT-C02YW2EFLVDL:release digoal$ sysctl vm.swapusage
61+
vm.swapusage: total = 12288.00M used = 11334.00M free = 954.00M (encrypted)
62+
```
63+
64+
Run TPC-H once against the current native format; after that the digoal.db.new database file can be deleted
65+
66+
```
67+
Configure profiling, output redirection, etc.
68+
D PRAGMA enable_profiling='QUERY_TREE_OPTIMIZER';
69+
D PRAGMA enable_optimizer;
70+
D PRAGMA explain_output='all';
71+
D PRAGMA profiling_mode='detailed';
72+
D PRAGMA profile_output='tpch.profile';
73+
D .timer on
74+
D .output my_results.txt
75+
76+
Run the SQL
77+
D .read tpch.sql
78+
79+
Run Time (s): real 0.741 user 5.706895 sys 0.014485
80+
Run Time (s): real 0.239 user 1.222556 sys 0.082958
81+
Run Time (s): real 0.436 user 3.036656 sys 0.060910
82+
Run Time (s): real 1.159 user 4.839926 sys 0.586248
83+
Run Time (s): real 0.421 user 3.028173 sys 0.029702
84+
Run Time (s): real 0.208 user 1.574725 sys 0.004798
85+
Run Time (s): real 1.505 user 9.384255 sys 0.556902
86+
Run Time (s): real 0.453 user 3.273303 sys 0.024077
87+
Run Time (s): real 2.329 user 16.336593 sys 0.262405
88+
Run Time (s): real 0.887 user 5.570960 sys 0.409520
89+
Run Time (s): real 0.115 user 0.440873 sys 0.023601
90+
Run Time (s): real 0.417 user 3.124678 sys 0.016045
91+
Run Time (s): real 1.107 user 7.185678 sys 0.273737
92+
Run Time (s): real 0.333 user 2.278771 sys 0.077510
93+
Run Time (s): real 1.063 user 8.115284 sys 0.072995
94+
Run Time (s): real 0.631 user 0.912368 sys 0.101576
95+
Run Time (s): real 2.208 user 15.257795 sys 0.900340
96+
Run Time (s): real 3.588 user 22.501919 sys 2.134568
97+
Run Time (s): real 0.760 user 5.589818 sys 0.082662
98+
Run Time (s): real 1.632 user 9.039679 sys 0.592879
99+
Run Time (s): real 3.296 user 14.681672 sys 2.855988
100+
Run Time (s): real 0.440 user 2.177325 sys 0.274837
101+
102+
rm -rf digoal.db.new*
103+
```
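
The `.timer on` lines above can be turned into per-query numbers with a few lines of Python. This is a minimal sketch, assuming the terminal output was saved as text and that the timer lines appear in query order (one line per TPC-H query); `parse_run_times` is a hypothetical helper name, not part of DuckDB.

```python
import re

# Matches the DuckDB CLI timer lines shown above, with or without "(s)", e.g.
# "Run Time (s): real 0.741 user 5.706895 sys 0.014485"
RUN_TIME = re.compile(
    r"Run Time(?: \(s\))?: real (?P<real>\d+\.\d+) "
    r"user (?P<user>\d+\.\d+) sys (?P<sys>\d+\.\d+)"
)


def parse_run_times(text):
    """Return [(query_id, real_seconds), ...], numbering queries from 1
    in the order their timer lines appear (an assumption about the log)."""
    matches = RUN_TIME.finditer(text)
    return [(i, float(m.group("real"))) for i, m in enumerate(matches, start=1)]


sample = """\
Run Time (s): real 0.741 user 5.706895 sys 0.014485
Run Time (s): real 0.239 user 1.222556 sys 0.082958
Run Time (s): real 0.436 user 3.036656 sys 0.060910
"""

for query_id, seconds in parse_run_times(sample):
    print(f"{query_id} | {seconds:.3f}")
```

Only the `real` (wall-clock) value is kept, since that is the figure used in the comparison table.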
104+
105+
106+
Go to the Parquet export directory and list the files
107+
108+
```
109+
cd /Users/digoal/duckdb/build/release/tpch_20220901
110+
111+
IT-C02YW2EFLVDL:tpch_20220901 digoal$ ll
112+
total 15856720
113+
-rw-r--r-- 1 digoal staff 1.0K Sep 1 17:04 region.parquet
114+
-rw-r--r-- 1 digoal staff 2.3K Sep 1 17:04 nation.parquet
115+
-rw-r--r-- 1 digoal staff 15M Sep 1 17:04 supplier.parquet
116+
-rw-r--r-- 1 digoal staff 44M Sep 1 17:04 tbl.parquet
117+
-rw-r--r-- 1 digoal staff 133M Sep 1 17:04 part.parquet
118+
-rw-r--r-- 1 digoal staff 241M Sep 1 17:04 customer.parquet
119+
-rw-r--r-- 1 digoal staff 865M Sep 1 17:04 partsupp.parquet
120+
-rw-r--r-- 1 digoal staff 1.2G Sep 1 17:05 orders.parquet
121+
-rw-r--r-- 1 digoal staff 5.0G Sep 1 17:07 lineitem.parquet
122+
-rw-r--r-- 1 digoal staff 2.1K Sep 1 17:07 schema.sql
123+
drwxr-xr-x 13 digoal staff 416B Sep 1 17:07 .
124+
-rw-r--r-- 1 digoal staff 933B Sep 1 17:07 load.sql
125+
drwxr-xr-x 23 digoal staff 736B Sep 1 17:07 ..
126+
127+
IT-C02YW2EFLVDL:tpch_20220901 digoal$ du -sh
128+
3.8G
129+
130+
Parquet compresses well (5.1G -> 3.8G)
131+
```
132+
133+
Start a brand-new database
134+
135+
```
136+
$ ./duckdb ./digoal.db.parquet
137+
138+
The memory limit defaults to 80% of physical RAM. It caps the memory available during execution, e.g. for sort, hash agg, group agg.
139+
140+
D select * from duckdb_settings() where name like '%memory%';
141+
┌──────────────┬────────┬─────────────────────────────────────────────┬────────────┐
142+
│ name │ value │ description │ input_type │
143+
├──────────────┼────────┼─────────────────────────────────────────────┼────────────┤
144+
│ max_memory │ 13.7GB │ The maximum memory of the system (e.g. 1GB) │ VARCHAR │
145+
│ memory_limit │ 13.7GB │ The maximum memory of the system (e.g. 1GB) │ VARCHAR │
146+
└──────────────┴────────┴─────────────────────────────────────────────┴────────────┘
147+
```
148+
149+
Create views
150+
151+
```
152+
CREATE VIEW lineitem AS SELECT * FROM read_parquet('/Users/digoal/duckdb/build/release/tpch_20220901/lineitem.parquet');
153+
154+
CREATE VIEW orders AS SELECT * FROM read_parquet('/Users/digoal/duckdb/build/release/tpch_20220901/orders.parquet');
155+
156+
CREATE VIEW partsupp AS SELECT * FROM read_parquet('/Users/digoal/duckdb/build/release/tpch_20220901/partsupp.parquet');
157+
158+
CREATE VIEW part AS SELECT * FROM read_parquet('/Users/digoal/duckdb/build/release/tpch_20220901/part.parquet');
159+
160+
CREATE VIEW customer AS SELECT * FROM read_parquet('/Users/digoal/duckdb/build/release/tpch_20220901/customer.parquet');
161+
162+
CREATE VIEW supplier AS SELECT * FROM read_parquet('/Users/digoal/duckdb/build/release/tpch_20220901/supplier.parquet');
163+
164+
CREATE VIEW nation AS SELECT * FROM read_parquet('/Users/digoal/duckdb/build/release/tpch_20220901/nation.parquet');
165+
166+
CREATE VIEW region AS SELECT * FROM read_parquet('/Users/digoal/duckdb/build/release/tpch_20220901/region.parquet');
167+
```
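
The eight statements above follow one pattern, so they can also be generated. A minimal sketch, assuming the export directory and table names shown above and plain identifiers that need no quoting; `create_view_sql` is a hypothetical helper, and the generated script would still be fed to the DuckDB CLI (for example with `.read`).

```python
from pathlib import PurePosixPath

# Export directory and table names are taken from the listing above.
EXPORT_DIR = PurePosixPath("/Users/digoal/duckdb/build/release/tpch_20220901")
TABLES = ["lineitem", "orders", "partsupp", "part",
          "customer", "supplier", "nation", "region"]


def create_view_sql(table, export_dir=EXPORT_DIR):
    """Build one CREATE VIEW statement over <export_dir>/<table>.parquet.

    No identifier or path escaping is done; this assumes simple names.
    """
    path = export_dir / f"{table}.parquet"
    return f"CREATE VIEW {table} AS SELECT * FROM read_parquet('{path}');"


script = "\n".join(create_view_sql(t) for t in TABLES)
print(script)
```

Because the view names match the original table names, the TPC-H queries in tpch.sql run unchanged against the Parquet files.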
168+
169+
170+
Run the TPC-H queries
171+
172+
```
173+
Configure profiling, output redirection, etc.
174+
D PRAGMA enable_profiling='QUERY_TREE_OPTIMIZER';
175+
D PRAGMA enable_optimizer;
176+
D PRAGMA explain_output='all';
177+
D PRAGMA profiling_mode='detailed';
178+
D PRAGMA profile_output='tpch.profile';
179+
D .timer on
180+
D .output my_results.txt
181+
182+
Run the SQL
183+
D .read tpch.sql
184+
185+
Run Time (s): real 1.627 user 11.850405 sys 0.405159
186+
Run Time (s): real 0.362 user 1.968710 sys 0.179403
187+
Run Time (s): real 1.079 user 7.420519 sys 0.338523
188+
Run Time (s): real 1.502 user 7.568278 sys 0.740135
189+
Run Time (s): real 0.907 user 6.269927 sys 0.301918
190+
Run Time (s): real 0.917 user 6.665746 sys 0.223548
191+
Run Time (s): real 2.574 user 15.294140 sys 1.005330
192+
Run Time (s): real 1.102 user 7.656021 sys 0.388960
193+
Run Time (s): real 3.488 user 25.737969 sys 0.803692
194+
Run Time (s): real 1.516 user 9.844495 sys 0.756891
195+
Run Time (s): real 0.189 user 0.970978 sys 0.092331
196+
Run Time (s): real 1.186 user 8.812947 sys 0.273467
197+
Run Time (s): real 1.524 user 10.517162 sys 0.541969
198+
Run Time (s): real 0.800 user 5.416071 sys 0.341637
199+
Run Time (s): real 1.970 user 14.342622 sys 0.545280
200+
Run Time (s): real 0.675 user 1.144836 sys 0.130184
201+
Run Time (s): real 2.663 user 18.270699 sys 1.138774
202+
Run Time (s): real 4.069 user 26.420741 sys 2.365622
203+
Run Time (s): real 1.528 user 10.837740 sys 0.425501
204+
Run Time (s): real 2.025 user 11.759343 sys 0.904860
205+
Run Time (s): real 4.324 user 23.275389 sys 3.390268
206+
Run Time (s): real 0.454 user 2.482367 sys 0.256567
207+
```
208+
209+
TPC-H performance: in-memory tables vs Parquet-file tables
210+
211+
![pic](20220901_05_pic_001.jpg)
212+
213+
query_id, SF=10 | in_memory(s) | in_parquet(s)
214+
---|---|---
215+
1 | 0.741 | 1.627
216+
2 | 0.239 | 0.362
217+
3 | 0.436 | 1.079
218+
4 | 1.159 | 1.502
219+
5 | 0.421 | 0.907
220+
6 | 0.208 | 0.917
221+
7 | 1.505 | 2.574
222+
8 | 0.453 | 1.102
223+
9 | 2.329 | 3.488
224+
10 | 0.887 | 1.516
225+
11 | 0.115 | 0.189
226+
12 | 0.417 | 1.186
227+
13 | 1.107 | 1.524
228+
14 | 0.333 | 0.800
229+
15 | 1.063 | 1.970
230+
16 | 0.631 | 0.675
231+
17 | 2.208 | 2.663
232+
18 | 3.588 | 4.069
233+
19 | 0.760 | 1.528
234+
20 | 1.632 | 2.025
235+
21 | 3.296 | 4.324
236+
22 | 0.440 | 0.454
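
To read the table as ratios rather than raw seconds, the per-query slowdown and the totals can be computed directly from the figures above. A minimal sketch: the numbers are copied from the table, `slowdowns` is a hypothetical helper, and each figure comes from a single run, so small differences should not be over-interpreted.

```python
# (in_memory seconds, in_parquet seconds) for queries 1..22, copied from the table.
TIMINGS = [
    (0.741, 1.627), (0.239, 0.362), (0.436, 1.079), (1.159, 1.502),
    (0.421, 0.907), (0.208, 0.917), (1.505, 2.574), (0.453, 1.102),
    (2.329, 3.488), (0.887, 1.516), (0.115, 0.189), (0.417, 1.186),
    (1.107, 1.524), (0.333, 0.800), (1.063, 1.970), (0.631, 0.675),
    (2.208, 2.663), (3.588, 4.069), (0.760, 1.528), (1.632, 2.025),
    (3.296, 4.324), (0.440, 0.454),
]


def slowdowns(timings):
    """Return {query_id: in_parquet / in_memory}, numbering queries from 1."""
    return {q: pq / mem for q, (mem, pq) in enumerate(timings, start=1)}


ratios = slowdowns(TIMINGS)
total_mem = sum(mem for mem, _ in TIMINGS)
total_pq = sum(pq for _, pq in TIMINGS)
worst = max(ratios, key=ratios.get)  # query with the largest slowdown
best = min(ratios, key=ratios.get)   # query with the smallest slowdown

print(f"total in_memory : {total_mem:.3f} s")
print(f"total in_parquet: {total_pq:.3f} s")
print(f"overall ratio   : {total_pq / total_mem:.2f}x")
print(f"largest ratio   : Q{worst} ({ratios[worst]:.2f}x)")
print(f"smallest ratio  : Q{best} ({ratios[best]:.2f}x)")
```

On these figures the 22 queries take 23.968 s in total in memory versus 36.481 s on Parquet, about 1.52x overall. Query 6 shows the largest ratio (about 4.41x) and query 22 the smallest (about 1.03x).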
237+
238+
## References
239+
https://duckdb.org/docs/sql/pragmas
240+
241+
242+
243+
#### [What features would you like PostgreSQL to add?](https://github.com/digoal/blog/issues/76 "269ac3d1c492e938c0191101c7238216")
244+
245+
246+
#### [PolarDB for PostgreSQL: cloud-native distributed open-source database](https://github.com/ApsaraDB/PolarDB-for-PostgreSQL "57258f76c37864c6e6d23383d05714ea")
247+
248+
249+
#### [PostgreSQL solutions collection](https://yq.aliyun.com/topic/118 "40cff096e9ed7122c512b35d8561d9c8")
250+
251+
252+
#### [Dege / digoal's github - public service is a lifelong cause.](https://github.com/digoal/blog/blob/master/README.md "22709685feb7cab07d30f30387f0a9ae")
253+
254+
255+
![digoal's wechat](../pic/digoal_weixin.jpg "f7ad92eeba24523fd47a6e1a0e691b59")
256+

202209/20220901_05_pic_001.jpg

94.4 KB

202209/readme.md

Lines changed: 1 addition & 0 deletions
@@ -2,6 +2,7 @@
22

33
### Article list
44
----
5+
##### 20220901_05.md [DuckDB with external Parquet storage - TPC-H test - in_memory vs in_parquet](20220901_05.md)
56
##### 20220901_04.md [Where to find DuckDB's complete list of PRAGMAs, settings, system tables, system views, built-in functions, and built-in types?](20220901_04.md)
67
##### 20220901_03.md [Can a DuckDB database hold more data than fits in memory? And the recommended usage - parquet](20220901_03.md)
78
##### 20220901_02.md [Building and installing the latest DuckDB on macOS](20220901_02.md)

README.md

Lines changed: 1 addition & 0 deletions
@@ -93,6 +93,7 @@ digoal's|PostgreSQL|articles|categories
9393

9494
### All documents
9595
----
96+
##### 202209/20220901_05.md [DuckDB with external Parquet storage - TPC-H test - in_memory vs in_parquet](202209/20220901_05.md)
9697
##### 202209/20220901_04.md [Where to find DuckDB's complete list of PRAGMAs, settings, system tables, system views, built-in functions, and built-in types?](202209/20220901_04.md)
9798
##### 202209/20220901_03.md [Can a DuckDB database hold more data than fits in memory? And the recommended usage - parquet](202209/20220901_03.md)
9899
##### 202209/20220901_02.md [Building and installing the latest DuckDB on macOS](202209/20220901_02.md)
