| 1 | +## DuckDB with external Parquet storage: TPC-H test, in_memory vs. in_parquet |
| 2 | + |
| 3 | +### Author |
| 4 | +digoal |
| 5 | + |
| 6 | +### Date |
| 7 | +2022-09-01 |
| 8 | + |
| 9 | +### Tags |
| 10 | +PostgreSQL , DuckDB , parquet , tpch |
| 11 | + |
| 12 | +---- |
| 13 | + |
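| 14 | +## Background |
| 15 | +TPC-H test using external storage (Parquet), with `sf = 10`. |
| 16 | + |
| 17 | +For data generation and for generating the TPC-H SQL, see: |
| 18 | +- [DuckDB TPC-H and TPC-DS tests](../202208/20220828_01.md) |
| 19 | + |
| 20 | +For why the Parquet format is worth testing, see: |
| 21 | +- [Can a DuckDB database hold more data than fits in memory? Recommended usage: Parquet](../202209/20220901_03.md) |
| 22 | + |
| 23 | +Recommended DuckDB usage: |
| 24 | +- Keep as much data as possible in Parquet files. Memory is then needed only for computation (hash tables, sorts, and so on), so the amount of data DuckDB can manage is no longer bounded by RAM. |
| 25 | +- DuckDB's own database file tends to bloat. Because the data lives in Parquet, once the file has bloated you can export the schema definitions, start a fresh database file, import the schema, and delete the old file. |
| 26 | +- Parquet supports filter and projection pushdown, partitioning, and parallel scans, so queries are fast. Remote Parquet storage (S3, HTTPS, OSS) can be used as well. |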
| 27 | + |
| 28 | +Start DuckDB with a persistent database file: |
| 29 | + |
| 30 | +``` |
| 31 | +$ ./duckdb ./digoal.db.new |
| 32 | +v0.4.1-dev2371 3825e0ee7 |
| 33 | +Enter ".help" for usage hints. |
| 34 | +``` |
| 35 | + |
| 36 | +Load the tpch extension and generate the data. (dbgen cannot write external Parquet files directly, so the data has to pass through the database first.) |
| 37 | + |
| 38 | +``` |
| 39 | +D install tpch; |
| 40 | +D load tpch; |
| 41 | +D copy (select query from tpch_queries()) to 'tpch.sql' with (quote ''); |
| 42 | +D call dbgen(sf='10'); |
| 43 | +``` |
| 44 | + |
| 45 | +Export the database as Parquet files: |
| 46 | + |
| 47 | +``` |
| 48 | +D EXPORT DATABASE '/Users/digoal/duckdb/build/release/tpch_20220901' (FORMAT PARQUET); |
| 49 | + |
| 50 | +D .quit |
| 51 | + |
| 52 | +drwxr-xr-x 3 digoal staff 96B Sep 1 17:16 digoal.db.new.tmp |
| 53 | +-rw-r--r-- 1 digoal staff 5.1G Sep 1 17:18 digoal.db.new |
| 54 | +-rw-r--r-- 1 digoal staff 0B Sep 1 17:18 digoal.db.new.wal |
| 55 | +``` |
| 56 | + |
| 57 | +Data generation used a large amount of swap: |
| 58 | + |
| 59 | +``` |
| 60 | +IT-C02YW2EFLVDL:release digoal$ sysctl vm.swapusage |
| 61 | +vm.swapusage: total = 12288.00M used = 11334.00M free = 954.00M (encrypted) |
| 62 | +``` |
| 63 | + |
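| 64 | +Run TPC-H against the data in its current format (native DuckDB tables) to get a baseline. After that, the digoal.db.new database file can be deleted. |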
| 65 | + |
| 66 | +``` |
| 67 | +-- Configure profiling, output redirection, etc. |
| 68 | +D PRAGMA enable_profiling='QUERY_TREE_OPTIMIZER'; |
| 69 | +D PRAGMA enable_optimizer; |
| 70 | +D PRAGMA explain_output='all'; |
| 71 | +D PRAGMA profiling_mode='detailed'; |
| 72 | +D PRAGMA profile_output='tpch.profile'; |
| 73 | +D .timer on |
| 74 | +D .output my_results.txt |
| 75 | + |
| 76 | +-- Run the queries |
| 77 | +D .read tpch.sql |
| 78 | + |
| 79 | +Run Time (s): real 0.741 user 5.706895 sys 0.014485 |
| 80 | +Run Time (s): real 0.239 user 1.222556 sys 0.082958 |
| 81 | +Run Time (s): real 0.436 user 3.036656 sys 0.060910 |
| 82 | +Run Time (s): real 1.159 user 4.839926 sys 0.586248 |
| 83 | +Run Time (s): real 0.421 user 3.028173 sys 0.029702 |
| 84 | +Run Time (s): real 0.208 user 1.574725 sys 0.004798 |
| 85 | +Run Time (s): real 1.505 user 9.384255 sys 0.556902 |
| 86 | +Run Time (s): real 0.453 user 3.273303 sys 0.024077 |
| 87 | +Run Time (s): real 2.329 user 16.336593 sys 0.262405 |
| 88 | +Run Time (s): real 0.887 user 5.570960 sys 0.409520 |
| 89 | +Run Time (s): real 0.115 user 0.440873 sys 0.023601 |
| 90 | +Run Time (s): real 0.417 user 3.124678 sys 0.016045 |
| 91 | +Run Time (s): real 1.107 user 7.185678 sys 0.273737 |
| 92 | +Run Time (s): real 0.333 user 2.278771 sys 0.077510 |
| 93 | +Run Time (s): real 1.063 user 8.115284 sys 0.072995 |
| 94 | +Run Time (s): real 0.631 user 0.912368 sys 0.101576 |
| 95 | +Run Time (s): real 2.208 user 15.257795 sys 0.900340 |
| 96 | +Run Time (s): real 3.588 user 22.501919 sys 2.134568 |
| 97 | +Run Time (s): real 0.760 user 5.589818 sys 0.082662 |
| 98 | +Run Time (s): real 1.632 user 9.039679 sys 0.592879 |
| 99 | +Run Time (s): real 3.296 user 14.681672 sys 2.855988 |
| 100 | +Run Time (s): real 0.440 user 2.177325 sys 0.274837 |
| 101 | + |
| 102 | +rm -rf digoal.db.new* |
| 103 | +``` |
| 104 | + |
| 105 | + |
| 106 | +Go to the Parquet export directory and list the files: |
| 107 | + |
| 108 | +``` |
| 109 | +cd /Users/digoal/duckdb/build/release/tpch_20220901 |
| 110 | + |
| 111 | +IT-C02YW2EFLVDL:tpch_20220901 digoal$ ll |
| 112 | +total 15856720 |
| 113 | +-rw-r--r-- 1 digoal staff 1.0K Sep 1 17:04 region.parquet |
| 114 | +-rw-r--r-- 1 digoal staff 2.3K Sep 1 17:04 nation.parquet |
| 115 | +-rw-r--r-- 1 digoal staff 15M Sep 1 17:04 supplier.parquet |
| 116 | +-rw-r--r-- 1 digoal staff 44M Sep 1 17:04 tbl.parquet |
| 117 | +-rw-r--r-- 1 digoal staff 133M Sep 1 17:04 part.parquet |
| 118 | +-rw-r--r-- 1 digoal staff 241M Sep 1 17:04 customer.parquet |
| 119 | +-rw-r--r-- 1 digoal staff 865M Sep 1 17:04 partsupp.parquet |
| 120 | +-rw-r--r-- 1 digoal staff 1.2G Sep 1 17:05 orders.parquet |
| 121 | +-rw-r--r-- 1 digoal staff 5.0G Sep 1 17:07 lineitem.parquet |
| 122 | +-rw-r--r-- 1 digoal staff 2.1K Sep 1 17:07 schema.sql |
| 123 | +drwxr-xr-x 13 digoal staff 416B Sep 1 17:07 . |
| 124 | +-rw-r--r-- 1 digoal staff 933B Sep 1 17:07 load.sql |
| 125 | +drwxr-xr-x 23 digoal staff 736B Sep 1 17:07 .. |
| 126 | + |
| 127 | +IT-C02YW2EFLVDL:tpch_20220901 digoal$ du -sh |
| 128 | +3.8G |
| 129 | + |
| 130 | +Parquet compresses well (5.1G -> 3.8G) |
| 131 | +``` |
| 132 | + |
| 133 | +Start a brand-new database: |
| 134 | + |
| 135 | +``` |
| 136 | +$ ./duckdb ./digoal.db.parquet |
| 137 | + |
| 138 | +-- The default memory limit is 80% of physical RAM. It caps the memory that query execution may use, e.g. for sorts, hash aggregates, and group aggregates. |
| 139 | + |
| 140 | +D select * from duckdb_settings() where name like '%memory%'; |
| 141 | +┌──────────────┬────────┬─────────────────────────────────────────────┬────────────┐ |
| 142 | +│ name │ value │ description │ input_type │ |
| 143 | +├──────────────┼────────┼─────────────────────────────────────────────┼────────────┤ |
| 144 | +│ max_memory │ 13.7GB │ The maximum memory of the system (e.g. 1GB) │ VARCHAR │ |
| 145 | +│ memory_limit │ 13.7GB │ The maximum memory of the system (e.g. 1GB) │ VARCHAR │ |
| 146 | +└──────────────┴────────┴─────────────────────────────────────────────┴────────────┘ |
| 147 | +``` |
| 148 | + |
| 149 | +Create one view per Parquet file: |
| 150 | + |
| 151 | +``` |
| 152 | +CREATE VIEW lineitem AS SELECT * FROM read_parquet('/Users/digoal/duckdb/build/release/tpch_20220901/lineitem.parquet'); |
| 153 | + |
| 154 | +CREATE VIEW orders AS SELECT * FROM read_parquet('/Users/digoal/duckdb/build/release/tpch_20220901/orders.parquet'); |
| 155 | + |
| 156 | +CREATE VIEW partsupp AS SELECT * FROM read_parquet('/Users/digoal/duckdb/build/release/tpch_20220901/partsupp.parquet'); |
| 157 | + |
| 158 | +CREATE VIEW part AS SELECT * FROM read_parquet('/Users/digoal/duckdb/build/release/tpch_20220901/part.parquet'); |
| 159 | + |
| 160 | +CREATE VIEW customer AS SELECT * FROM read_parquet('/Users/digoal/duckdb/build/release/tpch_20220901/customer.parquet'); |
| 161 | + |
| 162 | +CREATE VIEW supplier AS SELECT * FROM read_parquet('/Users/digoal/duckdb/build/release/tpch_20220901/supplier.parquet'); |
| 163 | + |
| 164 | +CREATE VIEW nation AS SELECT * FROM read_parquet('/Users/digoal/duckdb/build/release/tpch_20220901/nation.parquet'); |
| 165 | + |
| 166 | +CREATE VIEW region AS SELECT * FROM read_parquet('/Users/digoal/duckdb/build/release/tpch_20220901/region.parquet'); |
| 167 | +``` |
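The eight `CREATE VIEW` statements above differ only in the table name, so they can be generated rather than typed. The following is a minimal sketch, not part of the original test: the helper name `create_view_statements` and the use of Python are assumptions, while the table names and export directory are the ones shown above.

```python
from pathlib import PurePosixPath

# The eight TPC-H tables exported above; EXPORT DATABASE wrote one
# <table>.parquet file per table.
TPCH_TABLES = [
    "lineitem", "orders", "partsupp", "part",
    "customer", "supplier", "nation", "region",
]


def create_view_statements(export_dir: str) -> list[str]:
    """Return one CREATE VIEW statement per TPC-H table (hypothetical helper)."""
    base = PurePosixPath(export_dir)
    statements = []
    for table in TPCH_TABLES:
        parquet_path = base / f"{table}.parquet"
        statements.append(
            f"CREATE VIEW {table} AS SELECT * FROM read_parquet('{parquet_path}');"
        )
    return statements


if __name__ == "__main__":
    export_dir = "/Users/digoal/duckdb/build/release/tpch_20220901"
    # Print the statements so they can be pasted into the DuckDB CLI
    # or saved to a file and loaded with .read
    print("\n".join(create_view_statements(export_dir)))
```

Saving the printed statements to a file and loading it with `.read` in the new database would recreate all views in one step.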
| 168 | + |
| 169 | + |
| 170 | +Run the TPC-H queries: |
| 171 | + |
| 172 | +``` |
| 173 | +-- Configure profiling, output redirection, etc. |
| 174 | +D PRAGMA enable_profiling='QUERY_TREE_OPTIMIZER'; |
| 175 | +D PRAGMA enable_optimizer; |
| 176 | +D PRAGMA explain_output='all'; |
| 177 | +D PRAGMA profiling_mode='detailed'; |
| 178 | +D PRAGMA profile_output='tpch.profile'; |
| 179 | +D .timer on |
| 180 | +D .output my_results.txt |
| 181 | + |
| 182 | +-- Run the queries |
| 183 | +D .read tpch.sql |
| 184 | + |
| 185 | +Run Time (s): real 1.627 user 11.850405 sys 0.405159 |
| 186 | +Run Time (s): real 0.362 user 1.968710 sys 0.179403 |
| 187 | +Run Time (s): real 1.079 user 7.420519 sys 0.338523 |
| 188 | +Run Time (s): real 1.502 user 7.568278 sys 0.740135 |
| 189 | +Run Time (s): real 0.907 user 6.269927 sys 0.301918 |
| 190 | +Run Time (s): real 0.917 user 6.665746 sys 0.223548 |
| 191 | +Run Time (s): real 2.574 user 15.294140 sys 1.005330 |
| 192 | +Run Time (s): real 1.102 user 7.656021 sys 0.388960 |
| 193 | +Run Time (s): real 3.488 user 25.737969 sys 0.803692 |
| 194 | +Run Time (s): real 1.516 user 9.844495 sys 0.756891 |
| 195 | +Run Time (s): real 0.189 user 0.970978 sys 0.092331 |
| 196 | +Run Time (s): real 1.186 user 8.812947 sys 0.273467 |
| 197 | +Run Time (s): real 1.524 user 10.517162 sys 0.541969 |
| 198 | +Run Time (s): real 0.800 user 5.416071 sys 0.341637 |
| 199 | +Run Time (s): real 1.970 user 14.342622 sys 0.545280 |
| 200 | +Run Time (s): real 0.675 user 1.144836 sys 0.130184 |
| 201 | +Run Time (s): real 2.663 user 18.270699 sys 1.138774 |
| 202 | +Run Time (s): real 4.069 user 26.420741 sys 2.365622 |
| 203 | +Run Time (s): real 1.528 user 10.837740 sys 0.425501 |
| 204 | +Run Time (s): real 2.025 user 11.759343 sys 0.904860 |
| 205 | +Run Time (s): real 4.324 user 23.275389 sys 3.390268 |
| 206 | +Run Time (s): real 0.454 user 2.482367 sys 0.256567 |
| 207 | +``` |
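The `.timer on` output prints one `Run Time (s)` line per query, in query order, so the wall-clock figures can be pulled out mechanically. The sketch below assumes the terminal output has been captured as text; the function name `parse_real_times` is hypothetical, and only the `real` value is kept.

```python
import re

# Matches lines such as:
#   Run Time (s): real 0.741 user 5.706895 sys 0.014485
RUN_TIME = re.compile(
    r"Run Time \(s\): real (\d+\.\d+) user (\d+\.\d+) sys (\d+\.\d+)"
)


def parse_real_times(cli_output: str) -> list[float]:
    """Return the wall-clock ('real') seconds of every timer line, in order."""
    return [float(match.group(1)) for match in RUN_TIME.finditer(cli_output)]


if __name__ == "__main__":
    # First three timer lines of the Parquet run shown above.
    sample = """\
Run Time (s): real 1.627 user 11.850405 sys 0.405159
Run Time (s): real 0.362 user 1.968710 sys 0.179403
Run Time (s): real 1.079 user 7.420519 sys 0.338523
"""
    for query_id, seconds in enumerate(parse_real_times(sample), start=1):
        print(f"Q{query_id}: {seconds:.3f} s")
```

Because the pattern is searched for anywhere in the text, surrounding prompt or log prefixes on each line do not need to be stripped first.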
| 208 | + |
| 209 | +TPC-H performance compared: in-memory tables (the first run, on native DuckDB tables) vs. Parquet-backed views |
| 210 | + |
| 211 | + |
| 212 | + |
| 213 | +Query (SF=10) | in_memory (s) | in_parquet (s) |
| 214 | +---|---|--- |
| 215 | +1 | 0.741 | 1.627 |
| 216 | +2 | 0.239 | 0.362 |
| 217 | +3 | 0.436 | 1.079 |
| 218 | +4 | 1.159 | 1.502 |
| 219 | +5 | 0.421 | 0.907 |
| 220 | +6 | 0.208 | 0.917 |
| 221 | +7 | 1.505 | 2.574 |
| 222 | +8 | 0.453 | 1.102 |
| 223 | +9 | 2.329 | 3.488 |
| 224 | +10 | 0.887 | 1.516 |
| 225 | +11 | 0.115 | 0.189 |
| 226 | +12 | 0.417 | 1.186 |
| 227 | +13 | 1.107 | 1.524 |
| 228 | +14 | 0.333 | 0.800 |
| 229 | +15 | 1.063 | 1.970 |
| 230 | +16 | 0.631 | 0.675 |
| 231 | +17 | 2.208 | 2.663 |
| 232 | +18 | 3.588 | 4.069 |
| 233 | +19 | 0.760 | 1.528 |
| 234 | +20 | 1.632 | 2.025 |
| 235 | +21 | 3.296 | 4.324 |
| 236 | +22 | 0.440 | 0.454 |
| 237 | + |
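To read the table at a glance, it helps to reduce it to totals and per-query slowdown ratios. The sketch below hard-codes the `real` timings from the two runs above; the function name `summarize` and the choice of ratio (in_parquet divided by in_memory) are illustrative assumptions, not part of the original test.

```python
# Wall-clock seconds for TPC-H Q1..Q22 at SF=10, copied from the table above.
IN_MEMORY = [
    0.741, 0.239, 0.436, 1.159, 0.421, 0.208, 1.505, 0.453, 2.329, 0.887, 0.115,
    0.417, 1.107, 0.333, 1.063, 0.631, 2.208, 3.588, 0.760, 1.632, 3.296, 0.440,
]
IN_PARQUET = [
    1.627, 0.362, 1.079, 1.502, 0.907, 0.917, 2.574, 1.102, 3.488, 1.516, 0.189,
    1.186, 1.524, 0.800, 1.970, 0.675, 2.663, 4.069, 1.528, 2.025, 4.324, 0.454,
]


def summarize(in_memory: list[float], in_parquet: list[float]) -> dict:
    """Return totals plus the queries with the largest and smallest slowdown."""
    # Map query id (1-based) to in_parquet / in_memory.
    ratios = {
        query_id: parquet / memory
        for query_id, (memory, parquet) in enumerate(zip(in_memory, in_parquet), start=1)
    }
    return {
        "total_in_memory": round(sum(in_memory), 3),
        "total_in_parquet": round(sum(in_parquet), 3),
        "overall_ratio": sum(in_parquet) / sum(in_memory),
        "worst_query": max(ratios, key=ratios.get),
        "best_query": min(ratios, key=ratios.get),
        "ratios": ratios,
    }


if __name__ == "__main__":
    summary = summarize(IN_MEMORY, IN_PARQUET)
    print(f"total in_memory : {summary['total_in_memory']:.3f} s")
    print(f"total in_parquet: {summary['total_in_parquet']:.3f} s")
    print(f"overall slowdown: {summary['overall_ratio']:.2f}x")
    print(f"largest slowdown: Q{summary['worst_query']}")
    print(f"smallest slowdown: Q{summary['best_query']}")
```

On these numbers the 22 queries take 23.968 s in total on native tables and 36.481 s on Parquet, roughly a 1.5x slowdown overall. Q6 is affected most (about 4.4x) and Q22 least (about 1.03x).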
| 238 | +## References |
| 239 | +https://duckdb.org/docs/sql/pragmas |