|
| 1 | +## PostgreSQL sharding有序UUID最佳实践 - serial global uuid stored in 64bit int8 |
| 2 | + |
| 3 | +### 作者 |
| 4 | +digoal |
| 5 | + |
| 6 | +### 日期 |
| 7 | +2013-01-22 |
| 8 | + |
| 9 | +### 标签 |
| 10 | +PostgreSQL , uuid , 有序uuid , sharding |
| 11 | + |
| 12 | +---- |
| 13 | + |
| 14 | +## 背景 |
| 15 | +Instagram 使用PostgreSQL数据库, 2012year中国PostgreSQL用户大会的时候他们来做过一次交流。 |
| 16 | + |
| 17 | +现在Instagram的月度活跃用户数已经超过9000万,每天照片上传量超4000万。 |
| 18 | + |
| 19 | +sharding, 一个非常关键的算法是如何产生所有节点全局唯一的ID。 |
| 20 | + |
| 21 | +Instagram 使用int8来存储这个唯一ID. 把64个bit位拆成3个部分, 如下 : |
| 22 | + |
| 23 | +1\. 最高的41个bit位, 存储从某时间点开始经过的毫秒数. (区别于UNIX epoch, 自'1970-01-01 00:00:00' 以来的秒数) |
| 24 | + |
| 25 | +2\. 接下来的13个bit位, 存储shard ID. |
| 26 | + |
| 27 | +3\. 最后10个bit位, 存储序列值. |
| 28 | + |
| 29 | +例如 : |
| 30 | + |
| 31 | +1\. 指定'2010-01-01 00:00:00' 为这个起点, 这41个bit存储的是从这个起点开始历经的毫秒数. |
| 32 | + |
| 33 | +41个bit位无符号的情况下可以存储2^41=2199023255552个数字, 也就是约69.7year的数据. |
| 34 | + |
| 35 | +``` |
| 36 | +postgres=# select (2^41)/1000/60/60/24/365.0; |
| 37 | + ?column? |
| 38 | +------------------ |
| 39 | + 69.7305700010147 |
| 40 | +(1 row) |
| 41 | +``` |
| 42 | + |
| 43 | +如果把起始值设置为'2012-01-01'的话, 69.7year后也就是 '2081-09-01' 后这个算法将会有问题. 因为数值将大于41个bit位. |
| 44 | + |
| 45 | +2\. shard ID用了13个bit位, 所以可以存储8192个shard节点的信息. |
| 46 | + |
| 47 | +``` |
| 48 | +postgres=# select 2^13; |
| 49 | + ?column? |
| 50 | +---------- |
| 51 | + 8192 |
| 52 | +(1 row) |
| 53 | +``` |
| 54 | + |
| 55 | +如果每个shard节点用到1个主机, 使用这个算法的集群最大可以扩到8192个主机. |
| 56 | + |
| 57 | +3\. 序列值占用10个bit位, 可以存储1024个值. |
| 58 | + |
| 59 | +``` |
| 60 | +postgres=# select 2^10; |
| 61 | + ?column? |
| 62 | +---------- |
| 63 | + 1024 |
| 64 | +(1 row) |
| 65 | +``` |
| 66 | + |
| 67 | +因此可以这么来理解. 在1毫秒内, 每个shard节点, 允许产生1024个唯一值. 1秒产生102.4万个唯一值. |
| 68 | + |
| 69 | +整个集群1秒允许产生102.4*8192 = 83.88608亿个唯一值. |
| 70 | + |
| 71 | +``` |
| 72 | +postgres=# select 1024*1000; |
| 73 | + ?column? |
| 74 | +---------- |
| 75 | + 1024000 |
| 76 | +(1 row) |
| 77 | +``` |
| 78 | + |
| 79 | +前段时间测试过2.0GHz 至强 8核的主机每秒约生成11万个序列值. 所以102.4万个唯一值这个宽度对于一台shard节点来说应该是没有问题的. |
| 80 | + |
| 81 | +``` |
| 82 | +PostgreSQL 的序列性能测试 : |
| 83 | +测试机 : |
| 84 | +CentOS 5.7 x64 |
| 85 | +PostgreSQL 9.2.1 |
| 86 | +DELL R610 |
| 87 | +CPU 2 * Intel(R) Xeon(R) CPU E5504 @ 2.00GHz |
| 88 | + |
| 89 | +1. 测试不开启cache的情况下取序列的速度 : |
| 90 | +创建序列 : |
| 91 | +ocz@db-172-16-3-150-> psql |
| 92 | +psql (9.2.1) |
| 93 | +Type "help" for help. |
| 94 | +digoal=> create sequence seq_test; |
| 95 | +CREATE SEQUENCE |
| 96 | +查看当前序列ID : |
| 97 | +digoal=> select * from seq_test ; |
| 98 | +-[ RECORD 1 ]-+-------------------- |
| 99 | +sequence_name | seq_test |
| 100 | +last_value | 1 |
| 101 | +start_value | 1 |
| 102 | +increment_by | 1 |
| 103 | +max_value | 9223372036854775807 |
| 104 | +min_value | 1 |
| 105 | +cache_value | 1 |
| 106 | +log_cnt | 0 |
| 107 | +is_cycled | f |
| 108 | +is_called | f |
| 109 | + |
| 110 | +pgbench测试脚本 : |
| 111 | +ocz@db-172-16-3-150-> cat t.sql |
| 112 | +select nextval('seq_test'); |
| 113 | + |
| 114 | +测试结果 : |
| 115 | +ocz@db-172-16-3-150-> pgbench -M prepared -n -r -f ./t.sql -c 16 -j 4 -T 30 -U digoal digoal |
| 116 | +transaction type: Custom query |
| 117 | +scaling factor: 1 |
| 118 | +query mode: prepared |
| 119 | +number of clients: 16 |
| 120 | +number of threads: 4 |
| 121 | +duration: 30 s |
| 122 | +number of transactions actually processed: 3085448 |
| 123 | +tps = 102832.533289 (including connections establishing) |
| 124 | +tps = 102891.321540 (excluding connections establishing) |
| 125 | +statement latencies in milliseconds: |
| 126 | + 0.153352 select nextval('seq_test'); |
| 127 | +由此看出不启用cache的情况下每秒可取102891个序列值. |
| 128 | + |
| 129 | +2. 测试开启cache的情况下取序列的速度 : |
| 130 | +digoal=> alter sequence seq_test restart with 1; |
| 131 | +ALTER SEQUENCE |
| 132 | +digoal=> alter sequence seq_test cache 100; |
| 133 | +ALTER SEQUENCE |
| 134 | +ocz@db-172-16-3-150-> pgbench -M prepared -n -r -f ./t.sql -c 16 -j 4 -T 30 -U digoal digoal |
| 135 | +transaction type: Custom query |
| 136 | +scaling factor: 1 |
| 137 | +query mode: prepared |
| 138 | +number of clients: 16 |
| 139 | +number of threads: 4 |
| 140 | +duration: 30 s |
| 141 | +number of transactions actually processed: 3359853 |
| 142 | +tps = 111975.743127 (including connections establishing) |
| 143 | +tps = 112049.881187 (excluding connections establishing) |
| 144 | +statement latencies in milliseconds: |
| 145 | + 0.140799 select nextval('seq_test'); |
| 146 | +ocz@db-172-16-3-150-> psql digoal digoal |
| 147 | +psql (9.2.1) |
| 148 | +Type "help" for help. |
| 149 | +digoal=> \x |
| 150 | +Expanded display is on. |
| 151 | +digoal=> select * from seq_test ; |
| 152 | +-[ RECORD 1 ]-+-------------------- |
| 153 | +sequence_name | seq_test |
| 154 | +last_value | 3360400 |
| 155 | +start_value | 1 |
| 156 | +increment_by | 1 |
| 157 | +max_value | 9223372036854775807 |
| 158 | +min_value | 1 |
| 159 | +cache_value | 100 |
| 160 | +log_cnt | 32 |
| 161 | +is_cycled | f |
| 162 | +is_called | t |
| 163 | +获取速度为112049每秒. 略有提高. 但是如果非长连接的话, 将造成巨大的浪费. 如下 : |
| 164 | +digoal=> alter sequence seq_test restart with 1; |
| 165 | +ALTER SEQUENCE |
| 166 | + |
| 167 | +调整pgbench参数 : |
| 168 | + -C establish new connection for each transaction |
| 169 | + |
| 170 | +测试结果 : |
| 171 | +ocz@db-172-16-3-150-> pgbench -M simple -C -n -r -f ./t.sql -c 16 -j 4 -T 30 -U digoal digoal |
| 172 | +transaction type: Custom query |
| 173 | +scaling factor: 1 |
| 174 | +query mode: simple |
| 175 | +number of clients: 16 |
| 176 | +number of threads: 4 |
| 177 | +duration: 30 s |
| 178 | +number of transactions actually processed: 25865 |
| 179 | +tps = 861.986914 (including connections establishing) |
| 180 | +tps = 59712.243088 (excluding connections establishing) |
| 181 | +statement latencies in milliseconds: |
| 182 | + 13.960707 select nextval('seq_test'); |
| 183 | +查看最后的序列值 : |
| 184 | +ocz@db-172-16-3-150-> psql digoal digoal |
| 185 | +psql (9.2.1) |
| 186 | +Type "help" for help. |
| 187 | +digoal=> \x |
| 188 | +Expanded display is on. |
| 189 | +digoal=> select * from seq_test ; |
| 190 | +-[ RECORD 1 ]-+-------------------- |
| 191 | +sequence_name | seq_test |
| 192 | +last_value | 2588100 |
| 193 | +start_value | 1 |
| 194 | +increment_by | 1 |
| 195 | +max_value | 9223372036854775807 |
| 196 | +min_value | 1 |
| 197 | +cache_value | 100 |
| 198 | +log_cnt | 32 |
| 199 | +is_cycled | f |
| 200 | +is_called | t |
| 201 | +实际处理事务为25865个, 但是序列值已经增长到了2588100, 100倍的浪费. |
| 202 | +测试结果仅供参考. |
| 203 | +``` |
| 204 | + |
| 205 | +例子 : |
| 206 | + |
| 207 | +假设起点为'2012-01-01', 转换成unix epoch再转换成毫秒后: 1325376000000 |
| 208 | + |
| 209 | +``` |
| 210 | +postgres=# select EXTRACT(EPOCH FROM '2012-01-01 00:00:00'::timestamp) * 1000; |
| 211 | + ?column? |
| 212 | +--------------- |
| 213 | + 1325376000000 |
| 214 | +(1 row) |
| 215 | +``` |
| 216 | + |
| 217 | +逻辑的shard可以使用schema来区分, 当然也可以使用database来区分. 本例使用schema来区分. |
| 218 | + |
| 219 | +需要为每个shard创建生成全局唯一ID的函数 : |
| 220 | + |
| 221 | +以 shard_id = 5 这个shard节点为例, 起始epoch = 1325376000000. |
| 222 | + |
| 223 | +函数如下 : |
| 224 | + |
| 225 | +``` |
| 226 | +CREATE OR REPLACE FUNCTION insta5.next_id(OUT result bigint) AS $$ |
| 227 | +DECLARE |
| 228 | + our_epoch bigint := 1325376000000; |
| 229 | + seq_id bigint; |
| 230 | + now_millis bigint; |
| 231 | + shard_id int := 5; |
| 232 | +BEGIN |
| 233 | + SELECT nextval('insta5.table_id_seq') % 1024 INTO seq_id; |
| 234 | + SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) * 1000) INTO now_millis; |
| 235 | + result := (now_millis - our_epoch) << 23; |
| 236 | + result := result | (shard_id << 10); |
| 237 | + result := result | (seq_id); |
| 238 | +END; |
| 239 | +$$ LANGUAGE PLPGSQL; |
| 240 | +``` |
| 241 | + |
| 242 | +shard 5 中对应的表结构, id为全局唯一主键, 默认值来自上面的函数产生的值. |
| 243 | + |
| 244 | +``` |
| 245 | +CREATE TABLE insta5.our_table ( |
| 246 | + "id" bigint NOT NULL DEFAULT insta5.next_id(), |
| 247 | + ...rest of table schema... |
| 248 | +) |
| 249 | +``` |
| 250 | + |
| 251 | +## 小结 |
| 252 | +1\. 因为int8是带符号整型, 如果第一个BIT=1, 得出负数. |
| 253 | + |
| 254 | +因此前34.87year这个函数产生的是正数, 后34.87year这个函数产生的是负数. |
| 255 | + |
| 256 | +``` |
| 257 | +postgres=# select 2^40/1000/60/60/24/365; |
| 258 | + ?column? |
| 259 | +------------------ |
| 260 | + 34.8652850005074 |
| 261 | +(1 row) |
| 262 | +``` |
| 263 | + |
| 264 | +例如 : |
| 265 | + |
| 266 | +正值 : |
| 267 | + |
| 268 | +``` |
| 269 | +postgres=# select date '2012-01-01'+ interval '34.8652850005074 year' |
| 270 | +postgres-# ; |
| 271 | + ?column? |
| 272 | +--------------------- |
| 273 | + 2046-11-01 00:00:00 |
| 274 | +(1 row) |
| 275 | + |
| 276 | +postgres=# do language plpgsql $$ |
| 277 | +DECLARE |
| 278 | + our_epoch bigint := 1325376000000; |
| 279 | + seq_id bigint; |
| 280 | + now_millis bigint; |
| 281 | + shard_id int := 5; |
| 282 | + result bigint; |
| 283 | +BEGIN |
| 284 | + SELECT 112345 % 1024 INTO seq_id; |
| 285 | + |
| 286 | + SELECT FLOOR(EXTRACT(EPOCH FROM '2046-11-01 00:00:00'::timestamp) * 1000) INTO now_millis; |
| 287 | + result := (now_millis - our_epoch) << 23; |
| 288 | + result := result | (shard_id << 10); |
| 289 | + result := result | (seq_id); |
| 290 | + raise notice '%', result; |
| 291 | +END; |
| 292 | +$$; |
| 293 | +NOTICE: 9221321628057605849 |
| 294 | +DO |
| 295 | +``` |
| 296 | + |
| 297 | +负值 : |
| 298 | + |
| 299 | +``` |
| 300 | +postgres=# select date '2012-01-01'+ interval '34.9652850005074 year'; |
| 301 | + ?column? |
| 302 | +--------------------- |
| 303 | + 2046-12-01 00:00:00 |
| 304 | +(1 row) |
| 305 | + |
| 306 | +postgres=# do language plpgsql $$ |
| 307 | +DECLARE |
| 308 | + our_epoch bigint := 1325376000000; |
| 309 | + seq_id bigint; |
| 310 | + now_millis bigint; |
| 311 | + shard_id int := 5; |
| 312 | + result bigint; |
| 313 | +BEGIN |
| 314 | + SELECT 112345 % 1024 INTO seq_id; |
| 315 | + |
| 316 | + SELECT FLOOR(EXTRACT(EPOCH FROM '2046-12-01 00:00:00'::timestamp) * 1000) INTO now_millis; |
| 317 | + result := (now_millis - our_epoch) << 23; |
| 318 | + result := result | (shard_id << 10); |
| 319 | + result := result | (seq_id); |
| 320 | + raise notice '%', result; |
| 321 | +END; |
| 322 | +$$; |
| 323 | +NOTICE: -9203679173715945767 |
| 324 | +DO |
| 325 | +``` |
| 326 | + |
| 327 | +2\. 这个算法的好处还有1个就是它产生的值是有顺序的, 不是无序的UUID. 因此存储顺序和索引的顺序一致性非常高. |
| 328 | + |
| 329 | +对于使用索引查找是非常有效的. |
| 330 | + |
| 331 | +并且这个算法对于shard也非常方便. |
| 332 | + |
| 333 | +## 参考 |
| 334 | +1\. |
| 335 | + |
| 336 | +``` |
| 337 | +postgres=# \do | |
| 338 | + List of operators |
| 339 | + Schema | Name | Left arg type | Right arg type | Result type | Description |
| 340 | +------------+------+---------------+----------------+-------------+------------------- |
| 341 | + pg_catalog | | | bigint | bigint | bigint | bitwise or |
| 342 | +``` |
0 commit comments