Skip to content

Commit 27d6d68

Browse files
committed
10.0
1 parent d2ab0e6 commit 27d6d68

File tree

11 files changed

+1177
-1
lines changed

11 files changed

+1177
-1
lines changed

201210/20121024_01.md

Lines changed: 689 additions & 0 deletions
Large diffs are not rendered by default.

201210/readme.md

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1,4 +1,5 @@
11
### 文章列表
22
----
3+
##### 20121024_01.md [《PostgreSQL 优化CASE - 无序UUID性能问题诊断》](20121024_01.md)
34
##### 20121022_01.md [《use pg_stat_plans in PostgreSQL 9.0, 9.1 and 9.2》](20121022_01.md)
45
##### 20121014_01.md [《How many xlogs | WAL can be generated by PostgreSQL?》](20121014_01.md)

201301/20130122_01.md

Lines changed: 342 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,342 @@
1+
## PostgreSQL sharding有序UUID最佳实践 - serial global uuid stored in 64bit int8
2+
3+
### 作者
4+
digoal
5+
6+
### 日期
7+
2013-01-22
8+
9+
### 标签
10+
PostgreSQL , uuid , 有序uuid , sharding
11+
12+
----
13+
14+
## 背景
15+
Instagram 使用PostgreSQL数据库, 2012year中国PostgreSQL用户大会的时候他们来做过一次交流。
16+
17+
现在Instagram的月度活跃用户数已经超过9000万,每天照片上传量超4000万。
18+
19+
sharding, 一个非常关键的算法是如何产生所有节点全局唯一的ID。
20+
21+
Instagram 使用int8来存储这个唯一ID. 把64个bit位拆成3个部分, 如下 :
22+
23+
1\. 最高的41个bit位, 存储从某时间点开始经过的毫秒数. (区别于UNIX epoch, 自'1970-01-01 00:00:00' 以来的秒数)
24+
25+
2\. 接下来的13个bit位, 存储shard ID.
26+
27+
3\. 最后10个bit位, 存储序列值.
28+
29+
例如 :
30+
31+
1\. 指定'2010-01-01 00:00:00' 为这个起点, 这41个bit存储的是从这个起点开始历经的毫秒数.
32+
33+
41个bit位无符号的情况下可以存储2^41=2199023255552个数字, 也就是约69.7year的数据.
34+
35+
```
36+
postgres=# select (2^41)/1000/60/60/24/365.0;
37+
?column?
38+
------------------
39+
69.7305700010147
40+
(1 row)
41+
```
42+
43+
如果把起始值设置为'2012-01-01'的话, 69.7year后也就是 '2081-09-01' 后这个算法将会有问题. 因为数值将大于41个bit位.
44+
45+
2\. shard ID用了13个bit位, 所以可以存储8192个shard节点的信息.
46+
47+
```
48+
postgres=# select 2^13;
49+
?column?
50+
----------
51+
8192
52+
(1 row)
53+
```
54+
55+
如果每个shard节点用到1个主机, 使用这个算法的集群最大可以扩到8192个主机.
56+
57+
3\. 序列值占用10个bit位, 可以存储1024个值.
58+
59+
```
60+
postgres=# select 2^10;
61+
?column?
62+
----------
63+
1024
64+
(1 row)
65+
```
66+
67+
因此可以这么来理解. 在1毫秒内, 每个shard节点, 允许产生1024个唯一值. 1秒产生102.4万个唯一值.
68+
69+
整个集群1秒允许产生102.4*8192 = 83.88608亿个唯一值.
70+
71+
```
72+
postgres=# select 1024*1000;
73+
?column?
74+
----------
75+
1024000
76+
(1 row)
77+
```
78+
79+
前段时间测试过2.0GHz 至强 8核的主机每秒约生成11万个序列值. 所以102.4万个唯一值这个宽度对于一台shard节点来说应该是没有问题的.
80+
81+
```
82+
PostgreSQL 的序列性能测试 :
83+
测试机 :
84+
CentOS 5.7 x64
85+
PostgreSQL 9.2.1
86+
DELL R610
87+
CPU 2 * Intel(R) Xeon(R) CPU E5504 @ 2.00GHz
88+
89+
1. 测试不开启cache的情况下取序列的速度 :
90+
创建序列 :
91+
ocz@db-172-16-3-150-> psql
92+
psql (9.2.1)
93+
Type "help" for help.
94+
digoal=> create sequence seq_test;
95+
CREATE SEQUENCE
96+
查看当前序列ID :
97+
digoal=> select * from seq_test ;
98+
-[ RECORD 1 ]-+--------------------
99+
sequence_name | seq_test
100+
last_value | 1
101+
start_value | 1
102+
increment_by | 1
103+
max_value | 9223372036854775807
104+
min_value | 1
105+
cache_value | 1
106+
log_cnt | 0
107+
is_cycled | f
108+
is_called | f
109+
110+
pgbench测试脚本 :
111+
ocz@db-172-16-3-150-> cat t.sql
112+
select nextval('seq_test');
113+
114+
测试结果 :
115+
ocz@db-172-16-3-150-> pgbench -M prepared -n -r -f ./t.sql -c 16 -j 4 -T 30 -U digoal digoal
116+
transaction type: Custom query
117+
scaling factor: 1
118+
query mode: prepared
119+
number of clients: 16
120+
number of threads: 4
121+
duration: 30 s
122+
number of transactions actually processed: 3085448
123+
tps = 102832.533289 (including connections establishing)
124+
tps = 102891.321540 (excluding connections establishing)
125+
statement latencies in milliseconds:
126+
0.153352 select nextval('seq_test');
127+
由此看出不启用cache的情况下每秒可取102891个序列值.
128+
129+
2. 测试开启cache的情况下取序列的速度 :
130+
digoal=> alter sequence seq_test restart with 1;
131+
ALTER SEQUENCE
132+
digoal=> alter sequence seq_test cache 100;
133+
ALTER SEQUENCE
134+
ocz@db-172-16-3-150-> pgbench -M prepared -n -r -f ./t.sql -c 16 -j 4 -T 30 -U digoal digoal
135+
transaction type: Custom query
136+
scaling factor: 1
137+
query mode: prepared
138+
number of clients: 16
139+
number of threads: 4
140+
duration: 30 s
141+
number of transactions actually processed: 3359853
142+
tps = 111975.743127 (including connections establishing)
143+
tps = 112049.881187 (excluding connections establishing)
144+
statement latencies in milliseconds:
145+
0.140799 select nextval('seq_test');
146+
ocz@db-172-16-3-150-> psql digoal digoal
147+
psql (9.2.1)
148+
Type "help" for help.
149+
digoal=> \x
150+
Expanded display is on.
151+
digoal=> select * from seq_test ;
152+
-[ RECORD 1 ]-+--------------------
153+
sequence_name | seq_test
154+
last_value | 3360400
155+
start_value | 1
156+
increment_by | 1
157+
max_value | 9223372036854775807
158+
min_value | 1
159+
cache_value | 100
160+
log_cnt | 32
161+
is_cycled | f
162+
is_called | t
163+
获取速度为112049每秒. 略有提高. 但是如果非长连接的话, 将造成巨大的浪费. 如下 :
164+
digoal=> alter sequence seq_test restart with 1;
165+
ALTER SEQUENCE
166+
167+
调整pgbench参数 :
168+
-C establish new connection for each transaction
169+
170+
测试结果 :
171+
ocz@db-172-16-3-150-> pgbench -M simple -C -n -r -f ./t.sql -c 16 -j 4 -T 30 -U digoal digoal
172+
transaction type: Custom query
173+
scaling factor: 1
174+
query mode: simple
175+
number of clients: 16
176+
number of threads: 4
177+
duration: 30 s
178+
number of transactions actually processed: 25865
179+
tps = 861.986914 (including connections establishing)
180+
tps = 59712.243088 (excluding connections establishing)
181+
statement latencies in milliseconds:
182+
13.960707 select nextval('seq_test');
183+
查看最后的序列值 :
184+
ocz@db-172-16-3-150-> psql digoal digoal
185+
psql (9.2.1)
186+
Type "help" for help.
187+
digoal=> \x
188+
Expanded display is on.
189+
digoal=> select * from seq_test ;
190+
-[ RECORD 1 ]-+--------------------
191+
sequence_name | seq_test
192+
last_value | 2588100
193+
start_value | 1
194+
increment_by | 1
195+
max_value | 9223372036854775807
196+
min_value | 1
197+
cache_value | 100
198+
log_cnt | 32
199+
is_cycled | f
200+
is_called | t
201+
实际处理事务为25865个, 但是序列值已经增长到了2588100, 100倍的浪费.
202+
测试结果仅供参考.
203+
```
204+
205+
例子 :
206+
207+
假设起点为'2012-01-01', 转换成unix epoch再转换成毫秒后: 1325376000000
208+
209+
```
210+
postgres=# select EXTRACT(EPOCH FROM '2012-01-01 00:00:00'::timestamp) * 1000;
211+
?column?
212+
---------------
213+
1325376000000
214+
(1 row)
215+
```
216+
217+
逻辑的shard可以使用schema来区分, 当然也可以使用database来区分. 本例使用schema来区分.
218+
219+
需要为每个shard创建生成全局唯一ID的函数 :
220+
221+
以 shard_id = 5 这个shard节点为例, 起始epoch = 1325376000000.
222+
223+
函数如下 :
224+
225+
```
226+
CREATE OR REPLACE FUNCTION insta5.next_id(OUT result bigint) AS $$
227+
DECLARE
228+
our_epoch bigint := 1325376000000;
229+
seq_id bigint;
230+
now_millis bigint;
231+
shard_id int := 5;
232+
BEGIN
233+
SELECT nextval('insta5.table_id_seq') % 1024 INTO seq_id;
234+
SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) * 1000) INTO now_millis;
235+
result := (now_millis - our_epoch) << 23;
236+
result := result | (shard_id << 10);
237+
result := result | (seq_id);
238+
END;
239+
$$ LANGUAGE PLPGSQL;
240+
```
241+
242+
shard 5 中对应的表结构, id为全局唯一主键, 默认值来自上面的函数产生的值.
243+
244+
```
245+
CREATE TABLE insta5.our_table (
246+
"id" bigint NOT NULL DEFAULT insta5.next_id(),
247+
...rest of table schema...
248+
)
249+
```
250+
251+
## 小结
252+
1\. 因为int8是带符号整型, 如果第一个BIT=1, 得出负数.
253+
254+
因此前34.87year这个函数产生的是正数, 后34.87year这个函数产生的是负数.
255+
256+
```
257+
postgres=# select 2^40/1000/60/60/24/365;
258+
?column?
259+
------------------
260+
34.8652850005074
261+
(1 row)
262+
```
263+
264+
例如 :
265+
266+
正值 :
267+
268+
```
269+
postgres=# select date '2012-01-01'+ interval '34.8652850005074 year'
270+
postgres-# ;
271+
?column?
272+
---------------------
273+
2046-11-01 00:00:00
274+
(1 row)
275+
276+
postgres=# do language plpgsql $$
277+
DECLARE
278+
our_epoch bigint := 1325376000000;
279+
seq_id bigint;
280+
now_millis bigint;
281+
shard_id int := 5;
282+
result bigint;
283+
BEGIN
284+
SELECT 112345 % 1024 INTO seq_id;
285+
286+
SELECT FLOOR(EXTRACT(EPOCH FROM '2046-11-01 00:00:00'::timestamp) * 1000) INTO now_millis;
287+
result := (now_millis - our_epoch) << 23;
288+
result := result | (shard_id << 10);
289+
result := result | (seq_id);
290+
raise notice '%', result;
291+
END;
292+
$$;
293+
NOTICE: 9221321628057605849
294+
DO
295+
```
296+
297+
负值 :
298+
299+
```
300+
postgres=# select date '2012-01-01'+ interval '34.9652850005074 year';
301+
?column?
302+
---------------------
303+
2046-12-01 00:00:00
304+
(1 row)
305+
306+
postgres=# do language plpgsql $$
307+
DECLARE
308+
our_epoch bigint := 1325376000000;
309+
seq_id bigint;
310+
now_millis bigint;
311+
shard_id int := 5;
312+
result bigint;
313+
BEGIN
314+
SELECT 112345 % 1024 INTO seq_id;
315+
316+
SELECT FLOOR(EXTRACT(EPOCH FROM '2046-12-01 00:00:00'::timestamp) * 1000) INTO now_millis;
317+
result := (now_millis - our_epoch) << 23;
318+
result := result | (shard_id << 10);
319+
result := result | (seq_id);
320+
raise notice '%', result;
321+
END;
322+
$$;
323+
NOTICE: -9203679173715945767
324+
DO
325+
```
326+
327+
2\. 这个算法的好处还有1个就是它产生的值是有顺序的, 不是无序的UUID. 因此存储顺序和索引的顺序一致性非常高.
328+
329+
对于使用索引查找是非常有效的.
330+
331+
并且这个算法对于shard也非常方便.
332+
333+
## 参考
334+
1\.
335+
336+
```
337+
postgres=# \do |
338+
List of operators
339+
Schema | Name | Left arg type | Right arg type | Result type | Description
340+
------------+------+---------------+----------------+-------------+-------------------
341+
pg_catalog | | | bigint | bigint | bigint | bitwise or
342+
```

201301/readme.md

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -2,6 +2,7 @@
22
----
33
##### 20130128_01.md [《PostgreSQL Developer Options (debug, trace, system table mod and so on...) 详解》](20130128_01.md)
44
##### 20130127_01.md [《PostgreSQL 9.3 will add pg_isready script to test PostgreSQL server if allowed to connect》](20130127_01.md)
5+
##### 20130122_01.md [《PostgreSQL sharding有序UUID最佳实践 - serial global uuid stored in 64bit int8》](20130122_01.md)
56
##### 20130110_01.md [《HOW to Change PostgreSQL's TOAST_TUPLE_THRESHOLD》](20130110_01.md)
67
##### 20130109_01.md [《PostgreSQL large row|column performance tuning case》](20130109_01.md)
78
##### 20130105_01.md [《PostgreSQL partition table name convert to data type》](20130105_01.md)

201702/20170215_03.md

Lines changed: 11 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -236,14 +236,24 @@ relocatable = true
236236
然后就可以使用 create extension pg_sphere; 创建这个模块了
237237

238238
修改Makefile,略。
239+
240+
## 注意
241+
pg_sphere,2017年开始重新拉的分支.
242+
243+
https://github.com/pgsphere
244+
245+
http://pgsphere.github.io/
246+
247+
http://pgsphere.github.io/download.html
239248

240249
## 参考
241250
http://pgsphere.projects.pgfoundry.org/index.html
242251

243252
http://skyview.gsfc.nasa.gov/xaminblog/index.php/tag/pgsphere/
244253

245254
https://github.com/mnullmei/pgsphere/tree/fixes-1-1-1
255+
256+
[pg_sphere 1.1.5手册](20170215_03_pdf_001.pdf)
246257

247-
248258
[Count](http://info.flagcounter.com/h9V1)
249259

201702/20170215_03_pdf_001.pdf

369 KB
Binary file not shown.

0 commit comments

Comments
 (0)