Commit e7485f3

committed
new doc
1 parent 25ca37a commit e7485f3

File tree

10 files changed

+342
-0
lines changed

201505/20150507_01.md

Lines changed: 169 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,169 @@
1+
## Roaring Bitmap - A better compressed bitset
2+
3+
### Author
4+
digoal
5+
6+
### Date
7+
2015-05-07
8+
9+
### Tags
10+
PostgreSQL , roaring bitmap , bitmap index
11+
12+
----
13+
14+
## Background
15+
### A better compressed bitset
16+
17+
Bitsets, also called bitmaps, are commonly used as fast data structures. Unfortunately, they can use too much memory. To compensate, we often use compressed bitmaps.
18+
19+
Roaring bitmaps are compressed bitmaps which tend to outperform conventional compressed bitmaps such as WAH, EWAH or Concise. In some instances, they can be hundreds of times faster and they often offer significantly better compression.
20+
21+
Roaring bitmaps are used in Apache Lucene (as of version 5.0 using an independent implementation), Druid.io (as of version 0.7) and Apache Spark (as of version 1.2).
22+
23+
Roaring bitmap is an efficient and widely used bitmap compression algorithm.
24+
25+
Bitmap indexes are commonly used in databases and search engines.
26+
27+
By exploiting bit-level parallelism, they can significantly accelerate queries.
28+
29+
However, they can use much memory, and thus we might prefer compressed bitmap indexes.
30+
31+
Following Oracle's lead, bitmaps are often compressed using run-length encoding (RLE).
32+
33+
Building on prior work, we introduce the Roaring compressed bitmap format:
34+
it uses packed arrays for compression instead of RLE.
35+
36+
We compare it to two high-performance RLE-based bitmap encoding techniques:
37+
WAH (Word Aligned Hybrid compression scheme) and Concise (Compressed `n' Composable Integer Set).
38+
39+
On synthetic and real data, we find that Roaring bitmaps
40+
41+
(1) often compress significantly better (e.g., 2 times)
42+
43+
(2) are faster than the compressed alternatives (up to 900 times faster for intersections).
44+
45+
Our results challenge the view that RLE-based bitmap compression is best.
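As a rough illustration of the packed-array idea (not the library's actual implementation), the sketch below splits 32-bit integers into chunks by their upper 16 bits, then stores sparse chunks as sorted 16-bit arrays and dense chunks as fixed-size bitmaps. The 4096 cutoff follows the Roaring paper; the function name `roaring_layout` and the byte estimates are assumptions made for this sketch.

```python
from collections import defaultdict

# Cutoff from the Roaring paper: 4096 values * 2 bytes = 8192 bytes,
# which is exactly the size of a 2^16-bit bitmap.
ARRAY_LIMIT = 4096


def roaring_layout(values):
    """Group 32-bit integers by their high 16 bits and pick a container per chunk.

    Returns {high_bits: (container_kind, approximate_bytes)}.
    Hypothetical helper for illustration only.
    """
    chunks = defaultdict(set)
    for v in values:
        chunks[v >> 16].add(v & 0xFFFF)  # key: high 16 bits, payload: low 16 bits

    layout = {}
    for high, lows in sorted(chunks.items()):
        if len(lows) <= ARRAY_LIMIT:
            # Sparse chunk: packed array of 16-bit values, 2 bytes each.
            layout[high] = ("array", 2 * len(lows))
        else:
            # Dense chunk: plain bitmap of 2^16 bits = 8192 bytes.
            layout[high] = ("bitmap", 8192)
    return layout


# Ten values in chunk 0 (sparse) and 5000 values in chunk 1 (dense).
layout = roaring_layout(list(range(10)) + list(range(65536, 65536 + 5000)))
print(layout)
```

With this rule no chunk ever costs more than 8192 bytes, and sparse chunks cost only 2 bytes per stored value.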
46+
47+
Compared with the bitmap compression technique used by Oracle, Roaring bitmaps have two advantages.
48+
49+
They achieve a higher compression ratio and they are faster.
50+
51+
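The two figures below, excerpted from a paper, show the bitmap compression algorithms published over the years and how they relate to one another.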
52+
53+
http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=7040519
54+
55+
http://www.academia.edu/7140159/Bitmap_index_in_search_of_Internet_traffic_big_data
56+
57+
![pic](20150507_01_pic_001.png)
58+
59+
![pic](20150507_01_pic_002.png)
60+
61+
## A bitmap format used in PostgreSQL
62+
63+
Its compression method is based on a hybrid run-length (HRL) compression algorithm.
64+
65+
Reference:
66+
67+
https://wiki.postgresql.org/wiki/Bitmap_Indexes
68+
69+
Algorithm:
70+
71+
The bitmap is split into two parts: a header section and a content section.
72+
73+
Each bit in the header section corresponds to one word in the content section.
74+
75+
A header bit of 0 means the corresponding content word is uncompressed.
76+
77+
A header bit of 1 means the corresponding content word is compressed.
78+
79+
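In a compressed content word, the first bit indicates whether the compressed value is 1 or 0, and the remaining bits give the number of words that were compressed.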
80+
81+
Compressing a single word therefore gains nothing; a run must cover at least two words for compression to pay off.
82+
83+
Header Section (Header Words)
84+
85+
The header section contains bits, each of which corresponds to a word in the content section. If a bit in the header section is 1, then the corresponding word in the content section is a compressed word; if the bit is 0, then the corresponding word is not a compressed word.
86+
87+
88+
Content Section (Content Words)
89+
90+
For a compressed word in the content section, the first bit in this word indicates whether 1s or 0s are compressed. The rest of the bits represent the value of "<the number of bits>/<word size>".
91+
92+
Example
93+
94+
Consider the uncompressed bitmap vector for LOV item M:
95+
96+
```
11111111 10001000 11110001 11100010 11111111 11111111
```
99+
100+
If the size of a word is set to 8, then an HRL compressed form for this bitmap vector is as follows:
101+
102+
```
header section: 00001

content section: 11111111 10001000 11110001 11100010 10000010
```
107+
108+
The first four words are uncompressed.
111+
112+
The fifth word is compressed and its first bit is set to one, so it compresses ones. The remaining bits, 0000010, evaluate to 2, so this compressed word represents 16 bits of ones (2 * 8 = 16).
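The header/content scheme can be sketched in a few lines of Python. This is a minimal illustration written from the description above, not the code proposed for PostgreSQL; the function names `hrl_compress` and `hrl_decompress` are hypothetical, and bitmaps are modelled as strings of `0` and `1` for readability.

```python
def hrl_compress(bits, word_size=8):
    """Compress a bit string into (header, content_words) in the HRL style.

    A run of two or more identical all-0 or all-1 words becomes a single
    compressed word: fill bit followed by the run length in words.
    """
    assert len(bits) % word_size == 0, "bitmap must be a whole number of words"
    words = [bits[i:i + word_size] for i in range(0, len(bits), word_size)]
    fills = ("0" * word_size, "1" * word_size)
    max_run = 2 ** (word_size - 1) - 1  # largest count that fits after the fill bit

    header, content = [], []
    i = 0
    while i < len(words):
        word = words[i]
        run = 1
        if word in fills:
            while i + run < len(words) and words[i + run] == word and run < max_run:
                run += 1
        if word in fills and run >= 2:
            header.append("1")  # compressed word
            content.append(word[0] + format(run, "0{}b".format(word_size - 1)))
            i += run
        else:
            header.append("0")  # literal (uncompressed) word
            content.append(word)
            i += 1
    return "".join(header), content


def hrl_decompress(header, content, word_size=8):
    """Inverse of hrl_compress."""
    out = []
    for flag, word in zip(header, content):
        if flag == "1":
            out.append(word[0] * (word_size * int(word[1:], 2)))
        else:
            out.append(word)
    return "".join(out)


bitmap = "111111111000100011110001111000101111111111111111"
header, content = hrl_compress(bitmap)
print("header section: ", header)
print("content section:", " ".join(content))
```

Run on the bitmap vector from the example, the sketch reproduces the header `00001` and the five content words shown above, and `hrl_decompress` restores the original 48 bits.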
113+
114+
Sorting can be used as an optimization to raise the compression ratio.
115+
116+
Once the data is sorted, the compression algorithm works far more effectively.
117+
118+
Sort-based Optimization
119+
120+
As bitmap indexes are often used in data warehousing systems, pre-sorting the values during the ETL stage can offer much better compression.
121+
122+
Example
123+
124+
Consider the uncompressed bitmap vector for LOV item M:
125+
126+
```
00000000 00000111 11111111 11111111 11111111 11111111
```
129+
130+
If the size of a word is set to 8, then an HRL compressed form for this bitmap vector is as follows:
131+
132+
```
header section: 001

content section: 00000000 00000111 10000100
```
137+
138+
The first word is uncompressed.
139+
140+
The second word is uncompressed.
141+
142+
The third word is compressed and its first bit is set to one, so it compresses ones. The remaining bits, 0000100, evaluate to 4, so this compressed word represents 32 bits of ones (4 * 8 = 32).
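To see why pre-sorting helps, the following sketch counts how many content words an HRL-style encoding would need for the same column before and after sorting. It is an illustration under simplifying assumptions: the helper names are hypothetical, the column data is made up, and runs are assumed short enough to fit in one compressed word.

```python
from itertools import groupby


def bitmap_for(column, value):
    """Bit string with one bit per row: 1 where the row holds `value`."""
    return "".join("1" if v == value else "0" for v in column)


def hrl_content_words(bits, word_size=8):
    """Count content words: one per literal word, one per run of >= 2 equal fill words."""
    words = [bits[i:i + word_size] for i in range(0, len(bits), word_size)]
    fills = {"0" * word_size, "1" * word_size}
    count = 0
    for word, group in groupby(words):
        n = len(list(group))
        count += 1 if (word in fills and n >= 2) else n
    return count


column = ["M", "F"] * 24  # 48 rows with alternating values (made-up data)
unsorted_words = hrl_content_words(bitmap_for(column, "M"))
sorted_words = hrl_content_words(bitmap_for(sorted(column), "M"))
print(unsorted_words, sorted_words)
```

In the unsorted column every word is a literal, so nothing is saved; after sorting, the bitmap collapses into one run of zeros followed by one run of ones.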
143+
144+
## References
145+
1\. http://roaringbitmap.org/
146+
147+
2\. https://github.com/andreasvc/roaringbitmap
148+
149+
3\. https://pypi.python.org/pypi/roaringbitmap/0.1
150+
151+
4\. http://arxiv.org/abs/1402.6407
152+
153+
5\. http://arxiv.org/pdf/1402.6407.pdf
154+
155+
6\. http://lemire.me/data/realroaring2014.html
156+
157+
7\. http://www.postgresql.org/message-id/flat/[email protected]#[email protected]
158+
159+
8\. https://wiki.postgresql.org/wiki/Bitmap_Indexes
160+
161+
9\. http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=7040519
162+
163+
10\. http://en.wikipedia.org/wiki/Run-length_encoding
164+
165+
11\. http://www.academia.edu/7140159/Bitmap_index_in_search_of_Internet_traffic_big_data
166+
167+
12\. [pdf1](20150507_01_pdf_001.pdf)
168+
169+
13\. [pdf2](20150507_01_pdf_002.pdf)

201505/20150507_01_pdf_001.pdf

1.9 MB
Binary file not shown.

201505/20150507_01_pdf_002.pdf

2.11 MB
Binary file not shown.

201505/20150507_01_pic_001.png

40.9 KB

201505/20150507_01_pic_002.png

101 KB

201505/readme.md

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -8,6 +8,7 @@
88
##### 20150511_01.md [《parallel blocking|waiting by slow BLOCK extend relation , ExclusiveLock on extension of relation》](20150511_01.md)
99
##### 20150510_01.md [《PostgreSQL find out session's current query blocked by which transaction use pg_locks & pg_stat_activity》](20150510_01.md)
1010
##### 20150509_01.md [《PostgreSQL 代码性能诊断之 - OProfile & Systemtap》](20150509_01.md)
11+
##### 20150507_01.md [《Roaring Bitmap - A better compressed bitset》](20150507_01.md)
1112
##### 20150506_07.md [《PostgreSQL 检查点性能影响及源码分析 - 7》](20150506_07.md)
1213
##### 20150506_06.md [《PostgreSQL 检查点性能影响及源码分析 - 6》](20150506_06.md)
1314
##### 20150506_05.md [《PostgreSQL 检查点性能影响及源码分析 - 5》](20150506_05.md)

201705/20170512_01.md

Lines changed: 169 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,169 @@
1+
## Greenplum Best Practices - When to Choose a Bitmap Index
2+
3+
### Author
4+
digoal
5+
6+
### Date
7+
2017-05-12
8+
9+
### Tags
10+
PostgreSQL , Greenplum , bitmap index
11+
12+
----
13+
14+
## Background
15+
PostgreSQL currently supports eight index access methods: B-Tree, hash, GIN, GiST, SP-GiST, BRIN, RUM, and bloom (RUM and bloom are provided as extensions).
16+
17+
Greenplum currently supports three index access methods: B-Tree, GiST, and bitmap.
18+
19+
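Depending on the data type and the kind of request, users can build an index with the appropriate access method. For example, GIN suits arrays and full-text search types, while GiST suits geographic locations, range types, image feature data, and geometric data.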
20+
21+
For an introduction to the eight PG index types, see Bruce's index internals material, the source code, and the following document.
22+
23+
http://leopard.in.ua/2015/04/13/postgresql-indexes#.WRHHH_mGOiQ
24+
25+
## How a bitmap index works
26+
As the figure shows, a bitmap index uses each indexed VALUE as a KEY and one BIT per row. The bit is set to 1 when the row contains that VALUE and to 0 otherwise.
27+
28+
![pic](20170512_01_pic_001.jpg)
29+
30+
How is data retrieved through a bitmap index and matched to the corresponding records in the HEAP table?
31+
32+
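A mapping function is required to translate a BIT position into a row number. For example, the first BIT represents the first row, and so on. (Real mapping functions are not this simple and involve many optimization techniques.)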
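A toy version of this structure, with the simplest possible mapping function, might look like the sketch below. It assumes an in-memory column, uses Python integers as bitmaps, and numbers rows from 0 rather than 1; `build_bitmap_index` and `rows_of` are hypothetical names, not Greenplum functions.

```python
def build_bitmap_index(column):
    """Map each distinct value to an int bitmap; bit i is set when row i holds the value."""
    index = {}
    for row, value in enumerate(column):
        index[value] = index.get(value, 0) | (1 << row)
    return index


def rows_of(bitmap):
    """Trivial mapping function: bit position i maps to row number i."""
    row = 0
    while bitmap:
        if bitmap & 1:
            yield row
        bitmap >>= 1
        row += 1


column = ["a", "b", "a", "c"]
index = build_bitmap_index(column)
for value, bitmap in index.items():
    print(value, format(bitmap, "04b"), list(rows_of(bitmap)))
```

A real index maps bit positions to tuple IDs (block and offset) rather than to a simple row counter, which is where the optimizations mentioned above come in.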
33+
34+
Examples of bitmap optimization techniques include:
35+
36+
1\. Compression
37+
38+
For example, consecutive 0s or 1s can be compressed. See the wiki for details on bitmap compression algorithms, of which there are many.
39+
40+
2\. Segmentation, or segmented compression
41+
42+
For example, each data block forms one segment, and each segment records the BIT information for the VALUEs found in that data block.
43+
44+
3\. Sorting
45+
46+
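Sorting makes compression more effective. For example, once the heap table is sorted by the indexed column, the row numbers for each VALUE become consecutive, so the compression ratio is very high.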
47+
48+
Users can also look at Roaring bitmap, a bitmap library that is very widely used and performs well.
49+
50+
https://github.com/zeromax007/gpdb-roaringbitmap
51+
52+
https://github.com/RoaringBitmap/CRoaring
53+
54+
## When a bitmap index is a good fit
55+
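From the structure of a bitmap index, we know that each value in the indexed column has its own BIT string. The length of the BIT string equals the number of rows. Each BIT represents one row: 1 means the row holds the value and 0 means it does not.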
56+
57+
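The indexed column should therefore not have too many distinct VALUEs. Ideally it has between 100 and 100,000, which means the BITMAP index on such a table holds between 100 and 100,000 BIT strings.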
58+
59+
Queries of the following kind against such columns are very efficient.
60+
61+
```
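select * from tbl where col1 = a and col2 = b and col3 = xxx;
-- BITAND the bit strings for a and b, then BITAND the result with the bit string for col3 = xxx; for every bit set to 1, use the bitmap mapping function to obtain the row number and fetch the record.

select count(*) from tbl where col1 = a and col2 = b and col3 = xxx;
-- BITAND the bit strings for a and b, then BITAND the result with the bit string for col3 = xxx; for every bit set to 1, use the bitmap mapping function to obtain the row number, fetch the record, and compute count(*).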
```
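The BITAND step described in the comments can be mimicked with Python integers as bitmaps. This is only a sketch of the idea with made-up column data and a hypothetical helper `bitmap`; it says nothing about how Greenplum executes the query internally.

```python
def bitmap(column, wanted):
    """Int bitmap with bit i set when row i of `column` equals `wanted`."""
    return sum(1 << i for i, v in enumerate(column) if v == wanted)


# Two indexed columns of the same five-row table (made-up data).
col1 = ["a", "b", "a", "a", "c"]
col2 = ["x", "x", "y", "x", "x"]

# WHERE col1 = 'a' AND col2 = 'x': AND the two bitmaps together.
match = bitmap(col1, "a") & bitmap(col2, "x")

# Bits set to 1 map back to row numbers (0-based here).
rows = [i for i in range(len(col1)) if (match >> i) & 1]

# For count(*), counting the set bits is enough in this toy model.
count = bin(match).count("1")
print(rows, count)
```

Each extra AND condition is one more bitwise operation on the bitmaps, which is why queries with many conditions on indexed columns suit this structure.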
68+
69+
1\. Suitable for columns with a small number of distinct values.
70+
71+
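2\. Suitable for queries with multiple conditions: the more conditions there are, the more data the bit AND and OR operations filter out and the smaller the result set.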
72+
73+
## When a bitmap index is not a good fit
74+
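Every VALUE must record one BIT per row, so with 100 million rows the BIT string for each VALUE is 100 million bits long. With 1 million distinct VALUEs, the BITMAP INDEX holds 1 million bit strings of 100 million bits each.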
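The arithmetic behind this example is worth spelling out. The short calculation below gives the uncompressed size only; it is a back-of-the-envelope estimate, since real bitmap indexes are compressed and the actual size depends on the data.

```python
rows = 100_000_000          # 100 million rows
distinct_values = 1_000_000  # 1 million distinct VALUEs

# One bit per row for every distinct value.
total_bits = rows * distinct_values
total_bytes = total_bits // 8

print(total_bits)              # number of bits before compression
print(total_bytes / 10 ** 12)  # size in terabytes (decimal) before compression
```

That is 10^14 bits, or 12.5 TB before compression, which shows why high-cardinality columns are a poor match.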
75+
76+
1\. Not suitable for table columns with too many distinct values.
77+
78+
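2\. Likewise, not suitable for columns with too few distinct values, such as gender. Unless such a column can be combined with other columns to filter down to a very small result set, the result set returned is huge, which makes a bitmap index a poor fit.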
79+
80+
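3\. Not suitable for frequent updates, because an update may cause row migration or change the VALUE. Row migration requires updating the entire bitmap string. A VALUE change requires modifying the whole BIT string of every VALUE involved in the change.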
81+
82+
## Greenplum bitmap index manual
83+
### About Bitmap Indexes
84+
Greenplum Database provides the Bitmap index type.
85+
Bitmap indexes are best suited to data warehousing applications and decision support systems with large amounts of data, many ad hoc queries, and few data modification (DML) transactions.
86+
87+
An index provides pointers to the rows in a table that contain a given key value.
88+
A regular index stores a list of tuple IDs for each key corresponding to the rows with that key value.
89+
Bitmap indexes store a bitmap for each key value. Regular indexes can be several times larger than the data in the table,
90+
but bitmap indexes provide the same functionality as a regular index and use a fraction of the size of the indexed data.
91+
92+
Each bit in the bitmap corresponds to a possible tuple ID. If the bit is set, the row with the corresponding tuple ID contains the key value.
93+
A mapping function converts the bit position to a tuple ID. Bitmaps are compressed for storage.
94+
If the number of distinct key values is small, bitmap indexes are much smaller, compress better,
95+
and save considerable space compared with a regular index.
96+
The size of a bitmap index is proportional to the number of rows in the table times the number of distinct values in the indexed column.
97+
98+
Bitmap indexes are most effective for queries that contain multiple conditions in the WHERE clause.
99+
Rows that satisfy some, but not all, conditions are filtered out before the table is accessed.
100+
This improves response time, often dramatically.
101+
102+
### When to Use Bitmap Indexes
103+
Bitmap indexes are best suited to data warehousing applications where users query the data rather than update it.
104+
Bitmap indexes perform best for columns that have between 100 and 100,000 distinct values and when the indexed column is often queried in conjunction with other indexed columns.
105+
Columns with fewer than 100 distinct values, such as a gender column with two distinct values (male and female),
106+
usually do not benefit much from any type of index.
107+
On a column with more than 100,000 distinct values, the performance and space efficiency of a bitmap index decline.
108+
109+
Bitmap indexes can improve query performance for ad hoc queries.
110+
AND and OR conditions in the WHERE clause of a query can be resolved quickly by performing the corresponding Boolean operations directly on the bitmaps before converting the resulting bitmap to tuple ids.
111+
If the resulting number of rows is small, the query can be answered quickly without resorting to a full table scan.
112+
113+
### When Not to Use Bitmap Indexes
114+
Do not use bitmap indexes for unique columns or columns with high cardinality data,
115+
such as customer names or phone numbers.
116+
The performance gains and disk space advantages of bitmap indexes start to diminish on columns with 100,000 or more unique values,
117+
regardless of the number of rows in the table.
118+
119+
Bitmap indexes are not suitable for OLTP applications with large numbers of concurrent transactions modifying the data.
120+
121+
Use bitmap indexes sparingly. Test and compare query performance with and without an index.
122+
Add an index only if query performance improves with indexed columns.
123+
124+
## How to create a bitmap index in Greenplum
125+
```
CREATE INDEX title_bmp_idx ON films USING bitmap (title);
```
128+
129+
## How bitmaps are used in PostgreSQL
130+
131+
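Although PostgreSQL has no bitmap index, queries with multiple conditions can automatically generate bitmaps, which are combined through BitmapAnd and BitmapOr operations to compute the merged result.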
132+
133+
PostgreSQL does not provide a persistent bitmap index.
134+
135+
However, it can build bitmaps in memory to combine multiple indexes.
136+
137+
PostgreSQL scans each needed index and prepares a bitmap in memory giving the
138+
locations of table rows that are reported as matching that index’s conditions.
139+
140+
The bitmaps are then ANDed and ORed together as needed by the query.
141+
142+
Finally, the actual table rows are visited and returned.
143+
144+
## Other uses of bitmaps
145+
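Alibaba Cloud RDS PG extends bitmap support with additional BIT operations. Users can use varbit to maintain BIT indexes (columns) tied to their own business data, for example in user profiling systems, railway ticketing systems, and access-control advertising systems.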
146+
147+
[《阿里云RDS for PostgreSQL varbitx插件与实时画像应用场景介绍》](../201705/20170502_01.md)
148+
149+
[《基于 阿里云 RDS PostgreSQL 打造实时用户画像推荐系统》](../201610/20161021_01.md)
150+
151+
[《PostgreSQL 与 12306 抢火车票的思考》](../201611/20161124_02.md)
152+
153+
[《门禁广告销售系统需求剖析 与 PostgreSQL数据库实现》](../201611/20161124_01.md)
154+
155+
In addition, Roaring bitmap can be embedded into PG as a data type.
156+
157+
https://github.com/zeromax007/gpdb-roaringbitmap
158+
159+
## References
160+
https://gpdb.docs.pivotal.io/4390/admin_guide/ddl/ddl-index.html#topic93
161+
162+
http://leopard.in.ua/2015/04/13/postgresql-indexes#.WRHHH_mGOiQ
163+
164+
[《阿里云RDS for PostgreSQL varbitx插件与实时画像应用场景介绍》](../201705/20170502_01.md)
165+
166+
[《基于 阿里云 RDS PostgreSQL 打造实时用户画像推荐系统》](../201610/20161021_01.md)
167+
168+
https://github.com/zeromax007/gpdb-roaringbitmap
169+

201705/20170512_01_pic_001.jpg

41.2 KB

201705/readme.md

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1,5 +1,6 @@
11
### Article list
22
----
3+
##### 20170512_01.md [《Greenplum 最佳实践 - 什么时候选择bitmap索引》](20170512_01.md)
34
##### 20170511_02.md [《PostgreSQL 异步IO实测》](20170511_02.md)
45
##### 20170511_01.md [《PostgreSQL schemaless 的实现(类mongodb collection)》](20170511_01.md)
56
##### 20170509_03.md [《如何用PostgreSQL节能减排 - 1》](20170509_03.md)

README.md

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -28,6 +28,7 @@ digoal's|PostgreSQL|文章|归类
2828

2929
### Uncategorized documents
3030
----
31+
##### 201705/20170512_01.md [《Greenplum 最佳实践 - 什么时候选择bitmap索引》](201705/20170512_01.md)
3132
##### 201705/20170511_02.md [《PostgreSQL 异步IO实测》](201705/20170511_02.md)
3233
##### 201705/20170511_01.md [《PostgreSQL schemaless 的实现(类mongodb collection)》](201705/20170511_01.md)
3334
##### 201705/20170509_03.md [《如何用PostgreSQL节能减排 - 1》](201705/20170509_03.md)
@@ -602,6 +603,7 @@ digoal's|PostgreSQL|文章|归类
602603
##### 201505/20150511_01.md [《parallel blocking|waiting by slow BLOCK extend relation , ExclusiveLock on extension of relation》](201505/20150511_01.md)
603604
##### 201505/20150510_01.md [《PostgreSQL find out session's current query blocked by which transaction use pg_locks & pg_stat_activity》](201505/20150510_01.md)
604605
##### 201505/20150509_01.md [《PostgreSQL 代码性能诊断之 - OProfile & Systemtap》](201505/20150509_01.md)
606+
##### 201505/20150507_01.md [《Roaring Bitmap - A better compressed bitset》](201505/20150507_01.md)
605607
##### 201505/20150506_07.md [《PostgreSQL 检查点性能影响及源码分析 - 7》](201505/20150506_07.md)
606608
##### 201505/20150506_06.md [《PostgreSQL 检查点性能影响及源码分析 - 6》](201505/20150506_06.md)
607609
##### 201505/20150506_05.md [《PostgreSQL 检查点性能影响及源码分析 - 5》](201505/20150506_05.md)
