digoal
2019-12-05
PostgreSQL , icu , collate , 排序 , 大小写不敏感
PG 12 开始支持icu的collate 不区分大小写,或者区分大小写的排序。
https://www.postgresql.org/docs/12/sql-createcollation.html
CREATE COLLATION [ IF NOT EXISTS ] name (
[ LOCALE = locale, ]
[ LC_COLLATE = lc_collate, ]
[ LC_CTYPE = lc_ctype, ]
[ PROVIDER = provider, ]
[ DETERMINISTIC = boolean, ]
[ VERSION = version ]
)
CREATE COLLATION [ IF NOT EXISTS ] name FROM existing_collation
DETERMINISTIC
Specifies whether the collation should use deterministic comparisons. The default is true. A deterministic comparison considers strings that are not byte-wise equal to be unequal even if they are considered logically equal by the comparison. PostgreSQL breaks ties using a byte-wise comparison. Comparison that is not deterministic can make the collation be, say, case- or accent-insensitive. For that, you need to choose an appropriate LC_COLLATE setting and set the collation to not deterministic here.
Nondeterministic collations are only supported with the ICU provider.
- 默认:deterministic
byte-wise equal to be unequal
- 口音、大小写不敏感比较:not deterministic,同时设置好LC_COLLATE,当前仅支持ICU provider.
DETERMINISTIC关键字表示这个collate是否支持不区分大小写。
Comparison that is not deterministic can make the collation be, say, case- or accent-insensitive. For that, you need to choose an appropriate LC_COLLATE setting and set the collation to not deterministic here.
内置的collate查询如下例子:
db1=# select pg_encoding_to_char(collencoding), * from pg_collation ;
pg_encoding_to_char | oid | collname | collnamespace | collowner | collprovider | collisdeterministic | collencoding | collcollate | collctype | collversion
---------------------+-------+------------------------+---------------+-----------+--------------+---------------------+--------------+-----------------------+-----------------------+-------------
| 100 | default | 11 | 10 | d | t | -1 | | |
| 950 | C | 11 | 10 | c | t | -1 | C | C |
| 951 | POSIX | 11 | 10 | c | t | -1 | POSIX | POSIX |
UTF8 | 12326 | ucs_basic | 11 | 10 | c | t | 6 | C | C |
LATIN1 | 12327 | aa_DJ | 11 | 10 | c | t | 8 | aa_DJ | aa_DJ |
LATIN1 | 12328 | aa_DJ.iso88591 | 11 | 10 | c | t | 8 | aa_DJ.iso88591 | aa_DJ.iso88591 |
UTF8 | 12329 | aa_DJ.utf8 | 11 | 10 | c | t | 6 | aa_DJ.utf8 | aa_DJ.utf8 |
........
几个ICU相关的collate例子:
-[ RECORD 1506 ]----+-----------------------
pg_encoding_to_char |
oid | 13828
collname | zh-x-icu
collnamespace | 11
collowner | 10
collprovider | i
collisdeterministic | t
collencoding | -1
collcollate | zh
collctype | zh
collversion | 58.0.0.50
-[ RECORD 1507 ]----+-----------------------
pg_encoding_to_char |
oid | 13829
collname | zh-Hans-x-icu
collnamespace | 11
collowner | 10
collprovider | i
collisdeterministic | t
collencoding | -1
collcollate | zh_Hans
collctype | zh_Hans
collversion | 58.0.0.50
-[ RECORD 1508 ]----+-----------------------
pg_encoding_to_char |
oid | 13830
collname | zh-Hans-CN-x-icu
collnamespace | 11
collowner | 10
collprovider | i
collisdeterministic | t
collencoding | -1
collcollate | zh_Hans_CN
collctype | zh_Hans_CN
collversion | 58.0.0.50
https://www.postgresql.org/docs/12/collation.html#COLLATION-NONDETERMINISTIC
23.2.2.4. Nondeterministic Collations
A collation is either deterministic or nondeterministic. A deterministic collation uses deterministic comparisons, which means that it considers strings to be equal only if they consist of the same byte sequence. Nondeterministic comparison may determine strings to be equal even if they consist of different bytes. Typical situations include case-insensitive comparison, accent-insensitive comparison, as well as comparison of strings in different Unicode normal forms. It is up to the collation provider to actually implement such insensitive comparisons; the deterministic flag only determines whether ties are to be broken using bytewise comparison. See also Unicode Technical Standard 10 for more information on the terminology.
To create a nondeterministic collation, specify the property deterministic = false to CREATE COLLATION, for example:
CREATE COLLATION ndcoll (provider = icu, locale = 'und', deterministic = false);
This example would use the standard Unicode collation in a nondeterministic way. In particular, this would allow strings in different normal forms to be compared correctly. More interesting examples make use of the ICU customization facilities explained above. For example:
CREATE COLLATION case_insensitive (provider = icu, locale = 'und-u-ks-level2', deterministic = false);
CREATE COLLATION ignore_accents (provider = icu, locale = 'und-u-ks-level1-kc-true', deterministic = false);
All standard and predefined collations are deterministic, all user-defined collations are deterministic by default. While nondeterministic collations give a more “correct” behavior, especially when considering the full power of Unicode and its many special cases, they also have some drawbacks. Foremost, their use leads to a performance penalty. Also, certain operations are not possible with nondeterministic collations, such as pattern matching operations. Therefore, they should be used only in cases where they are specifically wanted.
create database db1 with template template0 encoding 'UTF8';
\c db1
create table test (c1 text);
insert into test values ('a'),('b'),('c'),('A'),('B'),('C');
select * from test order by c1 collate "zh-Hans-x-icu";
c1
----
a
A
b
B
c
C
(6 rows)
db1=# select * from test order by c1 collate "zh_CN";
c1
----
a
A
b
B
c
C
(6 rows)
db1=# select * from test order by c1 collate "POSIX";
c1
----
A
B
C
a
b
c
(6 rows)
db1=# select pg_encoding_to_char(collencoding),* from pg_collation where collname='zh_CN';
pg_encoding_to_char | oid | collname | collnamespace | collowner | collprovider | collisdeterministic | collencoding | collcollate | collctype | collversion
---------------------+-------+----------+---------------+-----------+--------------+---------------------+--------------+-------------+------------+-------------
EUC_CN | 13083 | zh_CN | 11 | 10 | c | t | 2 | zh_CN | zh_CN |
UTF8 | 13240 | zh_CN | 11 | 10 | c | t | 6 | zh_CN.utf8 | zh_CN.utf8 |
(2 rows)
db1=# select pg_encoding_to_char(collencoding),* from pg_collation where collname='zh-Hans-x-icu';
pg_encoding_to_char | oid | collname | collnamespace | collowner | collprovider | collisdeterministic | collencoding | collcollate | collctype | collversion
---------------------+-------+---------------+---------------+-----------+--------------+---------------------+--------------+-------------+-----------+-------------
| 13829 | zh-Hans-x-icu | 11 | 10 | i | t | -1 | zh_Hans | zh_Hans | 58.0.0.50
(1 row)
db1=# select pg_encoding_to_char(collencoding),* from pg_collation where collname='POSIX';
pg_encoding_to_char | oid | collname | collnamespace | collowner | collprovider | collisdeterministic | collencoding | collcollate | collctype | collversion
---------------------+-----+----------+---------------+-----------+--------------+---------------------+--------------+-------------+-----------+-------------
| 951 | POSIX | 11 | 10 | c | t | -1 | POSIX | POSIX |
(1 row)
db1=# CREATE COLLATION case_insensitive (provider = icu, locale = 'zh_Hans', deterministic = false);
CREATE COLLATION
db1=# select 'a'='A' collate "case_insensitive";
?column?
----------
f
(1 row)
db1=# select * from test order by c1 collate "case_insensitive";
c1
----
a
A
b
B
c
C
(6 rows)
db1=# select pg_encoding_to_char(collencoding),* from pg_collation where collname='case_insensitive';
pg_encoding_to_char | oid | collname | collnamespace | collowner | collprovider | collisdeterministic | collencoding | collcollate | collctype | collversion
---------------------+-------+------------------+---------------+-----------+--------------+---------------------+--------------+-------------+-----------+-------------
| 18383 | case_insensitive | 2200 | 10 | i | f | -1 | zh_Hans | zh_Hans | 58.0.0.50
(1 row)
目前collate不支持=操作不区分大小写,目前需要citext插件。理论上未来可以支持。
pattern matching operations 不支持nondeterministic collate.
https://www.postgresql.org/docs/12/sql-expressions.html#SQL-SYNTAX-COLLATE-EXPRS
https://www.postgresql.org/docs/12/collation.html
https://www.postgresql.org/docs/12/sql-createcollation.html
https://www.postgresql.org/docs/12/collation.html#COLLATION-NONDETERMINISTIC
您的愿望将传达给PG kernel hacker、数据库厂商等, 帮助提高数据库产品质量和功能, 说不定下一个PG版本就有您提出的功能点. 针对非常好的提议,奖励限量版PG文化衫、纪念品、贴纸、PG热门书籍等,奖品丰富,快来许愿。开不开森.