Skip to content

Files

Latest commit

Dec 15, 2023
f400df2 · Dec 15, 2023

History

History
315 lines (192 loc) · 13.5 KB

20191205_01.md

File metadata and controls

315 lines (192 loc) · 13.5 KB

PostgreSQL 12 preview - icu collate,支持大小写不敏感、口音不敏感排序

作者

digoal

日期

2019-12-05

标签

PostgreSQL , icu , collate , 排序 , 大小写不敏感


背景

PG 12 开始支持icu的collate 不区分大小写,或者区分大小写的排序。

https://www.postgresql.org/docs/12/sql-createcollation.html

CREATE COLLATION [ IF NOT EXISTS ] name (  
    [ LOCALE = locale, ]  
    [ LC_COLLATE = lc_collate, ]  
    [ LC_CTYPE = lc_ctype, ]  
    [ PROVIDER = provider, ]  
    [ DETERMINISTIC = boolean, ]  
    [ VERSION = version ]  
)  
CREATE COLLATION [ IF NOT EXISTS ] name FROM existing_collation  

DETERMINISTIC

Specifies whether the collation should use deterministic comparisons. The default is true. A deterministic comparison considers strings that are not byte-wise equal to be unequal even if they are considered logically equal by the comparison. PostgreSQL breaks ties using a byte-wise comparison. Comparison that is not deterministic can make the collation be, say, case- or accent-insensitive. For that, you need to choose an appropriate LC_COLLATE setting and set the collation to not deterministic here.

Nondeterministic collations are only supported with the ICU provider.

  • 默认:deterministic
byte-wise equal to be unequal   
  • 口音、大小写不敏感比较:not deterministic,同时设置好LC_COLLATE,当前仅支持ICU provider.

DETERMINISTIC关键字表示这个collate是否支持不区分大小写。

Comparison that is not deterministic can make the collation be, say, case- or accent-insensitive. For that, you need to choose an appropriate LC_COLLATE setting and set the collation to not deterministic here.

内置的collate查询如下例子:

db1=# select pg_encoding_to_char(collencoding), * from pg_collation ;  
 pg_encoding_to_char |  oid  |        collname        | collnamespace | collowner | collprovider | collisdeterministic | collencoding |      collcollate      |       collctype       | collversion   
---------------------+-------+------------------------+---------------+-----------+--------------+---------------------+--------------+-----------------------+-----------------------+-------------  
                     |   100 | default                |            11 |        10 | d            | t                   |           -1 |                       |                       |   
                     |   950 | C                      |            11 |        10 | c            | t                   |           -1 | C                     | C                     |   
                     |   951 | POSIX                  |            11 |        10 | c            | t                   |           -1 | POSIX                 | POSIX                 |   
 UTF8                | 12326 | ucs_basic              |            11 |        10 | c            | t                   |            6 | C                     | C                     |   
 LATIN1              | 12327 | aa_DJ                  |            11 |        10 | c            | t                   |            8 | aa_DJ                 | aa_DJ                 |   
 LATIN1              | 12328 | aa_DJ.iso88591         |            11 |        10 | c            | t                   |            8 | aa_DJ.iso88591        | aa_DJ.iso88591        |   
 UTF8                | 12329 | aa_DJ.utf8             |            11 |        10 | c            | t                   |            6 | aa_DJ.utf8            | aa_DJ.utf8            |   
........  

几个ICU相关的collate例子:

-[ RECORD 1506 ]----+-----------------------  
pg_encoding_to_char |   
oid                 | 13828  
collname            | zh-x-icu  
collnamespace       | 11  
collowner           | 10  
collprovider        | i  
collisdeterministic | t  
collencoding        | -1  
collcollate         | zh  
collctype           | zh  
collversion         | 58.0.0.50  
-[ RECORD 1507 ]----+-----------------------  
pg_encoding_to_char |   
oid                 | 13829  
collname            | zh-Hans-x-icu  
collnamespace       | 11  
collowner           | 10  
collprovider        | i  
collisdeterministic | t  
collencoding        | -1  
collcollate         | zh_Hans  
collctype           | zh_Hans  
collversion         | 58.0.0.50  
-[ RECORD 1508 ]----+-----------------------  
pg_encoding_to_char |   
oid                 | 13830  
collname            | zh-Hans-CN-x-icu  
collnamespace       | 11  
collowner           | 10  
collprovider        | i  
collisdeterministic | t  
collencoding        | -1  
collcollate         | zh_Hans_CN  
collctype           | zh_Hans_CN  
collversion         | 58.0.0.50  

https://www.postgresql.org/docs/12/collation.html#COLLATION-NONDETERMINISTIC

23.2.2.4. Nondeterministic Collations

A collation is either deterministic or nondeterministic. A deterministic collation uses deterministic comparisons, which means that it considers strings to be equal only if they consist of the same byte sequence. Nondeterministic comparison may determine strings to be equal even if they consist of different bytes. Typical situations include case-insensitive comparison, accent-insensitive comparison, as well as comparison of strings in different Unicode normal forms. It is up to the collation provider to actually implement such insensitive comparisons; the deterministic flag only determines whether ties are to be broken using bytewise comparison. See also Unicode Technical Standard 10 for more information on the terminology.

To create a nondeterministic collation, specify the property deterministic = false to CREATE COLLATION, for example:

CREATE COLLATION ndcoll (provider = icu, locale = 'und', deterministic = false);  

This example would use the standard Unicode collation in a nondeterministic way. In particular, this would allow strings in different normal forms to be compared correctly. More interesting examples make use of the ICU customization facilities explained above. For example:

CREATE COLLATION case_insensitive (provider = icu, locale = 'und-u-ks-level2', deterministic = false);  
CREATE COLLATION ignore_accents (provider = icu, locale = 'und-u-ks-level1-kc-true', deterministic = false);  

All standard and predefined collations are deterministic, all user-defined collations are deterministic by default. While nondeterministic collations give a more “correct” behavior, especially when considering the full power of Unicode and its many special cases, they also have some drawbacks. Foremost, their use leads to a performance penalty. Also, certain operations are not possible with nondeterministic collations, such as pattern matching operations. Therefore, they should be used only in cases where they are specifically wanted.

例子

create database db1 with template template0 encoding 'UTF8';  
  
\c db1  
  
create table test (c1 text);  
insert into test values ('a'),('b'),('c'),('A'),('B'),('C');  
  
select * from test order by c1 collate "zh-Hans-x-icu";  
  
 c1   
----  
 a  
 A  
 b  
 B  
 c  
 C  
(6 rows)  
  
  
db1=# select * from test order by c1 collate "zh_CN";  
 c1   
----  
 a  
 A  
 b  
 B  
 c  
 C  
(6 rows)  
  
db1=# select * from test order by c1 collate "POSIX";  
 c1   
----  
 A  
 B  
 C  
 a  
 b  
 c  
(6 rows)  
db1=# select pg_encoding_to_char(collencoding),* from pg_collation where collname='zh_CN';  
 pg_encoding_to_char |  oid  | collname | collnamespace | collowner | collprovider | collisdeterministic | collencoding | collcollate | collctype  | collversion   
---------------------+-------+----------+---------------+-----------+--------------+---------------------+--------------+-------------+------------+-------------  
 EUC_CN              | 13083 | zh_CN    |            11 |        10 | c            | t                   |            2 | zh_CN       | zh_CN      |   
 UTF8                | 13240 | zh_CN    |            11 |        10 | c            | t                   |            6 | zh_CN.utf8  | zh_CN.utf8 |   
(2 rows)  
  
db1=# select pg_encoding_to_char(collencoding),* from pg_collation where collname='zh-Hans-x-icu';  
 pg_encoding_to_char |  oid  |   collname    | collnamespace | collowner | collprovider | collisdeterministic | collencoding | collcollate | collctype | collversion   
---------------------+-------+---------------+---------------+-----------+--------------+---------------------+--------------+-------------+-----------+-------------  
                     | 13829 | zh-Hans-x-icu |            11 |        10 | i            | t                   |           -1 | zh_Hans     | zh_Hans   | 58.0.0.50  
(1 row)  
  
db1=# select pg_encoding_to_char(collencoding),* from pg_collation where collname='POSIX';  
 pg_encoding_to_char | oid | collname | collnamespace | collowner | collprovider | collisdeterministic | collencoding | collcollate | collctype | collversion   
---------------------+-----+----------+---------------+-----------+--------------+---------------------+--------------+-------------+-----------+-------------  
                     | 951 | POSIX    |            11 |        10 | c            | t                   |           -1 | POSIX       | POSIX     |   
(1 row)  
  
  
db1=# CREATE COLLATION case_insensitive (provider = icu, locale = 'zh_Hans', deterministic = false);  
CREATE COLLATION  
  
db1=# select 'a'='A' collate "case_insensitive";  
 ?column?   
----------  
 f  
(1 row)  
  
db1=# select * from test order by c1 collate "case_insensitive";  
 c1   
----  
 a  
 A  
 b  
 B  
 c  
 C  
(6 rows)  
  
db1=# select pg_encoding_to_char(collencoding),* from pg_collation where collname='case_insensitive';  
 pg_encoding_to_char |  oid  |     collname     | collnamespace | collowner | collprovider | collisdeterministic | collencoding | collcollate | collctype | collversion   
---------------------+-------+------------------+---------------+-----------+--------------+---------------------+--------------+-------------+-----------+-------------  
                     | 18383 | case_insensitive |          2200 |        10 | i            | f                   |           -1 | zh_Hans     | zh_Hans   | 58.0.0.50  
(1 row)  

小结

目前collate不支持=操作不区分大小写,目前需要citext插件。理论上未来可以支持。

pattern matching operations 不支持nondeterministic collate.

参考

https://www.postgresql.org/docs/12/sql-expressions.html#SQL-SYNTAX-COLLATE-EXPRS

https://www.postgresql.org/docs/12/collation.html

https://www.postgresql.org/docs/12/sql-createcollation.html

https://www.postgresql.org/docs/12/collation.html#COLLATION-NONDETERMINISTIC

您的愿望将传达给PG kernel hacker、数据库厂商等, 帮助提高数据库产品质量和功能, 说不定下一个PG版本就有您提出的功能点. 针对非常好的提议,奖励限量版PG文化衫、纪念品、贴纸、PG热门书籍等,奖品丰富,快来许愿。开不开森.

digoal's wechat