|
| 1 | +## PostgreSQL 锁等待排查实践 - 珍藏级 - process xxx1 acquired RowExclusiveLock on relation xxx2 of database xxx3 after xxx4 ms at xxx |
| 2 | + |
| 3 | +### 作者 |
| 4 | +digoal |
| 5 | + |
| 6 | +### 日期 |
| 7 | +2018-06-22 |
| 8 | + |
| 9 | +### 标签 |
| 10 | +PostgreSQL , 锁等待 , ddl , 大锁 , 雪崩 , lock_timeout , deadlock_timeout , AB表切换 , 分区表 |
| 11 | + |
| 12 | +---- |
| 13 | + |
| 14 | +## 背景 |
| 15 | +当SQL请求锁等待超过deadlock_timeout指定的时间时,报类似如下日志: |
| 16 | + |
| 17 | +``` |
| 18 | +LOG: process xxx1 acquired RowExclusiveLock on relation xxx2 of database xxx3 after xxx4 ms at xxx |
| 19 | +STATEMENT: INSERT ........... |
| 20 | +``` |
| 21 | + |
| 22 | +解释: |
| 23 | + |
| 24 | +xxx1进程请求位于数据库xxx3中的xxx2对象的RowExclusiveLock锁,已等待xxx4秒。 |
| 25 | + |
| 26 | +同学们可能会纳闷,怎么insert也会等待? |
| 27 | + |
| 28 | +其实不管什么操作,都有可能出现等待,只要请求的锁与已有或已经在队列中的LOCK级别有冲突就会出现等待。 |
| 29 | + |
| 30 | + |
| 31 | + |
| 32 | +https://www.postgresql.org/docs/10/static/explicit-locking.html |
| 33 | + |
| 34 | +src/include/storage/lockdefs.h |
| 35 | + |
| 36 | +``` |
| 37 | +/* NoLock is not a lock mode, but a flag value meaning "don't get a lock" */ |
| 38 | +#define NoLock 0 |
| 39 | + |
| 40 | +#define AccessShareLock 1 /* SELECT */ |
| 41 | +#define RowShareLock 2 /* SELECT FOR UPDATE/FOR SHARE */ |
| 42 | +#define RowExclusiveLock 3 /* INSERT, UPDATE, DELETE */ |
| 43 | +#define ShareUpdateExclusiveLock 4 /* VACUUM (non-FULL),ANALYZE, CREATE INDEX |
| 44 | + * CONCURRENTLY */ |
| 45 | +#define ShareLock 5 /* CREATE INDEX (WITHOUT CONCURRENTLY) */ |
| 46 | +#define ShareRowExclusiveLock 6 /* like EXCLUSIVE MODE, but allows ROW |
| 47 | + * SHARE */ |
| 48 | +#define ExclusiveLock 7 /* blocks ROW SHARE/SELECT...FOR UPDATE */ |
| 49 | +#define AccessExclusiveLock 8 /* ALTER TABLE, DROP TABLE, VACUUM FULL, |
| 50 | + * and unqualified LOCK TABLE */ |
| 51 | +``` |
| 52 | + |
| 53 | +## 如何分析? - 实例讲解 |
| 54 | +请参考万能文章: |
| 55 | + |
| 56 | +[《PostgreSQL 锁等待监控 珍藏级SQL - 谁堵塞了谁》](../201705/20170521_01.md) |
| 57 | + |
| 58 | +1、开启审计日志 |
| 59 | + |
| 60 | +``` |
| 61 | +log_destination = 'csvlog' |
| 62 | +logging_collector = on |
| 63 | +log_truncate_on_rotation = on |
| 64 | +log_statement = 'all' |
| 65 | +``` |
| 66 | + |
| 67 | +2、psql 挂一个打印锁等待的窗口 |
| 68 | + |
| 69 | +``` |
| 70 | +psql |
| 71 | + |
| 72 | +with |
| 73 | +t_wait as |
| 74 | +( |
| 75 | + select a.mode,a.locktype,a.database,a.relation,a.page,a.tuple,a.classid,a.granted, |
| 76 | + a.objid,a.objsubid,a.pid,a.virtualtransaction,a.virtualxid,a.transactionid,a.fastpath, |
| 77 | + b.state,b.query,b.xact_start,b.query_start,b.usename,b.datname,b.client_addr,b.client_port,b.application_name |
| 78 | + from pg_locks a,pg_stat_activity b where a.pid=b.pid and not a.granted |
| 79 | +), |
| 80 | +t_run as |
| 81 | +( |
| 82 | + select a.mode,a.locktype,a.database,a.relation,a.page,a.tuple,a.classid,a.granted, |
| 83 | + a.objid,a.objsubid,a.pid,a.virtualtransaction,a.virtualxid,a.transactionid,a.fastpath, |
| 84 | + b.state,b.query,b.xact_start,b.query_start,b.usename,b.datname,b.client_addr,b.client_port,b.application_name |
| 85 | + from pg_locks a,pg_stat_activity b where a.pid=b.pid and a.granted |
| 86 | +), |
| 87 | +t_overlap as |
| 88 | +( |
| 89 | + select r.* from t_wait w join t_run r on |
| 90 | + ( |
| 91 | + r.locktype is not distinct from w.locktype and |
| 92 | + r.database is not distinct from w.database and |
| 93 | + r.relation is not distinct from w.relation and |
| 94 | + r.page is not distinct from w.page and |
| 95 | + r.tuple is not distinct from w.tuple and |
| 96 | + r.virtualxid is not distinct from w.virtualxid and |
| 97 | + r.transactionid is not distinct from w.transactionid and |
| 98 | + r.classid is not distinct from w.classid and |
| 99 | + r.objid is not distinct from w.objid and |
| 100 | + r.objsubid is not distinct from w.objsubid and |
| 101 | + r.pid <> w.pid |
| 102 | + ) |
| 103 | +), |
| 104 | +t_unionall as |
| 105 | +( |
| 106 | + select r.* from t_overlap r |
| 107 | + union all |
| 108 | + select w.* from t_wait w |
| 109 | +) |
| 110 | +select locktype,datname,relation::regclass,page,tuple,virtualxid,transactionid::text,classid::regclass,objid,objsubid, |
| 111 | +string_agg( |
| 112 | +'Pid: '||case when pid is null then 'NULL' else pid::text end||chr(10)|| |
| 113 | +'Lock_Granted: '||case when granted is null then 'NULL' else granted::text end||' , Mode: '||case when mode is null then 'NULL' else mode::text end||' , FastPath: '||case when fastpath is null then 'NULL' else fastpath::text end||' , VirtualTransaction: '||case when virtualtransaction is null then 'NULL' else virtualtransaction::text end||' , Session_State: '||case when state is null then 'NULL' else state::text end||chr(10)|| |
| 114 | +'Username: '||case when usename is null then 'NULL' else usename::text end||' , Database: '||case when datname is null then 'NULL' else datname::text end||' , Client_Addr: '||case when client_addr is null then 'NULL' else client_addr::text end||' , Client_Port: '||case when client_port is null then 'NULL' else client_port::text end||' , Application_Name: '||case when application_name is null then 'NULL' else application_name::text end||chr(10)|| |
| 115 | +'Xact_Start: '||case when xact_start is null then 'NULL' else xact_start::text end||' , Query_Start: '||case when query_start is null then 'NULL' else query_start::text end||' , Xact_Elapse: '||case when (now()-xact_start) is null then 'NULL' else (now()-xact_start)::text end||' , Query_Elapse: '||case when (now()-query_start) is null then 'NULL' else (now()-query_start)::text end||chr(10)|| |
| 116 | +'SQL (Current SQL in Transaction): '||chr(10)|| |
| 117 | +case when query is null then 'NULL' else query::text end, |
| 118 | +chr(10)||'--------'||chr(10) |
| 119 | +order by |
| 120 | + ( case mode |
| 121 | + when 'INVALID' then 0 |
| 122 | + when 'AccessShareLock' then 1 |
| 123 | + when 'RowShareLock' then 2 |
| 124 | + when 'RowExclusiveLock' then 3 |
| 125 | + when 'ShareUpdateExclusiveLock' then 4 |
| 126 | + when 'ShareLock' then 5 |
| 127 | + when 'ShareRowExclusiveLock' then 6 |
| 128 | + when 'ExclusiveLock' then 7 |
| 129 | + when 'AccessExclusiveLock' then 8 |
| 130 | + else 0 |
| 131 | + end ) desc, |
| 132 | + (case when granted then 0 else 1 end) |
| 133 | +) as lock_conflict |
| 134 | +from t_unionall |
| 135 | +group by |
| 136 | +locktype,datname,relation,page,tuple,virtualxid,transactionid::text,classid,objid,objsubid ; |
| 137 | + |
| 138 | +\watch 0.2 |
| 139 | +``` |
| 140 | + |
| 141 | +3、tail 挂一个日志观测窗口 |
| 142 | + |
| 143 | +``` |
| 144 | +for ((i=1;i>0;i=1)); do grep RowExclusiveLock *.csv ; sleep 0.2; done |
| 145 | + |
| 146 | +或 |
| 147 | + |
| 148 | +for ((i=1;i>0;i=1)); do grep acquired *.csv ; sleep 0.2; done |
| 149 | +``` |
| 150 | + |
| 151 | +4、发现问题 |
| 152 | + |
| 153 | +在业务运行过程中,如果问题复现,一定能观测到日志。 |
| 154 | + |
| 155 | +通过锁等待的窗口,观测到其中一个会话对表xxx持有了```accessExclusiveLock```,也就是排他锁。通过前面的锁冲突表,你会发现这个锁和所有锁都冲突(实际上```DDL,VACUUM FULL```等操作都会持有排它锁,或者人为的发出```lock table xxx in access exclusive mode;```) |
| 156 | + |
| 157 | +所以,当然会堵塞其他对该表的INSERT操作了。 |
| 158 | + |
| 159 | +4\.1、根据锁等待的窗口拿到持有锁的PID,到审计日志里面查看这个PID在当前事务中,前面都发起了什么SQL。 |
| 160 | + |
| 161 | +4\.2、找到问题根源,原来这个事务发起了```ALTER TABLE XXX RENAME TO XXXXX;```的动作。 |
| 162 | + |
| 163 | +**这个事件也引起了INSERT变慢(实际上是锁等待,实践都花在了等待上面)的问题。** |
| 164 | + |
| 165 | +很显然,业务上应该经常会有一些触发改表名的动作,比如为了防止一个表太大,经常做AB表切换的动作。 |
| 166 | + |
| 167 | +虽然切换表名只是修改元数据,但是这个瞬间的锁,在高并发的业务场景中,也会带来堵塞危害。 |
| 168 | + |
| 169 | +建议用户采用分区表,而不要在高并发业务中频繁使用DDL这样的大锁高危操作。 |
| 170 | + |
| 171 | +## 小结 |
| 172 | +本文讲解了锁等待问题的排查方法。原因实际上是用户在业务中使用了AB表切换,虽然切换表名只是修改元数据,但是这个瞬间的锁,在高并发的业务场景中,也会带来堵塞危害。 |
| 173 | + |
| 174 | +建议用户采用分区表,而不要在高并发业务中频繁使用DDL这样的大锁高危操作。 |
| 175 | + |
| 176 | +[《PostgreSQL 9.x, 10, 11 hash分区表 用法举例》](../201805/20180524_05.md) |
| 177 | + |
| 178 | +[《PostgreSQL 11 preview - 分区表 增强 汇总》](../201805/20180519_01.md) |
| 179 | + |
| 180 | +[《PostgreSQL 查询涉及分区表过多导致的性能问题 - 性能诊断与优化(大量BIND, spin lock, SLEEP进程)》](../201801/20180124_01.md) |
| 181 | + |
| 182 | +[《PostgreSQL 商用版本EPAS(阿里云ppas(Oracle 兼容版)) - 分区表性能优化 (堪比pg_pathman)》](../201801/20180122_03.md) |
| 183 | + |
| 184 | +[《PostgreSQL 传统 hash 分区方法和性能》](../201711/20171122_02.md) |
| 185 | + |
| 186 | +[《PostgreSQL 10 内置分区 vs pg_pathman perf profiling》](../201710/20171015_01.md) |
| 187 | + |
| 188 | +[《PostgreSQL 10.0 preview 功能增强 - 内置分区表》](../201612/20161215_01.md) |
| 189 | + |
| 190 | +[《PostgreSQL 9.5+ 高效分区表实现 - pg_pathman》](../201610/20161024_01.md) |
| 191 | + |
| 192 | +另外。大锁操作,一定要注意防止雪崩。 |
| 193 | + |
| 194 | +最后,遇到问题要冷静思考,不要轻易认为数据库有问题。 |
| 195 | + |
| 196 | + |
| 197 | +## 参考 |
| 198 | +[《PostgreSQL 锁等待监控 珍藏级SQL - 谁堵塞了谁》](../201705/20170521_01.md) |
| 199 | + |
| 200 | +[《PostgreSQL 设置单条SQL的执行超时 - 防雪崩》](201712/20171211_02.md) |
| 201 | + |
| 202 | +[《如何防止数据库雪崩(泛洪 flood)》](201609/20160909_01.md) |
| 203 | + |
| 204 | + |
| 205 | + |
| 206 | +<a rel="nofollow" href="http://info.flagcounter.com/h9V1" ><img src="http://s03.flagcounter.com/count/h9V1/bg_FFFFFF/txt_000000/border_CCCCCC/columns_2/maxflags_12/viewers_0/labels_0/pageviews_0/flags_0/" alt="Flag Counter" border="0" ></a> |
| 207 | + |
0 commit comments