
Commit 1350afe

improve

1 parent 44dbfb8 commit 1350afe

File tree

6 files changed: +211 −0 lines changed

201807/20180704_02.md

Lines changed: 114 additions & 0 deletions
@@ -0,0 +1,114 @@

## How PostgreSQL Detects Distributed Deadlocks - postgrespro pg_shardman

### Author
digoal

### Date
2018-07-04

### Tags
PostgreSQL , postgrespro , pg_shardman , distributed deadlock

----

## Background
Deadlock detection on a single node is relatively easy to implement: check whether the transactions waiting on each other form a cycle. In the figure below, A waits for B, B waits for C, and C waits for A; the cycle means a deadlock.

![pic](20180704_02_pic_001.jpg)

For distributed transactions, however, the transactions waiting on each other may live on different data nodes. How can a distributed deadlock be detected?

To look for a cycle, a global transaction ID must be introduced, or at least some identifier that marks the corresponding transaction on every data node. Only then can the wait-for graph be drawn and checked for cycles.

Moreover, checking a distributed transaction for deadlock means reading the lock-wait information on each data node separately, and those reads happen at different points in time. A single observation of a cycle therefore cannot prove a real deadlock (for instance, by the time you inspect the other nodes, the wait you saw first may already be gone - a "false deadlock"). Two rounds of collection are needed, and only when the same wait cycle (identical in every detail, including the GIDs) shows up in both rounds is it treated as a real deadlock.

![pic](20180704_02_pic_002.jpg)
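
To make the cycle check concrete, here is a minimal sketch (not pg_shardman's actual implementation) of running the check over a merged wait-for graph with a recursive query. The table name global_lock_graph and its columns are hypothetical placeholders for the per-node lock graphs combined into one relation.

```
-- Hypothetical table holding the merged lock graph: one row per
-- "waiter -> holder" edge, each process identified by (node, pid).
CREATE TEMP TABLE global_lock_graph (
    wait_node int, wait_pid int,   -- the waiting process
    hold_node int, hold_pid int    -- the process holding the lock
);

-- Walk the graph from every edge; if a walk ever returns to its
-- starting process, the graph contains a cycle, i.e. a deadlock.
WITH RECURSIVE walk AS (
    SELECT wait_node AS start_node, wait_pid AS start_pid,
           hold_node, hold_pid, 1 AS depth
    FROM global_lock_graph
    UNION ALL
    SELECT w.start_node, w.start_pid,
           g.hold_node, g.hold_pid, w.depth + 1
    FROM walk w
    JOIN global_lock_graph g
      ON g.wait_node = w.hold_node AND g.wait_pid = w.hold_pid
    WHERE w.depth < 100            -- bound the recursion on cyclic graphs
)
SELECT DISTINCT start_node, start_pid
FROM walk
WHERE hold_node = start_node AND hold_pid = start_pid;
```

As noted above, one round of such a query is not enough; only a cycle that appears unchanged in two consecutive rounds should be treated as a real deadlock.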

## postgrespro pg_shardman
pg_shardman has no built-in deadlock detection. Instead it exposes a UDF: the user calls it, it loops forever checking for distributed deadlocks, and it randomly cancels one of the queries involved in any distributed deadlock it finds.

```
monitor(check_timeout_sec int = 5, rm_node_timeout_sec int = 60)
```

Monitor the cluster for distributed deadlocks and node failures. This function is intended to be executed at the shardlord and is redirected to the shardlord when launched at any other node. It starts an infinite loop which polls all cluster nodes, collecting local ```lock graphs``` from all of them. The poll period is specified by the ```check_timeout_sec``` parameter (default 5 seconds). The local lock graphs are combined into a global lock graph, which is analyzed for the presence of loops. A loop in the lock graph means a distributed deadlock. The monitor function tries to resolve the deadlock by canceling one or more backends involved in the deadlock loop (using the ```pg_cancel_backend``` function, which doesn't actually terminate the backend but tries to cancel its current query). The canceled backend is chosen randomly within the deadlock loop. Since not all deadlock members are stuck in the 'active query' state, the cancel may need to be sent several times.

Since the local graphs collected from the nodes do not form a consistent global snapshot, false positives are possible: the edges in a deadlock loop correspond to different moments in time. To prevent false deadlock detection, the monitor function doesn't react to a detected deadlock immediately. Instead, the deadlock loop found at the previous iteration is compared with the current one, and only if they are equal is the deadlock reported and resolution performed.

If some node is unreachable, the monitor function prints a corresponding error message and retries access until the ```rm_node_timeout_sec``` timeout expires. After that the node is removed from the cluster using the ```shardman.rm_node``` function. If the redundancy level is non-zero, primary partitions from the disabled node are replaced with replicas. Finally, ```pg_shardman``` performs recovery of distributed transactions for which the failed node was the coordinator. This is done with the ```shardman.recover_xacts()``` function, which collects the status of the distributed transaction at all participants and tries to decide whether it should be committed or aborted. If ```rm_node_timeout_sec``` is ```NULL```, ```monitor``` will not remove nodes.
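
For reference, a call might look like the sketch below (assuming the function lives in the shardman schema like the other objects shown later; the values are simply the documented defaults):

```
-- Run on the shardlord in a dedicated session: the call enters an
-- infinite polling loop and does not return.
SELECT shardman.monitor(check_timeout_sec := 5, rm_node_timeout_sec := 60);

-- Pass NULL for rm_node_timeout_sec if unreachable nodes should never be
-- removed automatically:
-- SELECT shardman.monitor(5, NULL);
```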

## References
https://github.com/postgrespro/pg_shardman/blob/native_partitioning/pg_shardman--0.0.3.sql

Some of the objects that pg_shardman's distributed deadlock detection relies on:

```
-- Type to represent vertex in lock graph
create type process as (node int, pid int);

-- View to build lock graph which can be used to detect global deadlock.
-- Application_name is assumed pgfdw:$system_id:$coord_pid
-- gid is assumed pgfdw:$timestamp:$sys_id:$pid:$xid:$participants_count:$coord_count
-- Currently we are oblivious about lock modes and report any wait -> hold edge
-- on the same object and therefore might produce false loops. Furthermore,
-- we have no idea about locking queues here. Probably it is better to use
-- pg_blocking_pids, but it seems to ignore prepared xacts.
CREATE VIEW lock_graph(wait, hold) AS
    -- local dependencies
    -- If xact is already prepared, we take node and pid of the coordinator.
    SELECT
        ROW(shardman.get_my_id(),
            wait.pid)::shardman.process,
        CASE WHEN hold.pid IS NOT NULL THEN
            ROW(shardman.get_my_id(), hold.pid)::shardman.process
        ELSE -- prepared
            ROW(shardman.get_node_by_sysid(split_part(gid, ':', 3)::bigint),
                split_part(gid, ':', 4)::int)::shardman.process
        END
    FROM pg_locks wait, pg_locks hold LEFT OUTER JOIN pg_prepared_xacts twopc
        ON twopc.transaction=hold.transactionid
    WHERE
        NOT wait.granted AND wait.pid IS NOT NULL AND hold.granted AND
        -- waiter waits for the object holder locks
        wait.database IS NOT DISTINCT FROM hold.database AND
        wait.relation IS NOT DISTINCT FROM hold.relation AND
        wait.page IS NOT DISTINCT FROM hold.page AND
        wait.tuple IS NOT DISTINCT FROM hold.tuple AND
        wait.virtualxid IS NOT DISTINCT FROM hold.virtualxid AND
        wait.transactionid IS NOT DISTINCT FROM hold.transactionid AND -- waiting on xid
        wait.classid IS NOT DISTINCT FROM hold.classid AND
        wait.objid IS NOT DISTINCT FROM hold.objid AND
        wait.objsubid IS NOT DISTINCT FROM hold.objsubid AND
        -- this is most probably truism, but who knows
        (hold.pid IS NOT NULL OR twopc.gid IS NOT NULL)
    UNION ALL
    -- if this fdw backend is busy, potentially waiting, add edge coordinator -> fdw
    SELECT ROW(shardman.get_node_by_sysid(split_part(application_name, ':', 2)::bigint),
               split_part(application_name,':',3)::int)::shardman.process,
           ROW(shardman.get_my_id(),
               pid)::shardman.process
    FROM pg_stat_activity WHERE application_name LIKE 'pgfdw:%' AND wait_event<>'ClientRead'
    UNION ALL
    -- otherwise, coordinator itself is busy, potentially waiting, so add fdw ->
    -- coordinator edge
    SELECT ROW(shardman.get_my_id(),
               pid)::shardman.process,
           ROW(shardman.get_node_by_sysid(split_part(application_name,':',2)::bigint),
               split_part(application_name,':',3)::int)::shardman.process
    FROM pg_stat_activity WHERE application_name LIKE 'pgfdw:%' AND wait_event='ClientRead';

-- Pack lock graph into comma-separated string of edges like "2:17439->4:30046",
-- i.e. pid 17439 on node 2 waits for pid 30046 on node 4
CREATE FUNCTION serialize_lock_graph() RETURNS TEXT AS $$
    SELECT COALESCE(
        string_agg((wait).node || ':' || (wait).pid || '->' ||
                   (hold).node || ':' || (hold).pid,
                   ','),
        '')
    FROM shardman.lock_graph;
$$ LANGUAGE sql;
```
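
A quick way to inspect the collected edges on a single node (again assuming the objects above are installed in the shardman schema):

```
-- Returns the local lock graph as a comma-separated list of edges, e.g.
-- "2:17439->4:30046": pid 17439 on node 2 waits for pid 30046 on node 4.
SELECT shardman.serialize_lock_graph();
```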

201807/20180704_02_pic_001.jpg

14.1 KB

201807/20180704_02_pic_002.jpg

27.1 KB

201807/20180704_04.md

Lines changed: 93 additions & 0 deletions
@@ -0,0 +1,93 @@

## PostgreSQL quorum-based synchronous replication: how the zero-loss guarantee can break in extreme cases

### Author
digoal

### Date
2018-07-04

### Tags
PostgreSQL , quorum-based sync replication , 2PC , zero data loss

----

## Background
Starting with version 9.6, PostgreSQL supports quorum-based synchronous replication: when a client asks to end a transaction, the primary does not reply until the transaction's WAL (redo) has been replicated to the required number of replicas.

This guarantees that by the time the client receives the primary's acknowledgment, the WAL already exists in multiple copies, i.e. zero data loss.
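
As an illustration, a quorum setup could look like the sketch below; this is standard PostgreSQL configuration (the standby names s1, s2, s3 are placeholders), with commits waiting until any 2 of the 3 listed standbys confirm receipt of the transaction's WAL:

```
-- Wait for confirmation from ANY 2 of the three named standbys before
-- acknowledging a commit to the client.
ALTER SYSTEM SET synchronous_standby_names = 'ANY 2 (s1, s2, s3)';
ALTER SYSTEM SET synchronous_commit = 'on';
SELECT pg_reload_conf();
```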
18+
19+
但是在极端情况下,可能无法保障0丢失,为什么呢?
20+
21+
因为WAL日志先落主库,然后等备库复制,最后反馈客户端。
22+
23+
比如用户提交事务前,备库挂了,实际上主库本地的WAL已经写了,数据在本地已经持久化。只是主库要等符合quorum based个数备库那边接收到WAL的位点反馈,才反馈给客户端提交成功。
24+
25+
因此,当客户端在等待过程中,如果连接中断,就会造成本地已提交,但是有可能某些备库没有收到WAL的情况。
26+
27+
客户端如果重新与主库建立连接,它可以看到已提交的数据。
28+
29+
在pg_shardman的介绍中,也提到了类似的问题
30+
31+
https://github.com/postgrespro/pg_shardman
32+
33+
34+
The trade-off is well-known: asynchronous replication is faster, but allows the replica to lag arbitrarily far behind the primary, which might lead to the loss of a bunch of recently committed transactions (if the primary fails), or to WAL puffing up in case of replica failure. Synchronous replication is slower, but committed transactions are typically not dropped. Typically, because it is actually still possible to lose them without some kind of 2PC commit. Imagine the following scenario:

- The primary's connection with the replica is torn down.
- The primary executes a transaction, e.g. adds some row with id 42, commits it locally and blocks because there is no connection with the replica.
- The client suddenly loses its connection with the primary for a moment and reconnects to learn the status of the transaction, sees the row with id 42 and thinks that it has been committed.
- Now the primary fails permanently and we switch to the replica. The replica has no idea of that transaction, but the client is sure it is committed.

### 2PC can guard against a single loss of the client connection

```
Pre-commit (prepare) succeeds

2PC commit is issued

A standby fails

The 2PC commit succeeds on the primary

The client loses its connection

The client reconnects, finds that the 2PC commit has already succeeded on the primary, and can see the data

The primary fails

Failover to the standby; the client reconnects and finds that the standby does not have the data yet. Because the 2PC transaction has not finished there, a human can step in to complete it, so no data is lost.
```
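
A minimal sketch of the 2PC commands behind these steps (the transaction identifier 'tx1' and table t are placeholders): the pre-commit is PREPARE TRANSACTION and the final 2PC commit is COMMIT PREPARED. A prepared transaction survives crashes and can be completed or rolled back later, which is what leaves room for the manual intervention mentioned above.

```
BEGIN;
INSERT INTO t VALUES (42);
-- pre-commit: persist the transaction so it can be finished later,
-- even after a crash or failover (requires max_prepared_transactions > 0)
PREPARE TRANSACTION 'tx1';

-- the final 2PC commit, issued once the outcome is decided
COMMIT PREPARED 'tx1';
-- or, to abandon it: ROLLBACK PREPARED 'tx1';
```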

### 2PC cannot guard against more than one loss of the client connection

```
Pre-commit is issued

A standby fails

The pre-commit succeeds on the primary

The client loses its connection

The client reconnects, finds that the pre-commit has already succeeded on the primary, and issues the 2PC commit

The client loses its connection again

The client reconnects, finds that the 2PC commit has succeeded on the primary, and can see the data

The primary fails

Failover to the standby; the client reconnects and finds that the standby does not have the data: data is lost.
```

## Summary
When standbys fail and the quorum cannot be met, a committing client is left waiting. If that client loses its connection and then issues a new request, it can see data that was committed only on the primary. For 2PC transactions, losing the connection more than once leads to the same problem.

## References
[《PG多节点(quorum based), 0丢失 HA(failover,switchover)方案》](../201706/20170612_02.md)

[《PostgreSQL 一主多从(多副本,强同步)简明手册 - 配置、压测、监控、切换、防脑裂、修复、0丢失 - 珍藏级》](../201803/20180326_01.md)

201807/readme.md

Lines changed: 2 additions & 0 deletions
@@ -2,6 +2,8 @@

### Article list
----
##### 20180704_04.md [《PostgreSQL quorum based同步复制模式在极端情况下的0丢失破坏问题》](20180704_04.md)
##### 20180704_02.md [《PostgreSQL 如何检测分布式死锁 - postgrespro pg_shardman》](20180704_02.md)
##### 20180704_01.md [《PostgreSQL 模拟两个update语句死锁 - 利用扫描方法》](20180704_01.md)
##### 20180703_02.md [《PostgreSQL 电商业务(任意维度商品圈选应用) - json包range数组的命中优化 - 展开+索引优化》](20180703_02.md)
##### 20180703_01.md [《PostgreSQL 函数式索引使用注意 - 暨非immutable函数不适合索引的原因》](20180703_01.md)

README.md

Lines changed: 2 additions & 0 deletions
@@ -31,6 +31,8 @@ digoal's|PostgreSQL|文章|归类

### Uncategorized documents
----
##### 201807/20180704_04.md [《PostgreSQL quorum based同步复制模式在极端情况下的0丢失破坏问题》](201807/20180704_04.md)
##### 201807/20180704_02.md [《PostgreSQL 如何检测分布式死锁 - postgrespro pg_shardman》](201807/20180704_02.md)
##### 201807/20180704_01.md [《PostgreSQL 模拟两个update语句死锁 - 利用扫描方法》](201807/20180704_01.md)
##### 201807/20180703_02.md [《PostgreSQL 电商业务(任意维度商品圈选应用) - json包range数组的命中优化 - 展开+索引优化》](201807/20180703_02.md)
##### 201807/20180703_01.md [《PostgreSQL 函数式索引使用注意 - 暨非immutable函数不适合索引的原因》](201807/20180703_01.md)
