
chunkservers get disconnected #230

@szabolcsf

Description


We have deployed qfs 2.0 on a cluster of 200 dedicated servers running Debian 9 x86-64. Each server has 10 hard drives, and we run one chunkserver per hard drive. We run a single metaserver on a separate server.
These servers have nothing else running on them. They don't suffer from high load or any resource bottlenecks.

Now there's a problem: every time we start writing to the cluster chunkservers get temporarily disconnected and end up in the dead nodes history. It doesn't matter how much data we write, a few megabytes are enough to trigger the problem.

This is what it looks like from the metaserver:
08-07-2018 01:08:34.061 DEBUG - (NetConnection.cc:108) netconn: 1262 read: Connection reset by peer 104
08-07-2018 01:08:34.061 ERROR - (ChunkServer.cc:1304) 10.10.1.5 21014 / 10.10.1.5:42806 chunk server down reason: communication error socket: good: 0 status: Connection reset by peer 104 -104
08-07-2018 01:08:34.061 INFO - (LayoutManager.cc:6019) server down: 10.10.1.5 21014 block count: 36595 master: 0 replay: 0 reason: communication error; Connection reset by peer 104 chunk-server-bye: 10.10.1.5 21014 logseq: 0 0 893823745 chunks: 36595 checksum: 0 2030060475277 36752 log: in flight: 0 repl delay: 7200 completion: no

and this is what it looks like on a chunkserver:
08-07-2018 01:08:34.036 ERROR - (MetaServerSM.cc:592) 10.10.1.3 20100 meta server inactivity timeout, last request received: 17 secs ago timeout: inactivity: 40 receive: 16
08-07-2018 01:08:34.036 ERROR - (MetaServerSM.cc:1043) 10.10.1.3 20100 closing meta server connection due to receive timeout
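The numbers in that log line already tell the story: the last request from the metaserver arrived 17 seconds ago while the receive timeout is 16 seconds, so the chunkserver tears the connection down. A minimal sketch of that check, parsed straight from the log fields (this is just a hypothetical reading of the message; the actual logic lives in MetaServerSM.cc):

```python
import re

# Chunkserver log line from above, verbatim.
log = ("08-07-2018 01:08:34.036 ERROR - (MetaServerSM.cc:592) 10.10.1.3 20100 "
       "meta server inactivity timeout, last request received: 17 secs ago "
       "timeout: inactivity: 40 receive: 16")

m = re.search(r"last request received: (\d+) secs ago "
              r"timeout: inactivity: (\d+) receive: (\d+)", log)
last_rx_secs, inactivity_timeout, receive_timeout = map(int, m.groups())

# Hypothetical reading: the chunkserver closes the metaserver connection
# once the time since the last received request exceeds the receive timeout.
closes_connection = last_rx_secs > receive_timeout
print(last_rx_secs, receive_timeout, closes_connection)  # 17 16 True
```

So a write burst that delays metaserver traffic by even one second past the 16-second receive timeout is enough to drop the connection, which would match how little data it takes to trigger the problem.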

The issue is easy to reproduce: a simple qfscat | qfsput triggers it. When we stop all write activity, there are no disconnect errors for as long as there are no writes to the cluster. That is, if we don't write for days, there are no chunkserver errors for days.

One thing to note: when the disconnect error occurs, only a single chunkserver on a given node is disconnected; the remaining 9 chunkservers running on that node are fine. There's no pattern to which chunkserver gets disconnected; it appears random. On one node it's chunkserver01, on another it's chunkserver09, and so on.

When there's write activity, all nodes are affected by the disconnect error. The issue is spread almost evenly across nodes and doesn't depend on rack location. No node shows a strangely low or high number of triggered errors compared to the others.

None of the nodes show any lost TX/RX packets. There are no hardware issues and networking works properly.

This is a tcpdump packet capture of the issue:
891162 2018-08-14 13:57:21.006309 10.10.1.3 → 10.10.1.100 TCP 122 20100 → 46630 [PSH, ACK] Seq=53247 Ack=2318595 Win=32038 Len=52 TSval=873627156 TSecr=3858378234
891163 2018-08-14 13:57:21.006469 10.10.1.100 → 10.10.1.3 TCP 70 46630 → 20100 [RST, ACK] Seq=2318595 Ack=53299 Win=18888 Len=0 TSval=3858387169 TSecr=873627156
891164 2018-08-14 13:57:21.006540 10.10.1.100 → 10.10.1.3 TCP 78 45238 → 20100 [SYN, ECN, CWR] Seq=0 Win=29200 Len=0 MSS=1460 SACK_PERM=1 TSval=3858387169 TSecr=0 WS=2
891165 2018-08-14 13:57:21.006608 10.10.1.3 → 10.10.1.100 TCP 78 20100 → 45238 [SYN, ACK, ECN] Seq=0 Ack=1 Win=28960 Len=0 MSS=1460 SACK_PERM=1 TSval=873627157 TSecr=3858387169 WS=2

We have been using qfs 1.2 on a similarly sized cluster with no problems; no such chunkserver disconnect errors occurred.

Is this a regression in qfs 2.0? What can we do to help troubleshoot this and get it fixed?
