
chunkservers get disconnected #230

@szabolcsf

Description


We have deployed qfs 2.0 on a cluster of 200 dedicated servers running Debian 9 x86-64. Each server has 10 hard drives, and we run one chunkserver per hard drive. We run a single metaserver on a separate server.
These servers have nothing else running on them. They don't suffer from high load or any resource bottlenecks.

Now there's a problem: every time we start writing to the cluster chunkservers get temporarily disconnected and end up in the dead nodes history. It doesn't matter how much data we write, a few megabytes are enough to trigger the problem.

This is what it looks like from the metaserver:
08-07-2018 01:08:34.061 DEBUG - (NetConnection.cc:108) netconn: 1262 read: Connection reset by peer 104
08-07-2018 01:08:34.061 ERROR - (ChunkServer.cc:1304) 10.10.1.5 21014 / 10.10.1.5:42806 chunk server down reason: communication error socket: good: 0 status: Connection reset by peer 104 -104
08-07-2018 01:08:34.061 INFO - (LayoutManager.cc:6019) server down: 10.10.1.5 21014 block count: 36595 master: 0 replay: 0 reason: communication error; Connection reset by peer 104 chunk-server-bye: 10.10.1.5 21014 logseq: 0 0 893823745 chunks: 36595 checksum: 0 2030060475277 36752 log: in flight: 0 repl delay: 7200 completion: no

and this is what it looks like on a chunkserver:
08-07-2018 01:08:34.036 ERROR - (MetaServerSM.cc:592) 10.10.1.3 20100 meta server inactivity timeout, last request received: 17 secs ago timeout: inactivity: 40 receive: 16
08-07-2018 01:08:34.036 ERROR - (MetaServerSM.cc:1043) 10.10.1.3 20100 closing meta server connection due to receive timeout
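The numbers in that log line already tell the story: the last request from the metaserver arrived 17 seconds ago while the receive timeout is 16 seconds, so the chunkserver tears the connection down. A minimal sketch of that check, parsed straight from the log fields (this is just a hypothetical reading of the message; the actual logic lives in MetaServerSM.cc):

```python
import re

# Chunkserver log line from above, verbatim.
log = ("08-07-2018 01:08:34.036 ERROR - (MetaServerSM.cc:592) 10.10.1.3 20100 "
       "meta server inactivity timeout, last request received: 17 secs ago "
       "timeout: inactivity: 40 receive: 16")

m = re.search(r"last request received: (\d+) secs ago "
              r"timeout: inactivity: (\d+) receive: (\d+)", log)
last_rx_secs, inactivity_timeout, receive_timeout = map(int, m.groups())

# Hypothetical reading: the chunkserver closes the metaserver connection
# once the time since the last received request exceeds the receive timeout.
closes_connection = last_rx_secs > receive_timeout
print(last_rx_secs, receive_timeout, closes_connection)  # 17 16 True
```

So a write burst that delays metaserver traffic by even one second past the 16-second receive timeout is enough to drop the connection, which would match how little data it takes to trigger the problem.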

The issue is easy to reproduce: a simple qfscat | qfsput triggers it. When we stop all write activity, there are no disconnect errors for as long as there are no writes to the cluster. That is, if we don't write for days, there are no chunkserver errors for days.

One thing to note: when the disconnect error occurs, only a single chunkserver on a given node is disconnected; the remaining 9 chunkservers running on that node are fine. There's no pattern to which chunkserver gets disconnected; it appears random. On one node it's chunkserver01, on another it's chunkserver09, and so on.

When there's write activity, all nodes are affected by the disconnect error. The issue is spread almost evenly across nodes and doesn't depend on rack location. No node shows a strangely low or high number of triggered errors compared to the others.

None of the nodes show any lost TX/RX packets. There are no hardware issues and networking works properly.

This is a tcpdump packet capture of the issue:
891162 2018-08-14 13:57:21.006309 10.10.1.3 → 10.10.1.100 TCP 122 20100 → 46630 [PSH, ACK] Seq=53247 Ack=2318595 Win=32038 Len=52 TSval=873627156 TSecr=3858378234
891163 2018-08-14 13:57:21.006469 10.10.1.100 → 10.10.1.3 TCP 70 46630 → 20100 [RST, ACK] Seq=2318595 Ack=53299 Win=18888 Len=0 TSval=3858387169 TSecr=873627156
891164 2018-08-14 13:57:21.006540 10.10.1.100 → 10.10.1.3 TCP 78 45238 → 20100 [SYN, ECN, CWR] Seq=0 Win=29200 Len=0 MSS=1460 SACK_PERM=1 TSval=3858387169 TSecr=0 WS=2
891165 2018-08-14 13:57:21.006608 10.10.1.3 → 10.10.1.100 TCP 78 20100 → 45238 [SYN, ACK, ECN] Seq=0 Ack=1 Win=28960 Len=0 MSS=1460 SACK_PERM=1 TSval=873627157 TSecr=3858387169 WS=2

We have been using qfs 1.2 on a similarly sized cluster with no problems; no such chunkserver disconnect errors occurred.

Is this a regression in qfs 2.0? What can we do to help troubleshoot this and get it fixed?
