
riak_repl_aae full-sync deadlock #817

@martinsumner

Description


When running full-sync between a source and a sink, problems can occur when both clusters have a large number of keys and there is also a large delta between the clusters.

Scenario:

Total Keys: 20bn
Ring Size: 256
Delta: 100m difference

If an attempt is made to repair a partition, about 400K of the 1M segments in the AAE tree will be in a delta state.

The aae_exchange will have to use this code to repair:

https://github.com/basho/riak_core/blob/riak_kv-3.0.10/src/hashtree.erl#L1139-L1153

The 400K segments in the delta must be sent to the sink, using the Remote fun - this is L1139 in the linked code.
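
As a sketch of the two-phase shape of that code (illustrative only, not the actual hashtree.erl implementation; `key_hashes/2` and `diff_keys/2` here are stand-ins for the local tree lookup and comparison):

```erlang
%% Simplified sketch (not the actual hashtree.erl code) of the
%% two-phase shape of the exchange: every delta segment is pushed to
%% the remote side first, and only afterwards are the key/hash
%% replies read back. key_hashes/2 and diff_keys/2 are stand-ins for
%% the local tree lookup and comparison.
exchange_segments(Segments, Remote, LocalTree) ->
    %% Phase 1: hand every differing segment to the Remote fun up
    %% front - ~400K small request messages in this scenario.
    Remote(start_exchange_segments, Segments),
    %% Phase 2: only now are replies consumed, one segment at a
    %% time, and compared against the local tree.
    lists:flatmap(
      fun(Segment) ->
              RemoteKHs = Remote(key_hashes, Segment),
              LocalKHs = key_hashes(LocalTree, Segment),
              diff_keys(LocalKHs, RemoteKHs)
      end,
      Segments).
```

The significance of this shape becomes clear below: phase 1 completes in full before phase 2 reads anything back.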

The Remote fun is within the riak_repl_aae_source module:

https://github.com/basho/riak_repl/blob/riak_kv-3.0.11/src/riak_repl_aae_source.erl#L354-L378

For each of the 400K segments, it will call async_get_segment/3:

https://github.com/basho/riak_repl/blob/riak_kv-3.0.11/src/riak_repl_aae_source.erl#L420-L421

This will push the request onto the gen_tcp socket, to go over the wire to the riak_repl_aae_sink peer on the sink cluster:

https://github.com/basho/riak_repl/blob/riak_kv-3.0.11/src/riak_repl_aae_source.erl#L739-L750

What should be noted here is that each of these segment request messages is small on the wire (36 bytes); however, this still leads to a lot of data (around 13MB) being sent very quickly, as there is no other work in this send process to pause it.
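
The send side therefore reduces to something like the following fire-and-forget loop (a sketch; `send_segment_requests/3` and the message format are stand-ins for what async_get_segment/3 does per segment):

```erlang
%% Illustrative fire-and-forget request loop on the source side;
%% the function name and message format are stand-ins. Each request
%% is ~36 bytes on the wire, so 400K requests come to roughly
%% 400000 * 36 = ~13.7MB, pushed as fast as the socket accepts it.
send_segment_requests(Transport, Socket, Segments) ->
    lists:foreach(
      fun(Segment) ->
              Msg = term_to_binary({get_segment, Segment}),
              %% No reply is read here, so nothing paces this loop
              %% except back-pressure from the kernel's send buffer.
              ok = Transport:send(Socket, Msg)
      end,
      Segments).
```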

At the riak_repl_aae_sink side, the TCP socket is set to {active, once}. That is to say, it will pull one message at a time off the wire, leaving any other messages in the buffer. Each segment request will then be received by this function:

https://github.com/basho/riak_repl/blob/riak_kv-3.0.11/src/riak_repl_aae_sink.erl#L141-L143
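
For readers unfamiliar with the idiom, {active, once} delivery looks roughly like this (a generic sketch of the standard pattern, with `process_msg/2` as a stand-in for the sink's handling):

```erlang
%% Generic {active, once} idiom: the kernel delivers one packet as an
%% Erlang message, and further packets wait in the kernel's receive
%% buffer until the socket is re-armed. process_msg/2 is a stand-in
%% for the sink's handling of one segment request.
handle_info({tcp, Socket, Data}, State) ->
    NewState = process_msg(binary_to_term(Data), State),
    %% Re-arm for the next message. If this loop is slow, the kernel
    %% buffer fills and the peer eventually sees a zero TCP window.
    ok = inet:setopts(Socket, [{active, once}]),
    {noreply, NewState}.
```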

The exchange_segment function will need to fetch a list of Keys/Hashes under that segment - in this case about 90-100 keys. This should be a quick operation, on the order of 1ms - although this is relatively slow when compared to the speed at which the segment requests are being sent over the wire from the riak_repl_aae_source. Once the list has been fetched, it will be cast to a spawned process to be sent back to the riak_repl_aae_source. Note that each list will be on the order of 10KB in size - much larger than the 36 byte request.

https://github.com/basho/riak_repl/blob/riak_kv-3.0.11/src/riak_repl_aae_sink.erl#L209-L213

Once the list of Keys/Hashes has been handed to the spawned process, the next segment request can be processed by the riak_repl_aae_sink.
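
That spawned process follows the usual dedicated-writer pattern, roughly as below (a sketch, not the module's actual code). Note that the writer itself still blocks in Transport:send/2 once the return path's buffers are full:

```erlang
%% Sketch of a dedicated writer process (illustrative only). The sink
%% casts each ~10KB key/hash reply here so its own receive loop can
%% move straight on to the next segment request.
start_sender(Transport, Socket) ->
    spawn_link(fun() -> sender_loop(Transport, Socket) end).

sender_loop(Transport, Socket) ->
    receive
        {send, Msg} ->
            %% Blocks once the return path's buffers are full, while
            %% further {send, _} replies queue in this process's
            %% mailbox without bound.
            ok = Transport:send(Socket, Msg),
            sender_loop(Transport, Socket);
        stop ->
            ok
    end.
```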

This process, though, has a significant flaw. It is hinted at in the comments, but not resolved by the use of a second process for sending:

https://github.com/basho/riak_repl/blob/riak_kv-3.0.11/src/riak_repl_aae_sink.erl#L181-L201

As the riak_repl_aae_source is sending segment requests faster than they are being processed, eventually, with a large number of segment requests, the sending will be paused at the OS level: the TCP receive buffer on the sink fills up, and the sink node reports TCP ZeroWindow, signalling the sending socket to halt until the receive buffer has space. From that point, at best, the sending process will in effect slow to the pace of processing by the riak_repl_aae_sink process - further segment requests may only be sent once there is space in the buffer, and space is only created by the riak_repl_aae_sink reading from that buffer.

Note though, at this stage, the riak_repl_aae_sink has begun to send large (circa 10KB) responses to the riak_repl_aae_source - and the riak_repl_aae_source is not receiving those messages. Only when all the segment requests have been sent will the hashtree process begin to request from the Remote fun those key_hashes which have been sent back:

https://github.com/basho/riak_repl/blob/riak_kv-3.0.11/src/riak_repl_aae_source.erl#L367-L368

Until the riak_repl_aae_source has sent all the requests, any responses the riak_repl_aae_sink sender has sent will have to sit in the network buffers. These can be large responses, so the buffers will soon be full, and the source node will report TCP ZeroWindow back to the sink node.

After the TCP ZeroWindow has been sent, and until all the segment requests have been sent by the riak_repl_aae_source, the sender on the sink side cannot successfully complete a Transport:send(Socket, Msg). With 400K segment requests being sent, the network buffers full, and around 1ms required to handle each message in the riak_repl_aae_sink, this could take 5 to 10 minutes.
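
As a rough check on that estimate, using the figures above: 400K requests at ~1ms of sink-side processing each is ~400 seconds, i.e. nearly 7 minutes before the last request clears the sink; meanwhile 400K replies at ~10KB each is around 4GB of response data queued against socket buffers that are orders of magnitude smaller.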

If an attempt to write to the socket is stalled for 30s due to a lack of TCP window space at the receiving node, without a window update from the opposing side to indicate space, the connection is reset (assumed by the sink node to be broken). This inevitably happens: both the riak_repl_aae_source and the riak_repl_aae_sink fail due to the TCP connection being closed.
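
This 30s reset behaviour matches what inet's standard send_timeout socket options produce; whether riak_repl sets exactly these values here is an assumption, but a minimal sketch of the mechanism:

```erlang
%% Standard inet socket options (whether riak_repl sets exactly these
%% values is an assumption): if a send makes no progress for 30s it
%% returns {error, timeout} and the socket is closed - matching the
%% reset described above.
connect(SinkHost, Port) ->
    gen_tcp:connect(SinkHost, Port,
                    [binary,
                     {active, once},
                     {send_timeout, 30000},
                     {send_timeout_close, true}]).
```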

No actual repairs have happened at this stage, as no Keys/Hashes have been read by the riak_repl_aae_source. So the delta is not reduced, and any attempt to re-run the sync fails with the same issue, at the same point. full-sync is permanently broken between those clusters unless the delta can be resolved through other means.
