
riak_repl_aae full-sync deadlock #817

@martinsumner

Description


When running full-sync between a source and a sink, problems can occur when both clusters have a large number of keys and there is also a large delta between the clusters.

Scenario:

Total Keys: 20bn
Ring Size: 256
Delta: 100m difference

If an attempt is made to repair a partition, about 400K of the 1M segments in the AAE tree will be in a delta state.

The aae_exchange will have to use this code to repair:

https://github.com/basho/riak_core/blob/riak_kv-3.0.10/src/hashtree.erl#L1139-L1153

The 400K segments in the delta must be sent to the sink, using the Remote fun - this is L1139 in the linked code.
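
As a sketch of the two-phase shape of that code (illustrative only, not the actual hashtree.erl implementation; `key_hashes/2` and `diff_keys/2` here are stand-ins for the local tree lookup and comparison):

```erlang
%% Simplified sketch (not the actual hashtree.erl code) of the
%% two-phase shape of the exchange: every delta segment is pushed to
%% the remote side first, and only afterwards are the key/hash
%% replies read back. key_hashes/2 and diff_keys/2 are stand-ins for
%% the local tree lookup and comparison.
exchange_segments(Segments, Remote, LocalTree) ->
    %% Phase 1: hand every differing segment to the Remote fun up
    %% front - ~400K small request messages in this scenario.
    Remote(start_exchange_segments, Segments),
    %% Phase 2: only now are replies consumed, one segment at a
    %% time, and compared against the local tree.
    lists:flatmap(
      fun(Segment) ->
              RemoteKHs = Remote(key_hashes, Segment),
              LocalKHs = key_hashes(LocalTree, Segment),
              diff_keys(LocalKHs, RemoteKHs)
      end,
      Segments).
```

The significance of this shape becomes clear below: phase 1 completes in full before phase 2 reads anything back.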

The Remote fun is within the riak_repl_aae_source module:

https://github.com/basho/riak_repl/blob/riak_kv-3.0.11/src/riak_repl_aae_source.erl#L354-L378

For each of the 400K segments, it will call async_get_segment/3:

https://github.com/basho/riak_repl/blob/riak_kv-3.0.11/src/riak_repl_aae_source.erl#L420-L421

This will push the request onto the gen_tcp socket, to go over the wire to the riak_repl_aae_sink peer on the sink cluster:

https://github.com/basho/riak_repl/blob/riak_kv-3.0.11/src/riak_repl_aae_source.erl#L739-L750

What should be noted here is that each of these segment request messages is small on the wire (36 bytes); however, this still leads to a lot of data (around 13MB) being sent very quickly, as there is no other work in this send process to pause it.
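
The send side therefore reduces to something like the following fire-and-forget loop (a sketch; `send_segment_requests/3` and the message format are stand-ins for what async_get_segment/3 does per segment):

```erlang
%% Illustrative fire-and-forget request loop on the source side;
%% the function name and message format are stand-ins. Each request
%% is ~36 bytes on the wire, so 400K requests come to roughly
%% 400000 * 36 = ~13.7MB, pushed as fast as the socket accepts it.
send_segment_requests(Transport, Socket, Segments) ->
    lists:foreach(
      fun(Segment) ->
              Msg = term_to_binary({get_segment, Segment}),
              %% No reply is read here, so nothing paces this loop
              %% except back-pressure from the kernel's send buffer.
              ok = Transport:send(Socket, Msg)
      end,
      Segments).
```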

At the riak_repl_aae_sink side, the TCP socket is set to {active, once}. That is to say, it will pull one message at a time off the wire, leaving any other messages in the buffer. Each segment request will then be received by this function:

https://github.com/basho/riak_repl/blob/riak_kv-3.0.11/src/riak_repl_aae_sink.erl#L141-L143
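
For readers unfamiliar with the idiom, {active, once} delivery looks roughly like this (a generic sketch of the standard pattern, with `process_msg/2` as a stand-in for the sink's handling):

```erlang
%% Generic {active, once} idiom: the kernel delivers one packet as an
%% Erlang message, and further packets wait in the kernel's receive
%% buffer until the socket is re-armed. process_msg/2 is a stand-in
%% for the sink's handling of one segment request.
handle_info({tcp, Socket, Data}, State) ->
    NewState = process_msg(binary_to_term(Data), State),
    %% Re-arm for the next message. If this loop is slow, the kernel
    %% buffer fills and the peer eventually sees a zero TCP window.
    ok = inet:setopts(Socket, [{active, once}]),
    {noreply, NewState}.
```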

The exchange_segment function will need to fetch a list of Keys/Hashes under that segment - in this case about 90-100 keys. This should be a quick operation, on the order of 1ms - although this is relatively slow when compared to the speed at which the segment requests are being sent over the wire from the riak_repl_aae_source. Once the list has been fetched, it will be cast to a spawned process to be sent back to the riak_repl_aae_source. Note that each list will be on the order of 10KB in size - much larger than the 36 byte request.

https://github.com/basho/riak_repl/blob/riak_kv-3.0.11/src/riak_repl_aae_sink.erl#L209-L213

Once the list of Keys/Hashes has been handed to the spawned process, the next segment request can be processed by the riak_repl_aae_sink.
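
That spawned process follows the usual dedicated-writer pattern, roughly as below (a sketch, not the module's actual code). Note that the writer itself still blocks in Transport:send/2 once the return path's buffers are full:

```erlang
%% Sketch of a dedicated writer process (illustrative only). The sink
%% casts each ~10KB key/hash reply here so its own receive loop can
%% move straight on to the next segment request.
start_sender(Transport, Socket) ->
    spawn_link(fun() -> sender_loop(Transport, Socket) end).

sender_loop(Transport, Socket) ->
    receive
        {send, Msg} ->
            %% Blocks once the return path's buffers are full, while
            %% further {send, _} replies queue in this process's
            %% mailbox without bound.
            ok = Transport:send(Socket, Msg),
            sender_loop(Transport, Socket);
        stop ->
            ok
    end.
```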

This process, though, has a significant flaw. It is hinted at in the comments, but not resolved by the use of a second process for sending:

https://github.com/basho/riak_repl/blob/riak_kv-3.0.11/src/riak_repl_aae_sink.erl#L181-L201

As the riak_repl_aae_source is sending segment requests faster than they are being processed, eventually, with a large number of segment requests, the sending will be paused at the OS level: the TCP receive buffer on the sink fills up, and the sink node reports TCP ZeroWindow, signalling the sending socket to halt until the receive buffer has space. From that point, at best, the sending process will in effect slow to the pace of processing by the riak_repl_aae_sink process - further segment requests may only be sent once there is space in the buffer, and space is only created by the riak_repl_aae_sink reading from that buffer.

Note though, at this stage, the riak_repl_aae_sink has begun to send large (circa 10KB) responses to the riak_repl_aae_source - and the riak_repl_aae_source is not receiving those messages. Only when all the segment requests have been sent will the hashtree process begin to request from the Remote fun those key_hashes which have been sent back:

https://github.com/basho/riak_repl/blob/riak_kv-3.0.11/src/riak_repl_aae_source.erl#L367-L368

Until the riak_repl_aae_source has sent all the requests, any responses the riak_repl_aae_sink sender has sent will have to sit in the network buffers. These can be large responses, so the buffers will soon be full, and the source node will report TCP ZeroWindow back to the sink node.

After the TCP ZeroWindow has been sent, and until all the segment requests have been sent by the riak_repl_aae_source, the sender on the sink side cannot successfully complete a Transport:send(Socket, Msg). With 400K segment requests being sent, the network buffers full, and around 1ms required to handle each message in the riak_repl_aae_sink, this could take 5 to 10 minutes.
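
As a rough check on that estimate, using the figures above: 400K requests at ~1ms of sink-side processing each is ~400 seconds, i.e. nearly 7 minutes before the last request clears the sink; meanwhile 400K replies at ~10KB each is around 4GB of response data queued against socket buffers that are orders of magnitude smaller.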

If an attempt to write to the socket is stalled for 30s due to a lack of TCP window space at the receiving node, without a window update from the opposing side to indicate space, the connection is reset (assumed by the sink node to be broken). This inevitably happens: both the riak_repl_aae_source and the riak_repl_aae_sink fail due to the TCP connection being closed.
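
This 30s reset behaviour matches what inet's standard send_timeout socket options produce; whether riak_repl sets exactly these values here is an assumption, but a minimal sketch of the mechanism:

```erlang
%% Standard inet socket options (whether riak_repl sets exactly these
%% values is an assumption): if a send makes no progress for 30s it
%% returns {error, timeout} and the socket is closed - matching the
%% reset described above.
connect(SinkHost, Port) ->
    gen_tcp:connect(SinkHost, Port,
                    [binary,
                     {active, once},
                     {send_timeout, 30000},
                     {send_timeout_close, true}]).
```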

No actual repairs have happened at this stage, as no Keys/Hashes have been read by the riak_repl_aae_source. So the delta is not reduced, and any attempt to re-run the sync fails with the same issue, at the same point. full-sync is permanently broken between those clusters unless the delta can be resolved through other means.
