
Automatically rebalancing connections of realtime repl after recovering node failure. [JIRA: RIAK-2380] #724

Open
ksauzz opened this issue Feb 12, 2016 · 4 comments

Comments

@ksauzz

ksauzz commented Feb 12, 2016

With realtime replication, a connection from the source cluster is re-established to another healthy node when a node goes down in the sink cluster. However, nothing re-connects to that node even after it recovers. This causes an imbalanced realtime replication load in the sink cluster.

A workaround is to restart realtime replication with riak-repl realtime stop/start <clustername>, but it would be better to rebalance connections automatically when a failed node comes back or a new node is added to the sink cluster.

related to #350

@Basho-JIRA Basho-JIRA changed the title Automatically rebalancing connections of realtime repl after recovering node failure. Automatically rebalancing connections of realtime repl after recovering node failure. [JIRA: RIAK-2380] Feb 12, 2016
@jonmeredith
Contributor

Hi Kaz,

Realtime replication should rebalance automatically - did it not work? It was added around 2.0.3 as PR #651

Unfortunately there aren't many log messages to tell whether it is working - we should at least modify it to log when it is reconnecting and why. Currently the only way to know is if you see

            lager:info("Established realtime connection to site ~p address ~s",
                       [Remote, peername(State2)]),

from the same Pid in the logs, since the realtime process is re-used.
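
As a rough sketch of the kind of extra logging suggested above (the module, function, and Reason values below are hypothetical, not the actual riak_repl code; only lager itself is the real API), something like this on the reconnect path would make rebalancing visible without having to match Pids:

    %% Hypothetical helper: log every realtime (re)connect together with why it
    %% happened, so rebalancing shows up in the logs.
    -module(rtsource_reconnect_log).
    -compile([{parse_transform, lager_transform}]).
    -export([log_reconnect/3]).

    %% Remote   - sink cluster name
    %% Reason   - why the connection was (re)established, e.g. rebalance | sink_node_down
    %% Peername - "host:port" string of the sink node we connected to
    log_reconnect(Remote, Reason, Peername) ->
        lager:info("Established realtime connection to site ~p address ~s (reason: ~p)",
                   [Remote, Peername, Reason]).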

@ksauzz
Author

ksauzz commented Feb 13, 2016

I might have misunderstood it. Let me check again...

@ksauzz
Author

ksauzz commented Mar 25, 2016

I checked the behavior and the implementation of rebalancing connections to the sink nodes. I confirmed that reconnection started with some delay after ring_update was triggered. That works fine! 😊
Thank you.

@ksauzz ksauzz closed this as completed Mar 25, 2016
@ksauzz
Author

ksauzz commented Mar 25, 2016

Sorry for reopening. It seems that restarting a sink node doesn't trigger ring_update on the source node. I had assumed that the list of stopped nodes is transferred to the source cluster, but it looks like the ring data doesn't carry that information as metadata.

In the following case, the rebalancing doesn't start.

  1. make a source cluster (dev1, dev2, dev3) and a sink cluster (dev4, dev5, dev6)
  2. enable and start realtime replication
  3. wait for all handoffs to finish
  4. stop dev4 and dev5
  5. make sure all source nodes connect to dev6
  6. start dev4 and dev5

I guess another trigger is needed to start the rebalancing in addition to ring_update.
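
A minimal sketch of what such an additional trigger could look like on the source side, assuming a periodic re-check of the reachable sink nodes is acceptable; sink_nodes/0 and trigger_rebalance/0 below are placeholders for whatever riak_repl exposes internally, and only the standard gen_server/erlang calls are real:

    %% Sketch only: poll the set of reachable sink nodes and rebalance when it
    %% changes, instead of relying on ring_update alone.
    -module(rt_rebalance_poller).
    -behaviour(gen_server).

    -export([start_link/0]).
    -export([init/1, handle_call/3, handle_cast/2, handle_info/2]).

    -define(INTERVAL, 60000).  %% re-check every 60 seconds

    start_link() ->
        gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).

    init([]) ->
        erlang:send_after(?INTERVAL, self(), check),
        {ok, ordsets:new()}.

    handle_call(_Req, _From, Known) -> {reply, ok, Known}.
    handle_cast(_Msg, Known)        -> {noreply, Known}.

    handle_info(check, Known) ->
        Current = ordsets:from_list(sink_nodes()),
        case Current =:= Known of
            true  -> ok;
            %% sink membership changed (node recovered or added): re-spread connections
            false -> trigger_rebalance()
        end,
        erlang:send_after(?INTERVAL, self(), check),
        {noreply, Current};
    handle_info(_Other, Known) ->
        {noreply, Known}.

    %% Placeholder: ask the cluster/connection manager which sink nodes are reachable.
    sink_nodes() ->
        [].

    %% Placeholder: make the realtime source connections re-balance across the sinks.
    trigger_rebalance() ->
        ok.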

@ksauzz ksauzz reopened this Mar 25, 2016