
Automatically rebalancing connections of realtime repl after recovering node failure. [JIRA: RIAK-2380] #724

Open
ksauzz opened this issue Feb 12, 2016 · 4 comments

Comments

@ksauzz

ksauzz commented Feb 12, 2016

With realtime replication, a connection from the source cluster is re-established to another healthy node when a node goes down in the sink cluster. However, nothing re-connects to that node even after it recovers. This causes an imbalanced realtime replication load in the sink cluster.

A workaround is to restart realtime replication with riak-repl realtime stop/start <clustername>, but it would be better to rebalance connections automatically when a failed node comes back or a new node is added to the sink cluster.

related to #350

@Basho-JIRA Basho-JIRA changed the title Automatically rebalancing connections of realtime repl after recovering node failure. Automatically rebalancing connections of realtime repl after recovering node failure. [JIRA: RIAK-2380] Feb 12, 2016
@jonmeredith
Contributor

Hi Kaz,

Realtime replication should rebalance automatically - did it not work? It was added around 2.0.3 as PR #651

Unfortunately there aren't many log messages to tell whether it is working - we should at least modify it to log when it is reconnecting and why. Currently the only way to know is if you see

            lager:info("Established realtime connection to site ~p address ~s",
                       [Remote, peername(State2)]),

from the same Pid in the logs, since the realtime process is re-used.
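
As a rough sketch of the kind of extra logging suggested above (the module, function, and Reason values below are hypothetical, not the actual riak_repl code; only lager itself is the real API), something like this on the reconnect path would make rebalancing visible without having to match Pids:

    %% Hypothetical helper: log every realtime (re)connect together with why it
    %% happened, so rebalancing shows up in the logs.
    -module(rtsource_reconnect_log).
    -compile([{parse_transform, lager_transform}]).
    -export([log_reconnect/3]).

    %% Remote   - sink cluster name
    %% Reason   - why the connection was (re)established, e.g. rebalance | sink_node_down
    %% Peername - "host:port" string of the sink node we connected to
    log_reconnect(Remote, Reason, Peername) ->
        lager:info("Established realtime connection to site ~p address ~s (reason: ~p)",
                   [Remote, Peername, Reason]).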

@ksauzz
Author

ksauzz commented Feb 13, 2016

I might have misunderstood it. Let me check again...

@ksauzz
Author

ksauzz commented Mar 25, 2016

I checked the behavior and the implementation of rebalancing connections to the sink nodes. I confirmed that reconnection started with some delay after ring_update was triggered. That works fine! 😊
Thank you.

@ksauzz ksauzz closed this as completed Mar 25, 2016
@ksauzz
Author

ksauzz commented Mar 25, 2016

Sorry for reopening. It seems that restarting a sink node doesn't trigger ring_update on the source node. I had assumed that the list of stopped nodes is transferred to the source cluster, but it looks like the ring data doesn't carry that information as metadata.

In the following case, the rebalancing doesn't start.

  1. make a source cluster (dev1, dev2, dev3) and a sink cluster (dev4, dev5, dev6)
  2. enable and start realtime replication
  3. wait for all handoffs to finish
  4. stop dev4 and dev5
  5. make sure all source nodes connect to dev6
  6. start dev4 and dev5

I guess another trigger is needed to start the rebalancing in addition to ring_update.
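
A minimal sketch of what such an additional trigger could look like on the source side, assuming a periodic re-check of the reachable sink nodes is acceptable; sink_nodes/0 and trigger_rebalance/0 below are placeholders for whatever riak_repl exposes internally, and only the standard gen_server/erlang calls are real:

    %% Sketch only: poll the set of reachable sink nodes and rebalance when it
    %% changes, instead of relying on ring_update alone.
    -module(rt_rebalance_poller).
    -behaviour(gen_server).

    -export([start_link/0]).
    -export([init/1, handle_call/3, handle_cast/2, handle_info/2]).

    -define(INTERVAL, 60000).  %% re-check every 60 seconds

    start_link() ->
        gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).

    init([]) ->
        erlang:send_after(?INTERVAL, self(), check),
        {ok, ordsets:new()}.

    handle_call(_Req, _From, Known) -> {reply, ok, Known}.
    handle_cast(_Msg, Known)        -> {noreply, Known}.

    handle_info(check, Known) ->
        Current = ordsets:from_list(sink_nodes()),
        case Current =:= Known of
            true  -> ok;
            %% sink membership changed (node recovered or added): re-spread connections
            false -> trigger_rebalance()
        end,
        erlang:send_after(?INTERVAL, self(), check),
        {noreply, Current};
    handle_info(_Other, Known) ->
        {noreply, Known}.

    %% Placeholder: ask the cluster/connection manager which sink nodes are reachable.
    sink_nodes() ->
        [].

    %% Placeholder: make the realtime source connections re-balance across the sinks.
    trigger_rebalance() ->
        ok.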

@ksauzz ksauzz reopened this Mar 25, 2016