Description
When performing an AAE full-sync, if the vnode elected for co-ordination is in the middle of a tree rebuild (and given multi-hour reload times this would seem to be a common event), the co-ordination attempt will soft exit.
The soft exit is captured and handled here:
https://github.com/basho/riak_repl/blob/2.1.8/src/riak_repl2_fscoordinator.erl#L537-L554
This code checks the soft-exit count against a retry limit, and presumably prompts a retry if the limit has not been reached. The default soft retry limit is set to infinity.
In production we see very rapid retries, which escalate the sys process count and ultimately impact the stability of the cluster.
Attempts are underway to control this behaviour by setting a max retry limit. However, the default setting appears to be unsafe and should probably be changed. Perhaps there should also be a wait between retries. There may also be an underlying issue causing the process count to stack up with the retries.
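
To illustrate the two mitigations suggested above (a finite retry limit and a wait between retries), here is a minimal sketch in Python. It is not the riak_repl implementation, which is Erlang. The function `retry_soft_exit` and its parameters `max_retries`, `base_wait` and `max_wait` are hypothetical names chosen for this example, and the exponential backoff is just one possible way to space out retries.

```python
import time
from typing import Callable, Optional


def retry_soft_exit(
    attempt: Callable[[], bool],
    max_retries: Optional[int] = 100,   # hypothetical finite default; None = unbounded
    base_wait: float = 1.0,             # hypothetical first wait, in seconds
    max_wait: float = 60.0,             # hypothetical cap on any single wait
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Run attempt() until it succeeds or the retry limit is reached.

    attempt() returns True on success and False on a soft exit
    (for example, the remote vnode is busy rebuilding its tree).
    """
    retries = 0
    while True:
        if attempt():
            return True
        if max_retries is not None and retries >= max_retries:
            # Give up instead of retrying forever.
            return False
        # Exponential backoff, capped so retries never become rapid-fire
        # and the wait never grows beyond max_wait.
        wait = min(max_wait, base_wait * (2 ** min(retries, 30)))
        sleep(wait)
        retries += 1
```

With `max_retries=None` the loop behaves like the current infinite default, except that the capped wait keeps retries from piling up faster than the remote side can recover.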