Soft retry limits - AAE fullsync #772

When performing AAE full-sync, if the vnode elected to co-ordinate with is in the middle of a tree rebuild, the co-ordination attempt will soft exit. Given multi-hour reload times, this would seem to be a common event.

The soft exit is captured and handled here:

https://github.com/basho/riak_repl/blob/2.1.8/src/riak_repl2_fscoordinator.erl#L537-L554

This code checks the soft exit against a retry limit, and presumably prompts a retry if the limit has not been reached. The default soft retry limit is set to infinity, so by default the retries never stop.
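To make the consequence of an infinite limit concrete, here is a minimal sketch of that kind of check. It is written in Python for illustration only and is not the riak_repl code: `should_retry`, `soft_exits_so_far` and `soft_retry_limit` are hypothetical names, and the exact comparison used by the coordinator is an assumption.

```python
import math

# Hypothetical stand-in for a default of "infinity".
UNLIMITED = math.inf


def should_retry(soft_exits_so_far, soft_retry_limit=UNLIMITED):
    """Return True if another attempt is allowed after a soft exit.

    Assumes the limit is compared against the number of soft exits
    already seen for the partition.
    """
    return soft_exits_so_far < soft_retry_limit


# With the default limit, no count is ever large enough to stop retrying.
print(should_retry(1_000_000))   # unbounded default
print(should_retry(3, 3))        # a finite limit eventually stops the retries
```

With a default of infinity the check can never fail, so nothing in this path slows or stops the retries while the tree rebuild is still in progress.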

In production we see very rapid retries, which escalate the sys process count and ultimately impact the stability of the cluster.

Attempts are underway to control this behaviour by setting a max retry limit. However, the default setting appears to be unsafe and should probably be changed. Perhaps there should also be a wait between retries. There may also be an underlying issue causing the process count to stack up with the retries.
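As a rough illustration of what a finite limit combined with a wait between retries could look like, here is a small Python sketch. It is not a proposed patch: all names (`retry_delays`, `run_with_soft_retries`, `base_wait`, `max_wait`) are hypothetical, and the doubling backoff is just one possible choice of wait policy.

```python
import time


def retry_delays(max_retries, base_wait=1.0, max_wait=60.0):
    """Yield the wait in seconds before each permitted retry.

    The wait doubles on every retry and is capped at max_wait.
    """
    for attempt in range(max_retries):
        yield min(base_wait * 2 ** attempt, max_wait)


def run_with_soft_retries(attempt_fn, max_retries, sleep=time.sleep,
                          base_wait=1.0, max_wait=60.0):
    """Call attempt_fn until it returns True or the retries run out.

    Returns (succeeded, attempts_made). The sleep function is injectable
    so the wait policy can be exercised without real delays.
    """
    attempts = 1
    if attempt_fn():
        return True, attempts
    for delay in retry_delays(max_retries, base_wait, max_wait):
        sleep(delay)          # back off instead of retrying immediately
        attempts += 1
        if attempt_fn():
            return True, attempts
    return False, attempts


if __name__ == "__main__":
    # Simulate a vnode that is busy rebuilding for the first two attempts.
    outcomes = iter([False, False, True])
    waits = []
    print(run_with_soft_retries(lambda: next(outcomes), 5, sleep=waits.append))
    print(waits)
```

The point of the sketch is that both the number of attempts and their rate are bounded, so a long-running rebuild on the remote vnode cannot generate an unbounded burst of retries.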

![sys_process_count](https://user-images.githubusercontent.com/1628897/31717136-aa4c82fa-b402-11e7-8e18-9d8c9ba6eca5.png)

![retry_logs](https://user-images.githubusercontent.com/1628897/31717146-b2beceb6-b402-11e7-80c5-f7d957f43832.png)

