Soft retry limits - AAE fullsync #772

@martinsumner

When performing AAE full-sync, if the vnode elected to co-ordinate with is doing a tree rebuild (and given multi-hour reload times this would seem to be a common event), the co-ordination attempt will soft exit.

The soft exit is captured and handled here:

https://github.com/basho/riak_repl/blob/2.1.8/src/riak_repl2_fscoordinator.erl#L537-L554

This checks against a retry limit, and presumably prompts a retry if the limit is not reached. The default soft retry limit is set to infinity.

In production we see very rapid retries, which escalate the sys process count and ultimately impact the stability of the cluster.

Attempts are underway to control this behaviour by setting a max retry limit. However, the default setting appears to be unsafe and should probably be changed. Perhaps there should also be a wait between retries. There may also be an underlying issue causing the process count to stack up with the retries.
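
To make the suggestion concrete, here is a rough sketch of what a bounded soft retry with a wait between attempts could look like. It is written in Python purely for illustration and is not the riak_repl code; `SoftExit`, `run_with_soft_retries`, and the parameters `max_soft_retries`, `base_wait` and `max_wait` are all hypothetical names, and the exponential wait is just one possible choice.

```python
import time


class SoftExit(Exception):
    """Hypothetical signal that the remote vnode is busy (e.g. rebuilding its tree)."""


def run_with_soft_retries(attempt, max_soft_retries=10, base_wait=1.0,
                          max_wait=60.0, sleep=time.sleep):
    """Call attempt() until it succeeds or the soft retry limit is reached.

    All names and defaults here are illustrative assumptions, not riak_repl settings.
    """
    retries = 0
    while True:
        try:
            return attempt()
        except SoftExit:
            # A finite limit stops the loop from retrying forever.
            if retries >= max_soft_retries:
                raise
            # Wait before retrying, doubling each time up to max_wait,
            # so that retries cannot fire in rapid succession.
            sleep(min(max_wait, base_wait * (2 ** retries)))
            retries += 1
```

With a finite limit and a wait, a vnode that stays busy for hours results in a bounded number of spaced-out attempts rather than a tight retry loop. Whether the wait should be fixed or growing is a design choice this sketch does not settle.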

[Image: sys_process_count]

[Image: retry_logs]
