Skip a partition replication due to race condition on fullsync [JIRA: RIAK-2551] #743

@ksauzz

Description

There is a race condition in riak_repl_keylist_server:bloom_fold during fullsync that can raise a function_clause error. The fullsync manager treats this error as a normal completion of the partition's replication, so the user gets no indication that the partition was skipped and that some of its keys may not have been replicated to the sink cluster.
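
To illustrate the distinction that appears to be missing, here is a minimal sketch of a coordinator that tells a normal worker exit apart from a crash. It is not the riak_repl implementation; the module name `exit_reason_sketch` and the function `run_partition/1` are hypothetical, and only standard Erlang calls (`spawn_monitor/1`, `'DOWN'` messages) are used.

```erlang
-module(exit_reason_sketch).
-export([run_partition/1]).

%% Hypothetical helper: runs Work in a monitored process and reports
%% how that process ended. Only an exit reason of `normal` counts as
%% a completed partition; anything else is surfaced as a failure.
run_partition(Work) when is_function(Work, 0) ->
    {Pid, Ref} = spawn_monitor(Work),
    receive
        {'DOWN', Ref, process, Pid, normal} ->
            completed;
        {'DOWN', Ref, process, Pid, Reason} ->
            {failed, Reason}
    end.
```

With this shape, a worker that dies with function_clause would yield `{failed, {function_clause, Stacktrace}}` rather than being counted as finished.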

The Cause

In keylist fullsync, bloom_fold, the fold function running on the vnode worker, waits for resume_pause after sending a batch to the sink node. For a reason not yet understood, the waiting worker received a different fold message instead (see "Last message in" in crash.log). The branch that handles that message then runs, and the ?TRACE macro in it returns the atom ok. That ok becomes the fold accumulator, and the vnode worker crashes on the next bloom_fold call.
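
The failure mode described above can be reproduced in miniature. The following is a sketch under stated assumptions, not the actual riak_repl code: the module `fold_acc_sketch`, its functions, the message atoms `stray` and `continue`, and the accumulator shape `{Sent, Left, WinSz}` are all hypothetical. It also assumes the trace macro expands to the atom ok when tracing is disabled, which is what the crash implies. `fixed_fold/2` shows one possible way to keep the accumulator intact; it is not necessarily the right fix for riak_repl.

```erlang
-module(fold_acc_sketch).
-export([run_buggy/0, run_fixed/0]).

%% Assumption: with tracing off, the trace macro is just the atom ok.
-define(TRACE(_Stmt), ok).

%% Window exhausted (Left =:= 0): wait to be told to continue.
buggy_fold(Key, {Sent, 0, WinSz}) ->
    receive
        continue ->
            buggy_fold(Key, {Sent, WinSz, WinSz});
        _Other ->
            %% The value of this branch is the macro's value, ok,
            %% so ok is returned as the new accumulator.
            ?TRACE(io:format("unexpected: ~p~n", [_Other]))
    end;
buggy_fold(Key, {Sent, Left, WinSz}) ->
    {[Key | Sent], Left - 1, WinSz}.

%% Same logic, but an unexpected message leaves the accumulator as is
%% and the function keeps waiting.
fixed_fold(Key, {Sent, 0, WinSz} = Acc) ->
    receive
        continue ->
            fixed_fold(Key, {Sent, WinSz, WinSz});
        _Other ->
            ?TRACE(io:format("unexpected: ~p~n", [_Other])),
            fixed_fold(Key, Acc)
    end;
fixed_fold(Key, {Sent, Left, WinSz}) ->
    {[Key | Sent], Left - 1, WinSz}.

%% Queue an unexpected message ahead of the expected one, then fold.
run(FoldFun) ->
    self() ! stray,
    self() ! continue,
    lists:foldl(FoldFun, {[], 1, 2}, [k1, k2, k3]).

run_buggy() -> run(fun buggy_fold/2).   %% raises function_clause on k3
run_fixed() -> run(fun fixed_fold/2).   %% returns {[k3,k2,k1],0,2}
```

In `run_buggy/0`, the fold consumes `stray` while handling k2, returns ok, and then fails on k3 because no clause accepts ok as an accumulator. The expected `continue` message is left unread in the mailbox, which resembles the `messages: [bloom_resume]` entry in the crash report.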

Reproduction Steps

None found so far.

Occurrence Frequency

A customer has observed this occasionally. To them, it appears to happen at random.

error.log

2016-04-07 23:22:01.094 [error] <0.9942.66> gen_server <0.9942.66> terminated with reason: no function clause matching riak_repl_keylist_server:bloom_fold({<<48,98,58,16,121,107,99,149,64,231,6,8,234,204,240,88,62,111,225>>,<<121,107,250,67,143,8,73,...>>}, <<53,1,0,0,0,34,131,108,0,0,0,1,104,2,109,0,0,0,8,239,219,125,147,226,32,130,220,104,2,97,1,110,...>>, ok) line 675
2016-04-07 23:22:01.111 [error] <0.9942.66> CRASH REPORT Process <0.9942.66> with 0 neighbours exited with reason: no function clause matching riak_repl_keylist_server:bloom_fold({<<48,98,58,16,121,107,99,149,64,231,6,8,234,204,240,88,62,111,225>>,<<121,107,250,67,143,8,73,...>>}, <<53,1,0,0,0,34,131,108,0,0,0,1,104,2,109,0,0,0,8,239,219,125,147,226,32,130,220,104,2,97,1,110,...>>, ok) line 675 in gen_server:terminate/6 line 744
2016-04-07 23:22:01.111 [error] <0.1131.0> Supervisor {<0.1131.0>,poolboy_sup} had child riak_core_vnode_worker started with {riak_core_vnode_worker,start_link,undefined} at <0.9942.66> exit with reason no function clause matching riak_repl_keylist_server:bloom_fold({<<48,98,58,16,121,107,99,149,64,231,6,8,234,204,240,88,62,111,225>>,<<121,107,250,67,143,8,73,...>>}, <<53,1,0,0,0,34,131,108,0,0,0,1,104,2,109,0,0,0,8,239,219,125,147,226,32,130,220,104,2,97,1,110,...>>, ok) line 675 in context child_terminated

crash.log

Binary data has been replaced with <<"omitted binary">>.

2016-04-13 10:40:31 =ERROR REPORT====
** Generic server <0.916.0> terminating 
** Last message in was {'$gen_cast',{work,{fold,#Fun<riak_cs_kv_multi_backend.9.110104299>,#Fun<riak_kv_vnode.35.88487897>},{raw,#Ref<0.0.4.162629>,<0.22168.4>},<0.884.0>}}
** When Server state == {state,riak_kv_worker,{state,1118962191081472546749696200048404186924073353216}}
** Reason for termination == 
** {function_clause,[{riak_repl_keylist_server,bloom_fold,[{<<48,98,58,16,121,107,99,149,64,231,6,8,234,204,240,88,62,111,225>>,<<35,217,96,165,236,160,69,118,179,100,39,110,92,174,247,160,0,0,0,0>>},<<"omitted binary">>,ok],[{file,"src/riak_repl_keylist_server.erl"},{line,675}]},{bitcask_fileops,fold_int_loop,5,[{file,"src/bitcask_fileops.erl"},{line,554}]},{bitcask_fileops,fold_file_loop,8,[{file,"src/bitcask_fileops.erl"},{line,720}]},{bitcask_fileops,fold,3,[{file,"src/bitcask_fileops.erl"},{line,391}]},{bitcask,subfold,3,[{file,"src/bitcask.erl"},{line,506}]},{bitcask_nifs,keydir_frozen,4,[{file,"src/bitcask_nifs.erl"},{line,304}]},{riak_kv_bitcask_backend,'-fold_objects/4-fun-0-',5,[{file,"src/riak_kv_bitcask_backend.erl"},{line,351}]},{lists,foldl,3,[{file,"lists.erl"},{line,1248}]}]}
2016-04-13 10:40:31 =CRASH REPORT====
  crasher:
    initial call: riak_core_vnode_worker:init/1
    pid: <0.916.0>
    registered_name: []
    exception exit: {{function_clause,[{riak_repl_keylist_server,bloom_fold,[{<<48,98,58,16,121,107,99,149,64,231,6,8,234,204,240,88,62,111,225>>,<<35,217,96,165,236,160,69,118,179,100,39,110,92,174,247,160,0,0,0,0>>},<<"omitted binary">>,ok],[{file,"src/riak_repl_keylist_server.erl"},{line,675}]},{bitcask_fileops,fold_int_loop,5,[{file,"src/bitcask_fileops.erl"},{line,554}]},{bitcask_fileops,fold_file_loop,8,[{file,"src/bitcask_fileops.erl"},{line,720}]},{bitcask_fileops,fold,3,[{file,"src/bitcask_fileops.erl"},{line,391}]},{bitcask,subfold,3,[{file,"src/bitcask.erl"},{line,506}]},{bitcask_nifs,keydir_frozen,4,[{file,"src/bitcask_nifs.erl"},{line,304}]},{riak_kv_bitcask_backend,'-fold_objects/4-fun-0-',5,[{file,"src/riak_kv_bitcask_backend.erl"},{line,351}]},{lists,foldl,3,[{file,"lists.erl"},{line,1248}]}]},[{gen_server,terminate,6,[{file,"gen_server.erl"},{line,744}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,239}]}]}
    ancestors: [<0.887.0>,<0.885.0>,<0.884.0>,<0.680.0>,riak_core_vnode_sup,riak_core_sup,<0.220.0>]
    messages: [bloom_resume]
    links: [<0.887.0>,<0.885.0>]
    dictionary: [{bitcask_file_mod,bitcask_file},{bitcask_time_fudge,no_testing}]
    trap_exit: false
    status: running
    heap_size: 6772
    stack_size: 27
    reductions: 20747615
  neighbours:
2016-04-13 10:40:31 =SUPERVISOR REPORT====
     Supervisor: {<0.887.0>,poolboy_sup}
     Context:    child_terminated
     Reason:     {function_clause,[{riak_repl_keylist_server,bloom_fold,[{<<48,98,58,16,121,107,99,149,64,231,6,8,234,204,240,88,62,111,225>>,<<35,217,96,165,236,160,69,118,179,100,39,110,92,174,247,160,0,0,0,0>>},<<"omitted binary">>,ok],[{file,"src/riak_repl_keylist_server.erl"},{line,675}]},{bitcask_fileops,fold_int_loop,5,[{file,"src/bitcask_fileops.erl"},{line,554}]},{bitcask_fileops,fold_file_loop,8,[{file,"src/bitcask_fileops.erl"},{line,720}]},{bitcask_fileops,fold,3,[{file,"src/bitcask_fileops.erl"},{line,391}]},{bitcask,subfold,3,[{file,"src/bitcask.erl"},{line,506}]},{bitcask_nifs,keydir_frozen,4,[{file,"src/bitcask_nifs.erl"},{line,304}]},{riak_kv_bitcask_backend,'-fold_objects/4-fun-0-',5,[{file,"src/riak_kv_bitcask_backend.erl"},{line,351}]},{lists,foldl,3,[{file,"lists.erl"},{line,1248}]}]}
     Offender:   [{pid,<0.916.0>},{name,riak_core_vnode_worker},{mfargs,{riak_core_vnode_worker,start_link,undefined}},{restart_type,temporary},{shutdown,5000},{child_type,worker}]
