
Bug Report: VReplication SwitchTraffic cancel fails due to not having topo locks #17620

Open

mattlord opened this issue Jan 23, 2025 · 0 comments · May be fixed by #17616

mattlord (Contributor) commented Jan 23, 2025

Overview of the Issue

When a SwitchTraffic command fails, we make a best-effort attempt to cancel the work and revert all state to what it was before the switch started.

Any errors encountered during this cancel work are logged but not returned to the client. This is because we have to execute a series of steps and we do NOT want to bail on the entire process due to a failure along the way (we want to revert as much as we can).

The cancel work for a workflow today will always fail, however, because we create a new context for that work and the topo locks held in the original context are lost in the new one:

// cancelMigration attempts to revert all changes made during the migration so that we can get back to the
// state when traffic switching (or reversing) was initiated.
func (ts *trafficSwitcher) cancelMigration(ctx context.Context, sm *StreamMigrator) {
	var err error

	if ctx.Err() != nil {
		// Even though we create a new context later on we still record any context error:
		// for forensics in case of failures.
		ts.Logger().Infof("In Cancel migration: original context invalid: %s", ctx.Err())
	}

	// We create a new context while canceling the migration, so that we are independent of the original
	// context being cancelled prior to or during the cancel operation.
	cmTimeout := 60 * time.Second
	cmCtx, cmCancel := context.WithTimeout(context.Background(), cmTimeout)
	defer cmCancel()
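
The topo lock bookkeeping travels with the original context, so building the cancel context from context.Background() discards it. Below is a minimal sketch of one way to keep the original context's values while detaching from its cancellation, assuming Go 1.21+'s context.WithoutCancel; whether this matches the approach taken in #17616 is not confirmed here:

	cmTimeout := 60 * time.Second
	// WithoutCancel keeps the parent's values (including any lock info carried
	// in the context) but is not canceled when the parent is canceled or times out.
	cmCtx, cmCancel := context.WithTimeout(context.WithoutCancel(ctx), cmTimeout)
	defer cmCancel()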

The failure is logged:

E0123 08:51:21.627147   96215 traffic_switcher.go:1161] Cancel migration failed: could not revert denied tables / shard access: Code: INTERNAL
keyspace product is not locked (no locksInfo)

And because we could not revert the denied tables (MoveTables) or shard access (Reshard), the keyspaces are NOT in an expected/healthy state.
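
For reference, the "no locksInfo" wording in the error comes from the topo lock check, which looks for lock bookkeeping carried in the context rather than consulting the topo server directly. A minimal sketch of that kind of guard, assuming the topo.CheckKeyspaceLocked helper in go/vt/topo (the exact call site is not shown here):

	// The check fails for a context built from context.Background() because that
	// context carries no lock information, even if the keyspace is still locked
	// in the topo server by the original caller.
	if err := topo.CheckKeyspaceLocked(cmCtx, "product"); err != nil {
		// err: "keyspace product is not locked (no locksInfo)"
		return err
	}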

Reproduction Steps

git checkout main && make build

cd examples/local

./101_initial_cluster.sh && ./201_customer_tablets.sh && ./202_move_tables.sh

alias vtctldclient='command vtctldclient --server=localhost:15999'

say "Run for loop in other shell"


# !!!!! In another shell
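# Presumably this loop's repeated READ locks block the vreplication applier's
# writes to the target's customer table, so the workflow falls behind and the
# SwitchTraffic sync below times out.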
customer_primary_uid=$(vtctldclient GetTablets --keyspace customer --tablet-type primary --shard "0" | awk '{print $1}' | cut -d- -f2 | bc)
for _ in {1..900}; do
    command mysql -u root --socket "${VTDATAROOT}/vt_0000000${customer_primary_uid}/mysql.sock" vt_customer -e "lock table customer read; select sleep(2); unlock tables"
done
# !!!!! In another shell


# Load data in the customer table
table_file="${VTDATAROOT}/vt_0000000100/data/vt_commerce/customer.ibd"
commerce_primary_uid=$(vtctldclient GetTablets --keyspace commerce --tablet-type primary --shard "0" | awk '{print $1}' | cut -d- -f2 | bc)

# Generate 5MiB of initial data
size=$((5*1024*1024))
while [[ $(stat -f "%z" "${table_file}") -lt ${size} ]]; do
    command mysql -u root --socket "${VTDATAROOT}/vt_0000000${commerce_primary_uid}/mysql.sock" vt_commerce -e "insert into customer (customer_id, email) values (${RANDOM}*${RANDOM}, '${RANDOM}[email protected]')" 2> /dev/null
done

# Grow that to at least 2GiB
size=$((2*1024*1024*1024))
i=1
while [[ $(stat -f "%z" "${table_file}") -lt ${size} ]]; do
    command mysql -u root --socket "${VTDATAROOT}/vt_0000000${commerce_primary_uid}/mysql.sock" vt_commerce -e "insert into customer (email) select concat(${i}, email) from customer limit 5000000"
    let i=i+1
done

say "Full data load completed"


vtctldclient MoveTables --workflow commerce2customer --target-keyspace customer switchtraffic

grep "Cancel migration failed" ${VTDATAROOT}/tmp/*

The client command returns:

❯ vtctldclient MoveTables --workflow commerce2customer --target-keyspace customer switchtraffic
E0123 09:18:01.306665   52584 main.go:60] rpc error: code = Unknown desc = failed to sync up replication between the source and target: rpc error: code = DeadlineExceeded desc = context deadline exceeded

And the logs show:

❯ grep "Cancel migration failed" ${VTDATAROOT}/tmp/*
/opt/vtdataroot/tmp/vtctld.ERROR:E0123 09:18:00.811819   32204 traffic_switcher.go:1161] Cancel migration failed: could not revert denied tables / shard access: Code: INTERNAL
/opt/vtdataroot/tmp/vtctld.INFO:E0123 09:18:00.811819   32204 traffic_switcher.go:1161] Cancel migration failed: could not revert denied tables / shard access: Code: INTERNAL
/opt/vtdataroot/tmp/vtctld.WARNING:E0123 09:18:00.811819   32204 traffic_switcher.go:1161] Cancel migration failed: could not revert denied tables / shard access: Code: INTERNAL
/opt/vtdataroot/tmp/vtctld.out:E0123 09:18:00.811819   32204 traffic_switcher.go:1161] Cancel migration failed: could not revert denied tables / shard access: Code: INTERNAL
/opt/vtdataroot/tmp/vtctld.pslord.matt.log.ERROR.20250123-091359.32204:E0123 09:18:00.811819   32204 traffic_switcher.go:1161] Cancel migration failed: could not revert denied tables / shard access: Code: INTERNAL
/opt/vtdataroot/tmp/vtctld.pslord.matt.log.INFO.20250123-091323.32204:E0123 09:18:00.811819   32204 traffic_switcher.go:1161] Cancel migration failed: could not revert denied tables / shard access: Code: INTERNAL
/opt/vtdataroot/tmp/vtctld.pslord.matt.log.WARNING.20250123-091359.32204:E0123 09:18:00.811819   32204 traffic_switcher.go:1161] Cancel migration failed: could not revert denied tables / shard access: Code: INTERNAL

Note

This also demonstrates that the max vreplication transaction lag calculation for workflows is wrong. The actual lag in this test is much more than 30 seconds, which is why we cannot catch up and hit the error above. Yet the workflow's max_v_replication_transaction_lag keeps getting reset to 0 at least once every 30 seconds because the heartbeats get through (by default they are sent at least once every 30 seconds; see --vreplication_heartbeat_update_interval). Each heartbeat updates the time_heartbeat field in the workflow, and that value is used in the calculation whenever it is greater than the transaction_timestamp:

// MaxVReplicationTransactionLag estimates the actual statement processing lag
// between the source and the target. If we are still processing source events it
// is the difference b/w current time and the timestamp of the last event. If
// heartbeats are more recent than the last event, then the lag is the time since
// the last heartbeat as there can be an actual event immediately after the
// heartbeat, but which has not yet been processed on the target.
// We don't allow switching during the copy phase, so in that case we just return
// a large lag. All timestamps are in seconds since epoch.
if rstream.TransactionTimestamp == nil {
	rstream.TransactionTimestamp = &vttimepb.Time{}
}
lastTransactionTime := rstream.TransactionTimestamp.Seconds
if rstream.TimeHeartbeat == nil {
	rstream.TimeHeartbeat = &vttimepb.Time{}
}
lastHeartbeatTime := rstream.TimeHeartbeat.Seconds
if stream.State == binlogdatapb.VReplicationWorkflowState_Copying.String() {
	meta.maxVReplicationTransactionLag = math.MaxInt64
} else {
	if lastTransactionTime == 0 /* no new events after copy */ ||
		lastHeartbeatTime > lastTransactionTime /* no recent transactions, so all caught up */ {
		lastTransactionTime = lastHeartbeatTime
	}
	now := time.Now().Unix() /* seconds since epoch */
	transactionReplicationLag := float64(now - lastTransactionTime)
	if transactionReplicationLag > meta.maxVReplicationTransactionLag {
		meta.maxVReplicationTransactionLag = transactionReplicationLag
	}
}

You can observe this by running the following in another shell while the test is running:

while true; do vtctldclient GetWorkflows customer --compact --include-logs=false | grep max_v_replication_transaction_lag; sleep 1; done
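
To make this concrete, here is a minimal standalone sketch of the calculation quoted above, using made-up timestamps: the last applied event is two minutes old, but a heartbeat arrived five seconds ago, so the reported lag collapses to roughly five seconds instead of two minutes.

package main

import (
	"fmt"
	"time"
)

func main() {
	// Made-up timestamps (seconds since epoch) to illustrate the issue.
	now := time.Now().Unix()
	lastTransactionTime := now - 120 // last applied event: 2 minutes ago
	lastHeartbeatTime := now - 5     // last heartbeat: 5 seconds ago

	// Mirrors the quoted logic: a more recent heartbeat replaces the
	// transaction timestamp, so the reported lag is ~5s even though no
	// real events have been applied for ~120s.
	if lastTransactionTime == 0 || lastHeartbeatTime > lastTransactionTime {
		lastTransactionTime = lastHeartbeatTime
	}
	fmt.Println("reported lag (seconds):", now-lastTransactionTime) // prints ~5, not ~120
}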

Binary Version

vtgate version Version: 22.0.0-SNAPSHOT (Git revision 5363f038ace51165afcc8357bc6e1c81ee52a612 branch 'main') built on Thu Jan 23 14:11:33 UTC 2025 by [email protected] using go1.23.5 darwin/arm64

Operating System and Environment details

N/A

Log Fragments
