Overview of the Issue
When a SwitchTraffic command fails, we make a best-effort attempt to cancel the work and revert all state back to what it was before the switch started.
Any errors encountered during the cancel work are logged and not returned to the client. This is because we have to execute a series of steps and we do NOT want to bail on the entire process due to a failure along the way (we want to revert as much as we can).
The cancel work for a workflow today will always fail because we create a new context for that work and the locks are lost between the original context and the new one. The failure is logged:
E0123 08:51:21.627147 96215 traffic_switcher.go:1161] Cancel migration failed: could not revert denied tables / shard access: Code: INTERNAL
keyspace product is not locked (no locksInfo)
And because we could not revert the denied tables (MoveTables) or shard access (Reshard), the keyspaces are NOT in an expected/healthy state.
Reproduction Steps
git checkout main && make build
cd examples/local
./101_initial_cluster.sh && ./201_customer_tablets.sh && ./202_move_tables.sh
alias vtctldclient='command vtctldclient --server=localhost:15999'
say "Run for loop in other shell"
# !!!!! In another shell
customer_primary_uid=$(vtctldclient GetTablets --keyspace customer --tablet-type primary --shard "0" | awk '{print $1}' | cut -d- -f2 | bc)
for _ in {1..900}; do
command mysql -u root --socket "${VTDATAROOT}/vt_0000000${customer_primary_uid}/mysql.sock" vt_customer -e "lock table customer read; select sleep(2); unlock tables"
done
# !!!!! In another shell
# Load data in the customer table
table_file="${VTDATAROOT}/vt_0000000100/data/vt_commerce/customer.ibd"
commerce_primary_uid=$(vtctldclient GetTablets --keyspace commerce --tablet-type primary --shard "0" | awk '{print $1}' | cut -d- -f2 | bc)
# Generate 5MiB of initial data
size=$((5*1024*1024))
while [[ $(stat -f "%z" "${table_file}") -lt ${size} ]]; do
command mysql -u root --socket "${VTDATAROOT}/vt_0000000${commerce_primary_uid}/mysql.sock" vt_commerce -e "insert into customer (customer_id, email) values (${RANDOM}*${RANDOM}, '${RANDOM}[email protected]')" 2> /dev/null
done
# Grow that to at least 2GiB
size=$((2*1024*1024*1024))
i=1
while [[ $(stat -f "%z" "${table_file}") -lt ${size} ]]; do
command mysql -u root --socket "${VTDATAROOT}/vt_0000000${commerce_primary_uid}/mysql.sock" vt_commerce -e "insert into customer (email) select concat(${i}, email) from customer limit 5000000"
let i=i+1
done
say "Full data load completed"
vtctldclient MoveTables --workflow commerce2customer --target-keyspace customer switchtraffic
grep "Cancel migration failed" ${VTDATAROOT}/tmp/*
The client command returns:
❯ vtctldclient MoveTables --workflow commerce2customer --target-keyspace customer switchtraffic
E0123 09:18:01.306665 52584 main.go:60] rpc error: code = Unknown desc = failed to sync up replication between the source and target: rpc error: code = DeadlineExceeded desc = context deadline exceeded
And the logs show:
❯ grep "Cancel migration failed" ${VTDATAROOT}/tmp/*
/opt/vtdataroot/tmp/vtctld.ERROR:E0123 09:18:00.811819 32204 traffic_switcher.go:1161] Cancel migration failed: could not revert denied tables / shard access: Code: INTERNAL
/opt/vtdataroot/tmp/vtctld.INFO:E0123 09:18:00.811819 32204 traffic_switcher.go:1161] Cancel migration failed: could not revert denied tables / shard access: Code: INTERNAL
/opt/vtdataroot/tmp/vtctld.WARNING:E0123 09:18:00.811819 32204 traffic_switcher.go:1161] Cancel migration failed: could not revert denied tables / shard access: Code: INTERNAL
/opt/vtdataroot/tmp/vtctld.out:E0123 09:18:00.811819 32204 traffic_switcher.go:1161] Cancel migration failed: could not revert denied tables / shard access: Code: INTERNAL
/opt/vtdataroot/tmp/vtctld.pslord.matt.log.ERROR.20250123-091359.32204:E0123 09:18:00.811819 32204 traffic_switcher.go:1161] Cancel migration failed: could not revert denied tables / shard access: Code: INTERNAL
/opt/vtdataroot/tmp/vtctld.pslord.matt.log.INFO.20250123-091323.32204:E0123 09:18:00.811819 32204 traffic_switcher.go:1161] Cancel migration failed: could not revert denied tables / shard access: Code: INTERNAL
/opt/vtdataroot/tmp/vtctld.pslord.matt.log.WARNING.20250123-091359.32204:E0123 09:18:00.811819 32204 traffic_switcher.go:1161] Cancel migration failed: could not revert denied tables / shard access: Code: INTERNAL
Note
This also demonstrates that the max vreplication transaction lag calculation for workflows is wrong. The actual lag in this test is much more than 30 seconds, which is why we cannot catch up and get the error above, but the workflow's max_v_replication_transaction_lag keeps getting reset to 0 at least once every 30 seconds. The reset happens because heartbeats are still getting through (by default they are sent at least once every 30 seconds, see --vreplication_heartbeat_update_interval). Each heartbeat updates the time_heartbeat field in the workflow, and that field is used in the calculation when it is greater than the transaction_timestamp.
You can observe this if you run this in another shell while the test is running: while true; do vtctldclient GetWorkflows customer --compact --include-logs=false | grep max_v_replication_transaction_lag; sleep 1; done
Binary Version
vtgate version Version: 22.0.0-SNAPSHOT (Git revision 5363f038ace51165afcc8357bc6e1c81ee52a612 branch 'main') built on Thu Jan 23 14:11:33 UTC 2025 by [email protected] using go1.23.5 darwin/arm64
Operating System and Environment details
N/A
Log Fragments
Code References
Cancel work context: vitess/go/vt/vtctl/workflow/traffic_switcher.go, lines 1138 to 1153 in f2827f9
Lag calculation: vitess/go/vt/vtctl/workflow/workflows.go, lines 449 to 478 in f2827f9