
Bug Report: VReplication SwitchTraffic cancel fails due to not having topo locks #17620

Open

mattlord opened this issue Jan 23, 2025 · 0 comments · May be fixed by #17616

mattlord (Contributor) commented Jan 23, 2025

Overview of the Issue

When a SwitchTraffic command fails, we make a best-effort attempt to cancel the work and revert all state to what it was before the switch started.

Any errors encountered during this cancel work are logged but not returned to the client. This is because we have to execute a series of steps and we do NOT want to bail on the entire process due to a failure along the way (we want to revert as much as we can).

The cancel work for a workflow today will always fail, however, because we create a new context for that work and the topo locks held in the original context are lost in the new one:

// cancelMigration attempts to revert all changes made during the migration so that we can get back to the
// state when traffic switching (or reversing) was initiated.
func (ts *trafficSwitcher) cancelMigration(ctx context.Context, sm *StreamMigrator) {
	var err error

	if ctx.Err() != nil {
		// Even though we create a new context later on we still record any context error:
		// for forensics in case of failures.
		ts.Logger().Infof("In Cancel migration: original context invalid: %s", ctx.Err())
	}

	// We create a new context while canceling the migration, so that we are independent of the original
	// context being cancelled prior to or during the cancel operation.
	cmTimeout := 60 * time.Second
	cmCtx, cmCancel := context.WithTimeout(context.Background(), cmTimeout)
	defer cmCancel()
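
The topo lock bookkeeping travels with the original context, so building the cancel context from context.Background() discards it. Below is a minimal sketch of one way to keep the original context's values while detaching from its cancellation, assuming Go 1.21+'s context.WithoutCancel; whether this matches the approach taken in #17616 is not confirmed here:

	cmTimeout := 60 * time.Second
	// WithoutCancel keeps the parent's values (including any lock info carried
	// in the context) but is not canceled when the parent is canceled or times out.
	cmCtx, cmCancel := context.WithTimeout(context.WithoutCancel(ctx), cmTimeout)
	defer cmCancel()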

The failure is logged:

E0123 08:51:21.627147   96215 traffic_switcher.go:1161] Cancel migration failed: could not revert denied tables / shard access: Code: INTERNAL
keyspace product is not locked (no locksInfo)

And because we could not revert the denied tables (MoveTables) or shard access (Reshard), the keyspaces are NOT in an expected/healthy state.
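
For reference, the "no locksInfo" wording in the error comes from the topo lock check, which looks for lock bookkeeping carried in the context rather than consulting the topo server directly. A minimal sketch of that kind of guard, assuming the topo.CheckKeyspaceLocked helper in go/vt/topo (the exact call site is not shown here):

	// The check fails for a context built from context.Background() because that
	// context carries no lock information, even if the keyspace is still locked
	// in the topo server by the original caller.
	if err := topo.CheckKeyspaceLocked(cmCtx, "product"); err != nil {
		// err: "keyspace product is not locked (no locksInfo)"
		return err
	}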

Reproduction Steps

git checkout main && make build

cd examples/local

./101_initial_cluster.sh && ./201_customer_tablets.sh && ./202_move_tables.sh

alias vtctldclient='command vtctldclient --server=localhost:15999'

say "Run for loop in other shell"


# !!!!! In another shell
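# Presumably this loop's repeated READ locks block the vreplication applier's
# writes to the target's customer table, so the workflow falls behind and the
# SwitchTraffic sync below times out.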
customer_primary_uid=$(vtctldclient GetTablets --keyspace customer --tablet-type primary --shard "0" | awk '{print $1}' | cut -d- -f2 | bc)
for _ in {1..900}; do
    command mysql -u root --socket "${VTDATAROOT}/vt_0000000${customer_primary_uid}/mysql.sock" vt_customer -e "lock table customer read; select sleep(2); unlock tables"
done
# !!!!! In another shell


# Load data in the customer table
table_file="${VTDATAROOT}/vt_0000000100/data/vt_commerce/customer.ibd"
commerce_primary_uid=$(vtctldclient GetTablets --keyspace commerce --tablet-type primary --shard "0" | awk '{print $1}' | cut -d- -f2 | bc)

# Generate 5MiB of initial data
size=$((5*1024*1024))
while [[ $(stat -f "%z" "${table_file}") -lt ${size} ]]; do
    command mysql -u root --socket "${VTDATAROOT}/vt_0000000${commerce_primary_uid}/mysql.sock" vt_commerce -e "insert into customer (customer_id, email) values (${RANDOM}*${RANDOM}, '${RANDOM}[email protected]')" 2> /dev/null
done

# Grow that to at least 2GiB
size=$((2*1024*1024*1024))
i=1
while [[ $(stat -f "%z" "${table_file}") -lt ${size} ]]; do
    command mysql -u root --socket "${VTDATAROOT}/vt_0000000${commerce_primary_uid}/mysql.sock" vt_commerce -e "insert into customer (email) select concat(${i}, email) from customer limit 5000000"
    let i=i+1
done

say "Full data load completed"


vtctldclient MoveTables --workflow commerce2customer --target-keyspace customer switchtraffic

grep "Cancel migration failed" ${VTDATAROOT}/tmp/*

The client command returns:

❯ vtctldclient MoveTables --workflow commerce2customer --target-keyspace customer switchtraffic
E0123 09:18:01.306665   52584 main.go:60] rpc error: code = Unknown desc = failed to sync up replication between the source and target: rpc error: code = DeadlineExceeded desc = context deadline exceeded

And the logs show:

❯ grep "Cancel migration failed" ${VTDATAROOT}/tmp/*
/opt/vtdataroot/tmp/vtctld.ERROR:E0123 09:18:00.811819   32204 traffic_switcher.go:1161] Cancel migration failed: could not revert denied tables / shard access: Code: INTERNAL
/opt/vtdataroot/tmp/vtctld.INFO:E0123 09:18:00.811819   32204 traffic_switcher.go:1161] Cancel migration failed: could not revert denied tables / shard access: Code: INTERNAL
/opt/vtdataroot/tmp/vtctld.WARNING:E0123 09:18:00.811819   32204 traffic_switcher.go:1161] Cancel migration failed: could not revert denied tables / shard access: Code: INTERNAL
/opt/vtdataroot/tmp/vtctld.out:E0123 09:18:00.811819   32204 traffic_switcher.go:1161] Cancel migration failed: could not revert denied tables / shard access: Code: INTERNAL
/opt/vtdataroot/tmp/vtctld.pslord.matt.log.ERROR.20250123-091359.32204:E0123 09:18:00.811819   32204 traffic_switcher.go:1161] Cancel migration failed: could not revert denied tables / shard access: Code: INTERNAL
/opt/vtdataroot/tmp/vtctld.pslord.matt.log.INFO.20250123-091323.32204:E0123 09:18:00.811819   32204 traffic_switcher.go:1161] Cancel migration failed: could not revert denied tables / shard access: Code: INTERNAL
/opt/vtdataroot/tmp/vtctld.pslord.matt.log.WARNING.20250123-091359.32204:E0123 09:18:00.811819   32204 traffic_switcher.go:1161] Cancel migration failed: could not revert denied tables / shard access: Code: INTERNAL

Note

This also demonstrates that the max vreplication transaction lag calculation for workflows is wrong. The actual lag in this test is much more than 30 seconds, which is why we cannot catch up and hit the error above. Yet the workflow's max_v_replication_transaction_lag keeps getting reset to 0 at least once every 30 seconds because the heartbeats get through (by default they are sent at least once every 30 seconds; see --vreplication_heartbeat_update_interval). Each heartbeat updates the time_heartbeat field in the workflow, and that value is used in the calculation whenever it is greater than the transaction_timestamp:

// MaxVReplicationTransactionLag estimates the actual statement processing lag
// between the source and the target. If we are still processing source events it
// is the difference b/w current time and the timestamp of the last event. If
// heartbeats are more recent than the last event, then the lag is the time since
// the last heartbeat as there can be an actual event immediately after the
// heartbeat, but which has not yet been processed on the target.
// We don't allow switching during the copy phase, so in that case we just return
// a large lag. All timestamps are in seconds since epoch.
if rstream.TransactionTimestamp == nil {
	rstream.TransactionTimestamp = &vttimepb.Time{}
}
lastTransactionTime := rstream.TransactionTimestamp.Seconds
if rstream.TimeHeartbeat == nil {
	rstream.TimeHeartbeat = &vttimepb.Time{}
}
lastHeartbeatTime := rstream.TimeHeartbeat.Seconds
if stream.State == binlogdatapb.VReplicationWorkflowState_Copying.String() {
	meta.maxVReplicationTransactionLag = math.MaxInt64
} else {
	if lastTransactionTime == 0 /* no new events after copy */ ||
		lastHeartbeatTime > lastTransactionTime /* no recent transactions, so all caught up */ {
		lastTransactionTime = lastHeartbeatTime
	}
	now := time.Now().Unix() /* seconds since epoch */
	transactionReplicationLag := float64(now - lastTransactionTime)
	if transactionReplicationLag > meta.maxVReplicationTransactionLag {
		meta.maxVReplicationTransactionLag = transactionReplicationLag
	}
}

You can observe this by running the following in another shell while the test is running:

while true; do vtctldclient GetWorkflows customer --compact --include-logs=false | grep max_v_replication_transaction_lag; sleep 1; done
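
To make this concrete, here is a minimal standalone sketch of the calculation quoted above, using made-up timestamps: the last applied event is two minutes old, but a heartbeat arrived five seconds ago, so the reported lag collapses to roughly five seconds instead of two minutes.

package main

import (
	"fmt"
	"time"
)

func main() {
	// Made-up timestamps (seconds since epoch) to illustrate the issue.
	now := time.Now().Unix()
	lastTransactionTime := now - 120 // last applied event: 2 minutes ago
	lastHeartbeatTime := now - 5     // last heartbeat: 5 seconds ago

	// Mirrors the quoted logic: a more recent heartbeat replaces the
	// transaction timestamp, so the reported lag is ~5s even though no
	// real events have been applied for ~120s.
	if lastTransactionTime == 0 || lastHeartbeatTime > lastTransactionTime {
		lastTransactionTime = lastHeartbeatTime
	}
	fmt.Println("reported lag (seconds):", now-lastTransactionTime) // prints ~5, not ~120
}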

Binary Version

vtgate version Version: 22.0.0-SNAPSHOT (Git revision 5363f038ace51165afcc8357bc6e1c81ee52a612 branch 'main') built on Thu Jan 23 14:11:33 UTC 2025 by [email protected] using go1.23.5 darwin/arm64

Operating System and Environment details

N/A

Log Fragments
