
Avoid causing deadlocks when copying rows on busy tables #1455

Merged 6 commits into master on Oct 23, 2024

Conversation

@arthurschreiber (Member) commented Oct 20, 2024

Description

We've noticed quite a lot of deadlocks happening when running a gh-ost migration on a relatively busy table (with lots of DML transactions spanning multiple rows).

We traced this down to the `INSERT INTO ... (SELECT ...)` query that's executed to copy rows from the original table to the ghost table. This query locks the selected rows with an S lock, in the order of the unique key used for the migration (most likely the PRIMARY key). If other queries lock rows in a different order, it's fairly easy to run into deadlocks.

These deadlocks then cause queries either to block or to fail once deadlock detection kicks in.

This pull request updates the row copy query to use `FOR SHARE NOWAIT` on the inner SELECT part of the query on MySQL 8. This makes the row copy process bail out early if it can't instantly acquire the S locks on the rows it's trying to copy, preventing deadlocks from happening. When that happens, the regular retry mechanism in gh-ost waits for a second and then tries the operation again.
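For illustration, the copy statement on MySQL 8 ends up with roughly this shape (a minimal sketch; the table, columns, and chunk range are made up, not gh-ost's actual generated SQL):

```sql
-- Hypothetical chunk-copy statement; `mytable`/`_mytable_gho` and the
-- id range are illustrative only.
INSERT /* gh-ost */ IGNORE INTO `_mytable_gho` (`id`, `name`)
  (SELECT `id`, `name`
     FROM `mytable` FORCE INDEX (`PRIMARY`)
    WHERE (`id` > 1000) AND (`id` <= 2000)
      FOR SHARE NOWAIT);
```

If any row in the chunk is already locked by another transaction, the statement fails immediately with `ER_LOCK_NOWAIT` (error 3572) instead of queueing behind that transaction.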

In case this PR introduced Go code changes:

  • contributed code is using same conventions as original code
  • script/cibuild returns with no formatting errors, build errors or unit test errors.

@arthurschreiber (Member Author)

I've noticed we run into deadlocks when we copy rows while there's high write load on the original table. I was able to trace this down to this statement.

The situation we're running into is essentially this:

t1: is deleting rows from the original table in batches
* holds X lock on row id = 1
* waits for X lock on row id = 2

t2: is running the `INSERT IGNORE INTO ... (SELECT ...)` statement with `lock in share mode`
* holds S lock on row id = 2
* waits for S lock on row id = 1
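In SQL terms, the interleaving looks roughly like this (a sketch against a hypothetical table `t` with primary key `id`; the real workload runs gh-ost's copy statement rather than plain SELECTs):

```sql
-- Hypothetical repro; t has primary key id, with rows id = 1 and id = 2.

-- t2: takes an S lock on id = 2 (the copy's SELECT part starts here)
BEGIN; SELECT * FROM t WHERE id = 2 LOCK IN SHARE MODE;

-- t1: takes an X lock on id = 1
BEGIN; DELETE FROM t WHERE id = 1;

-- t1: requests an X lock on id = 2 -> blocks behind t2's S lock
DELETE FROM t WHERE id = 2;

-- t2: requests an S lock on id = 1 -> blocks behind t1's X lock.
-- InnoDB detects the cycle and rolls one transaction back
-- with ER_LOCK_DEADLOCK (error 1213).
SELECT * FROM t WHERE id = 1 LOCK IN SHARE MODE;
```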

@timvaillancourt Have you seen this before? What's your take on the suggested change? I'll try running this on our test systems to see if there are any data consistency issues introduced (but I don't think there will be).

@danieljoos previously approved these changes Oct 21, 2024
@arthurschreiber (Member Author)

This change actually makes no difference at all.

`INSERT INTO ... (SELECT ...)` always locks the rows of the SELECT part using shared (S) locks. Under REPEATABLE READ it also takes gap locks, whereas under READ COMMITTED it does not.

Removing or adding the `LOCK IN SHARE MODE` clause thus makes no difference at all.
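To make that concrete, here's a sketch against hypothetical tables `t` and `t_gho`: under the default REPEATABLE READ isolation level, both statements take the same shared next-key locks on the selected rows of `t`.

```sql
-- Hypothetical tables; under REPEATABLE READ these are equivalent
-- lock-wise, since InnoDB S-locks the selected source rows either way.
INSERT IGNORE INTO t_gho (id) (SELECT id FROM t WHERE id <= 10);
INSERT IGNORE INTO t_gho (id) (SELECT id FROM t WHERE id <= 10 LOCK IN SHARE MODE);
```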

The proper way to prevent these deadlocks is to use `FOR SHARE NOWAIT`. With that option, the copy gives up instantly when it encounters a locked row. This allows gh-ost to pause for a short moment and then re-attempt the copy. I believe this is more in line with gh-ost's general philosophy of not interfering with production workloads, instead of causing deadlocks and blocked transactions (as it does currently).

For MySQL 5.7 (not sure if we want to support that?), an alternative would be to set the `innodb_lock_wait_timeout` session variable to ensure that gh-ost bails out quickly when encountering locked rows.
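A sketch of that 5.7 alternative (the 1-second value is illustrative, not a setting gh-ost uses):

```sql
-- Fail fast instead of queueing behind locked rows; subsequent copy
-- statements in this session error with ER_LOCK_WAIT_TIMEOUT (1205)
-- after ~1s instead of waiting out the 50s default.
SET SESSION innodb_lock_wait_timeout = 1;
```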

@drogart commented Oct 21, 2024

> The proper way to prevent these deadlocks is to use `FOR SHARE NOWAIT`. With that option, the copy gives up instantly when it encounters a locked row. This allows gh-ost to pause for a short moment and then re-attempt the copy. I believe this is more in line with gh-ost's general philosophy of not interfering with production workloads, instead of causing deadlocks and blocked transactions (as it does currently).

Is it possible to increment a counter and emit it as a metric when this happens? It's probably useful to know when the nominal app workload is consistently blocking a gh-ost copy.

I also wonder if the read lock is there to ensure consistency for MySQL instances not using READ COMMITTED or REPEATABLE READ. It probably doesn't matter if it's implicitly done by InnoDB for an `INSERT ... SELECT` statement, so I guess I'm just trying to recreate what Shlomi was thinking so many years ago :).

> For MySQL 5.7 (not sure if we want to support that?), an alternative would be to set the `innodb_lock_wait_timeout` session variable to ensure that gh-ost bails out quickly when encountering locked rows.

What's the retry behavior in this failure case (and, I guess, in the earlier one where a NOWAIT lock fails quickly)? I think we should probably use exponential backoff so we don't end up with a thundering herd problem from row locking triggering excessive retries.

@arthurschreiber (Member Author)

@drogart Thanks for taking a look. There's an existing retry helper that's used in a bunch of places in gh-ost, but I don't remember if it does exponential backoff or just plain retries. I agree that exponential retries are the way to go here.

@arthurschreiber (Member Author)

Apparently we have both: one helper for exponential retries and one for plain retries 👍

gh-ost/go/logic/migrator.go, lines 157 to 178 @ b34b86d:

```go
func (this *Migrator) retryOperationWithExponentialBackoff(operation func() error, notFatalHint ...bool) (err error) {
	var interval int64
	maxRetries := int(this.migrationContext.MaxRetries())
	maxInterval := this.migrationContext.ExponentialBackoffMaxInterval
	for i := 0; i < maxRetries; i++ {
		newInterval := int64(math.Exp2(float64(i - 1)))
		if newInterval <= maxInterval {
			interval = newInterval
		}
		if i != 0 {
			time.Sleep(time.Duration(interval) * time.Second)
		}
		err = operation()
		if err == nil {
			return nil
		}
	}
	if len(notFatalHint) == 0 {
		this.migrationContext.PanicAbort <- err
	}
	return err
}
```

gh-ost/go/logic/migrator.go, lines 133 to 150 @ b34b86d:

```go
func (this *Migrator) retryOperation(operation func() error, notFatalHint ...bool) (err error) {
	maxRetries := int(this.migrationContext.MaxRetries())
	for i := 0; i < maxRetries; i++ {
		if i != 0 {
			// sleep after previous iteration
			time.Sleep(1 * time.Second)
		}
		err = operation()
		if err == nil {
			return nil
		}
		// there's an error. Let's try again.
	}
	if len(notFatalHint) == 0 {
		this.migrationContext.PanicAbort <- err
	}
	return err
}
```

@arthurschreiber changed the title from "Do not lock rows when copying." to "Avoid causing deadlocks when copying rows on active tables" on Oct 23, 2024
@arthurschreiber changed the title from "Avoid causing deadlocks when copying rows on active tables" to "Avoid causing deadlocks when copying rows on busy tables" on Oct 23, 2024
@arthurschreiber (Member Author)

I updated the implementation to only add the NOWAIT clause if we're running on MySQL 8. I also updated the PR description (and added a note about retries).

@arthurschreiber (Member Author)

> What's the retry behavior in this failure case (and, I guess, in the earlier one where a NOWAIT lock fails quickly)? I think we should probably use exponential backoff so we don't end up with a thundering herd problem from row locking triggering excessive retries.

I opted not to modify the existing retry behaviour. It doesn't use exponential backoff, but it waits a second between retries and retries up to 60 times, I think. There's no risk of a thundering herd problem here, because only a single row copy process is running and the 1-second pause is already fairly long.

@arthurschreiber merged commit 30f28c2 into master on Oct 23, 2024. 8 checks passed.