
Avoid causing deadlocks when copying rows on busy tables #1455

Merged 6 commits into master on Oct 23, 2024

Conversation

@arthurschreiber (Member) commented Oct 20, 2024

Description

We've noticed quite a lot of deadlocks happening when running a gh-ost migration on a relatively busy table (with lots of DML transactions spanning multiple rows).

We traced this down to the `INSERT INTO ... (SELECT ...)` query that's executed to copy rows from the original table to the ghost table. This query locks the selected rows with an S lock, in the order of the unique key used for the migration (most likely the PRIMARY key). If other queries lock rows in a different order, it's fairly easy to run into deadlocks.

These deadlocks then cause queries either to block or to fail once deadlock detection kicks in.

This pull request updates the row copy query to use `FOR SHARE NOWAIT` on the inner SELECT part of the query on MySQL 8. This makes the row copy process bail out early if it can't instantly acquire the S locks on the rows it's trying to copy, preventing deadlocks from happening. When that happens, the regular retry mechanism in gh-ost waits for a second and then tries the operation again.
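For illustration, the copy statement on MySQL 8 ends up with roughly this shape (a minimal sketch; the table, columns, and chunk range are made up, not gh-ost's actual generated SQL):

```sql
-- Hypothetical chunk-copy statement; `mytable`/`_mytable_gho` and the
-- id range are illustrative only.
INSERT /* gh-ost */ IGNORE INTO `_mytable_gho` (`id`, `name`)
  (SELECT `id`, `name`
     FROM `mytable` FORCE INDEX (`PRIMARY`)
    WHERE (`id` > 1000) AND (`id` <= 2000)
      FOR SHARE NOWAIT);
```

If any row in the chunk is already locked by another transaction, the statement fails immediately with `ER_LOCK_NOWAIT` (error 3572) instead of queueing behind that transaction.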

In case this PR introduced Go code changes:

  • contributed code is using same conventions as original code
  • script/cibuild returns with no formatting errors, build errors or unit test errors.

@arthurschreiber (Member Author)

I've noticed we run into deadlocks when we copy rows while there's high write load on the original table. I was able to trace this down to this statement.

The situation we're running into is essentially this:

t1: is deleting rows from the original table in batches
* holds X lock on row id = 1
* waits for X lock on row id = 2

t2: is running the `INSERT IGNORE INTO ... (SELECT ...)` statement with `lock in share mode`
* holds S lock on row id = 2
* waits for S lock on row id = 1
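In SQL terms, the interleaving looks roughly like this (a sketch against a hypothetical table `t` with primary key `id`; the real workload runs gh-ost's copy statement rather than plain SELECTs):

```sql
-- Hypothetical repro; t has primary key id, with rows id = 1 and id = 2.

-- t2: takes an S lock on id = 2 (the copy's SELECT part starts here)
BEGIN; SELECT * FROM t WHERE id = 2 LOCK IN SHARE MODE;

-- t1: takes an X lock on id = 1
BEGIN; DELETE FROM t WHERE id = 1;

-- t1: requests an X lock on id = 2 -> blocks behind t2's S lock
DELETE FROM t WHERE id = 2;

-- t2: requests an S lock on id = 1 -> blocks behind t1's X lock.
-- InnoDB detects the cycle and rolls one transaction back
-- with ER_LOCK_DEADLOCK (error 1213).
SELECT * FROM t WHERE id = 1 LOCK IN SHARE MODE;
```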

@timvaillancourt Have you seen this before? What's your take on the suggested change? I'll try running this on our test systems to see if there are any data consistency issues introduced (but I don't think there will be).

@danieljoos previously approved these changes Oct 21, 2024
@arthurschreiber (Member Author)

This change actually makes no difference at all.

`INSERT INTO ... (SELECT ...)` always locks the rows of the SELECT part using shared (S) locks. Under REPEATABLE READ it also takes gap locks, whereas under READ COMMITTED it does not.

Removing or adding the `LOCK IN SHARE MODE` clause thus makes no difference at all.
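To make that concrete, here's a sketch against hypothetical tables `t` and `t_gho`: under the default REPEATABLE READ isolation level, both statements take the same shared next-key locks on the selected rows of `t`.

```sql
-- Hypothetical tables; under REPEATABLE READ these are equivalent
-- lock-wise, since InnoDB S-locks the selected source rows either way.
INSERT IGNORE INTO t_gho (id) (SELECT id FROM t WHERE id <= 10);
INSERT IGNORE INTO t_gho (id) (SELECT id FROM t WHERE id <= 10 LOCK IN SHARE MODE);
```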

The proper way to prevent these deadlocks is to use `FOR SHARE NOWAIT`. With that option, the copy gives up instantly when it encounters a locked row. This allows gh-ost to pause for a short moment and then re-attempt the copy. I believe this is more in line with gh-ost's general philosophy of not interfering with production workloads, instead of causing deadlocks and blocked transactions (as it does currently).

For MySQL 5.7 (not sure if we want to support that?), an alternative would be to set the `innodb_lock_wait_timeout` session variable to ensure that gh-ost bails out quickly when encountering locked rows.
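A sketch of that 5.7 alternative (the 1-second value is illustrative, not a setting gh-ost uses):

```sql
-- Fail fast instead of queueing behind locked rows; subsequent copy
-- statements in this session error with ER_LOCK_WAIT_TIMEOUT (1205)
-- after ~1s instead of waiting out the 50s default.
SET SESSION innodb_lock_wait_timeout = 1;
```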

@drogart commented Oct 21, 2024

> The proper way to prevent these deadlocks is to use `FOR SHARE NOWAIT`. With that option, the copy gives up instantly when it encounters a locked row. This allows gh-ost to pause for a short moment and then re-attempt the copy. I believe this is more in line with gh-ost's general philosophy of not interfering with production workloads, instead of causing deadlocks and blocked transactions (as it does currently).

Is it possible to increment a counter and emit it as a metric when this happens? It's probably useful to know when the nominal app workload is consistently blocking a gh-ost copy.

I also wonder if the read lock is there to ensure consistency for MySQL instances not using READ COMMITTED or REPEATABLE READ. It probably doesn't matter if it's implicitly done by InnoDB for an `INSERT ... SELECT` statement, so I guess I'm just trying to recreate what Shlomi was thinking so many years ago :).

> For MySQL 5.7 (not sure if we want to support that?), an alternative would be to set the `innodb_lock_wait_timeout` session variable to ensure that gh-ost bails out quickly when encountering locked rows.

What's the retry behavior in this failure case (and, I guess, in the earlier one where a NOWAIT lock fails quickly)? I think we should probably use exponential backoff so we don't end up with a thundering herd problem from row locking triggering excessive retries.

@arthurschreiber (Member Author)

@drogart Thanks for taking a look. There's an existing retry helper that's used in a bunch of places in gh-ost, but I don't remember if it does exponential backoff or just plain retries. I agree that exponential retries are the way to go here.

@arthurschreiber (Member Author)

Apparently we have both: one helper for exponential retries and one for plain retries 👍

gh-ost/go/logic/migrator.go, lines 157 to 178 @ b34b86d:

```go
func (this *Migrator) retryOperationWithExponentialBackoff(operation func() error, notFatalHint ...bool) (err error) {
	var interval int64
	maxRetries := int(this.migrationContext.MaxRetries())
	maxInterval := this.migrationContext.ExponentialBackoffMaxInterval
	for i := 0; i < maxRetries; i++ {
		newInterval := int64(math.Exp2(float64(i - 1)))
		if newInterval <= maxInterval {
			interval = newInterval
		}
		if i != 0 {
			time.Sleep(time.Duration(interval) * time.Second)
		}
		err = operation()
		if err == nil {
			return nil
		}
	}
	if len(notFatalHint) == 0 {
		this.migrationContext.PanicAbort <- err
	}
	return err
}
```

gh-ost/go/logic/migrator.go, lines 133 to 150 @ b34b86d:

```go
func (this *Migrator) retryOperation(operation func() error, notFatalHint ...bool) (err error) {
	maxRetries := int(this.migrationContext.MaxRetries())
	for i := 0; i < maxRetries; i++ {
		if i != 0 {
			// sleep after previous iteration
			time.Sleep(1 * time.Second)
		}
		err = operation()
		if err == nil {
			return nil
		}
		// there's an error. Let's try again.
	}
	if len(notFatalHint) == 0 {
		this.migrationContext.PanicAbort <- err
	}
	return err
}
```

@arthurschreiber changed the title from "Do not lock rows when copying." to "Avoid causing deadlocks when copying rows on active tables" on Oct 23, 2024
@arthurschreiber changed the title from "Avoid causing deadlocks when copying rows on active tables" to "Avoid causing deadlocks when copying rows on busy tables" on Oct 23, 2024
@arthurschreiber (Member Author)

I updated the implementation to only add the NOWAIT clause if we're running on MySQL 8. I also updated the PR description (and added a note about retries).

@arthurschreiber (Member Author)

> What's the retry behavior in this failure case (and, I guess, in the earlier one where a NOWAIT lock fails quickly)? I think we should probably use exponential backoff so we don't end up with a thundering herd problem from row locking triggering excessive retries.

I opted not to modify the existing retry behaviour. It doesn't use exponential backoff, but it waits a second between retries and retries up to 60 times, I think. There's no risk of a thundering herd problem here, because only a single row copy process is running and the 1-second pause is already fairly long.

@arthurschreiber merged commit 30f28c2 into master on Oct 23, 2024. 8 checks passed.