Avoid causing deadlocks when copying rows on busy tables #1455
I've noticed we run into deadlocks when we copy rows while there's high write load on the original table. I was able to trace this down to this statement. The situation we're running into is essentially this:
@timvaillancourt Have you seen this before? What's your take on the suggested change? I'll try running this on our test systems to see whether it introduces any data consistency issues (I don't think it will). |
This change actually makes no difference at all.
Removing or adding the The proper way to prevent these deadlocks will be to use For MySQL 5.7 (not sure if we want to support that?), an alternative would be to set the |
Is it possible to increment a counter and emit it as a metric when this happens? It's probably useful to know when the nominal app workload is consistently blocking a gh-ost copy. I wonder also if the use of the read lock is to ensure consistency for mysql instances not using
What's the retry behavior in this failure case (and I guess the previous one when it quickly fails a NOWAIT lock)? I think we should probably use exponential backoff so we don't end up with a thundering herd problem due to row locking triggering excessive retries. |
@drogart Thanks for taking a look. There's some existing retry helper that's used in a bunch of places in gh-ost, but I don't remember if it does exponential backoff or just direct retries. But I agree that an exponential retry is the way to go here. |
Apparently we have both, one for exponential and one for non-exponential retries 👍 (lines 157 to 178 and lines 133 to 150 in b34b86d). |
I updated the implementation to only add the |
I opted not to modify the existing retry behaviour. It doesn't use exponential backoff, but it waits for a second between retries and retries up to 60 times, I think. There's no risk of a thundering herd problem here, because only a single row copy process is running and a 1 second pause is actually fairly long. |
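For readers following the retry discussion, here is a minimal Go sketch of the two strategies mentioned: a fixed pause between attempts and an exponential backoff. This is not gh-ost's retry code. The names `retry`, `fixedPause`, `exponentialPause` and `errLockNotAvailable` are hypothetical, and the 60 attempts and 1 second pause are the approximate figures quoted in the comment above, not verified values.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errLockNotAvailable is a hypothetical stand-in for the error a row copy
// would see when a NOWAIT lock cannot be taken immediately.
var errLockNotAvailable = errors.New("lock not available")

// pauseFunc returns how long to wait after the given (1-based) failed attempt.
type pauseFunc func(attempt int) time.Duration

// fixedPause waits the same duration after every failed attempt.
func fixedPause(d time.Duration) pauseFunc {
	return func(int) time.Duration { return d }
}

// exponentialPause doubles the wait after each failed attempt, capped at limit.
func exponentialPause(base, limit time.Duration) pauseFunc {
	return func(attempt int) time.Duration {
		d := base
		for i := 1; i < attempt && d < limit; i++ {
			d *= 2
		}
		if d > limit {
			d = limit
		}
		return d
	}
}

// retry calls op until it succeeds or maxAttempts is reached. It returns the
// number of attempts made and the last error, if any.
func retry(maxAttempts int, pause pauseFunc, op func() error) (int, error) {
	if maxAttempts < 1 {
		return 0, errors.New("maxAttempts must be at least 1")
	}
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err = op(); err == nil {
			return attempt, nil
		}
		if attempt < maxAttempts {
			time.Sleep(pause(attempt))
		}
	}
	return maxAttempts, err
}

func main() {
	calls := 0
	copyChunk := func() error {
		calls++
		if calls < 3 {
			return errLockNotAvailable // simulated lock conflict
		}
		return nil
	}
	// The comment above describes roughly 60 attempts with a 1 second pause;
	// a short pause is used here so the demo finishes quickly.
	attempts, err := retry(60, fixedPause(10*time.Millisecond), copyChunk)
	fmt.Println(attempts, err)
}
```

With a single row copy process, the fixed pause keeps the behaviour easy to reason about. Swapping in `exponentialPause(time.Second, time.Minute)` would be the variant suggested earlier in the thread.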
Description
We've noticed quite a lot of deadlocks happening when running a gh-ost migration on a relatively busy table (with lots of DML transactions spanning multiple rows).
We traced this down to the INSERT INTO ... (SELECT ...) query that's executed to copy rows from the original table to the ghost table. This query locks rows using an S lock in the order of the unique key used for the migration (most likely the PRIMARY key). If other queries end up locking rows in a different order, it's fairly easy to run into deadlocks.
These deadlocks then cause queries either to be blocked or to fail once deadlock detection kicks in.
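To illustrate why failing fast helps, here is a toy Go model of row locking. It is only an analogy and does not reproduce InnoDB's behaviour: `lockTable`, `tryLockAll` and `unlockAll` are hypothetical names, and every lock is treated as exclusive for simplicity. The point it shows is that a copier which gives up immediately on a conflict never holds some rows while waiting for others, which is the precondition for a lock-order deadlock.

```go
package main

import "fmt"

// lockTable is a toy stand-in for row locks: row ID -> current owner.
type lockTable map[int]string

// tryLockAll takes the rows in the given order. If any row is already held
// by another owner it releases everything it took in this call and reports
// failure instead of waiting (a rough analogue of NOWAIT).
func (t lockTable) tryLockAll(owner string, rows []int) bool {
	taken := make([]int, 0, len(rows))
	for _, r := range rows {
		holder, held := t[r]
		if held && holder != owner {
			for _, u := range taken {
				delete(t, u)
			}
			return false
		}
		if !held {
			t[r] = owner
			taken = append(taken, r)
		}
	}
	return true
}

// unlockAll releases every row held by owner.
func (t lockTable) unlockAll(owner string) {
	for r, holder := range t {
		if holder == owner {
			delete(t, r)
		}
	}
}

func main() {
	locks := lockTable{}

	// An application transaction already holds row 3.
	locks.tryLockAll("app", []int{3})

	// The copier walks rows 1..3 in key order, hits row 3 and backs out.
	fmt.Println(locks.tryLockAll("copy", []int{1, 2, 3})) // false
	fmt.Println(len(locks))                               // 1: only the app's lock remains

	// Once the application commits, a retry succeeds.
	locks.unlockAll("app")
	fmt.Println(locks.tryLockAll("copy", []int{1, 2, 3})) // true
}
```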
This pull request updates the row copy query to use for share nowait on the inner SELECT part of the query on MySQL 8. This makes the row copy process bail out early if it can't instantly get the S locks on the rows that it's trying to copy, preventing deadlocks from happening. If this happens, the regular retry mechanism in gh-ost will wait for a second and then try the operation again.
script/cibuild returns with no formatting errors, build errors or unit test errors.
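As a rough sketch of the change described above, the following Go function appends for share nowait to the inner SELECT only when the server is MySQL 8. It is a simplified illustration, not the query builder in gh-ost: `buildRowCopyQuery`, its parameters, the `between ? and ?` range condition and the example table names are all assumptions made for the example.

```go
package main

import (
	"fmt"
	"strings"
)

// buildRowCopyQuery is a hypothetical, simplified builder for the
// INSERT INTO ... (SELECT ...) statement that copies one chunk of rows.
// When isMySQL8 is true, the inner SELECT gets "for share nowait" so the
// statement fails immediately if a row lock cannot be acquired.
// Identifier quoting and the real chunk range logic are omitted.
func buildRowCopyQuery(table, ghostTable string, columns []string, keyColumn string, isMySQL8 bool) string {
	cols := strings.Join(columns, ", ")
	lockClause := ""
	if isMySQL8 {
		lockClause = " for share nowait"
	}
	return fmt.Sprintf(
		"insert ignore into %s (%s) (select %s from %s where %s between ? and ?%s)",
		ghostTable, cols, cols, table, keyColumn, lockClause,
	)
}

func main() {
	// Example table and column names are made up for the demo.
	fmt.Println(buildRowCopyQuery("t", "_t_gho", []string{"id", "name"}, "id", true))
	fmt.Println(buildRowCopyQuery("t", "_t_gho", []string{"id", "name"}, "id", false))
}
```

When the locking clause cannot be satisfied, the statement returns an error straight away, and the caller is expected to pause and retry the chunk as described above.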