-
Notifications
You must be signed in to change notification settings - Fork 6.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix missing cases of corruption retries #13122
base: main
Are you sure you want to change the base?
Conversation
// is allocated with max_open_files - 10 as capacity. So override | ||
// max_open_files to 11 so table cache capacity will become 1. This will | ||
// prevent file open during DB open and force the file to be opened | ||
// during MultiGet |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a nice trick!
} | ||
if (s.ok()) { | ||
s = ValidityCheck(); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: if (s.ok()) { return s; }
after this validity check would be more readable to me then setting retry = false at the else in line 349. Or you could explicitly return on line 349.
s = ValidityCheck(); | ||
} | ||
if (!s.ok()) { | ||
if ((s.IsCorruption() || s.IsInvalidArgument()) && !retry && |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question for my own learning. In what scenario that we'd get s.IsInvalidArgument()
here and retry with kVerifyAndReconstructRead would succeed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any syntax errors during parsing are being considered as Status::InvalidArgument()
. The syntax error could be due to corruption, and can potentially be corrected by retrying.
db/db_io_failure_test.cc
Outdated
return s; | ||
} | ||
|
||
// This means the next read after injecting corruption was not |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wondering if the comment was cut
db/db_io_failure_test.cc
Outdated
ss << std::setw(3) << 100 * sst + key; | ||
ASSERT_OK(Put("key" + ss.str(), "val" + ss.str())); | ||
} | ||
Flush(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ASSERT_OK(Flush());
} | ||
Flush(); | ||
} | ||
Close(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ASSERT_OK(Close());
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like DBTestBase::Close()
does not return a Status.
@anand1976 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
This PR fixes a few cases where RocksDB was not retrying checksum failure/corruption of file reads with the
verify_and_reconstruct_read
IO option. After fixing these cases, we can almost always successfully open the DB and execute reads even if we see transient corruptions, provided theFileSystem
supports theverify_and_reconstruct_read
option. The specific cases fixed in this PR are -Test plan:
Unit test in
db_io_failure_test.cc
that injects corruption at various stages of DB open and reads