-
Notifications
You must be signed in to change notification settings - Fork 703
db: rework metalock into RWMutex #998
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
metalock is documented to protect meta page access, but its real functionality is broader and less obvious: * meta page itself is accessed in t.init(), but it's not changed, so an exclusive lock is not needed * db.opened is somewhat related, but it's changed in close() only which takes all the locks anyway * db.data is used close to this lock, but it's protected by mmaplock * db.freelist is in fact changed when metalock is taken and it has no synchronization of its own (notice that db.freelist pointer is changed in close() as well, so some synchronization is needed for it also) So we have a lock that tries to protect meta page (and does it after 249746f), but in fact it helps freelists more than the others. What we're trying to do in freelist is to keep track of open RO transaction IDs, maintaining a slice of them. Transaction IDs in their turn are not exactly IDs in that they do not identify transactions uniquely, rather it's an ID of the latest transaction that changed the state and at any given point all of the new readers use the same latest ID. The other important aspect of these IDs is that any RW transaction always uses a new ID that is incremented from the previous one, IDs only move forward (not considering uint64 overflow). At the very minimum this means that: * the most common case is when this list has exactly one unique ID which is the latest * occasionally we can have a previous one as well * only in rare cases of long-running read transactions we can have some set of older IDs * once an old ID is gone it'll never return since new transactions use the latest one Which in turn means that: * keeping a list of IDs wastes memory for nothing, we usually have a lot of duplicates there * what we need in fact is some sort of reference counting for a limited number of past IDs * no new IDs are possible unless there is an RW transaction Reference counting can be implemented with atomic variables, not requiring locks, the only problem is to always have an appropriate ID to count its users. This can be done via a separate RW-transaction-only API managing a list of ID-ref structures. It requires some protection from readers, but this can be tied to meta page update since RO transactions can't use new ID earlier than it appears there. Then RO transactions can rely on this list always having current ID that they can use, which means they can use RO lock for the list itself while doing changes to the counter. This removes the final exclusive lock for read-only transactions (with disabled statistics) and this drastically improves concurrent View() test results as well as real application behavior with many readers. Before: workers samples min avg 50% 80% 90% max 1 10 78.358µs 964.847µs 1.059159ms 1.073256ms 1.07551ms 1.07551ms 10 100 32.802µs 304.922µs 80.924µs 674.54µs 1.069298ms 1.220625ms 100 1000 30.758µs 304.541µs 64.192µs 397.094µs 1.101991ms 2.183302ms 1000 10000 30.558µs 1.05711ms 92.426µs 2.111896ms 3.317894ms 11.790014ms 10000 100000 30.548µs 10.98898ms 90.742µs 21.740659ms 33.020076ms 135.33094ms After: workers samples min avg 50% 80% 90% max 1 10 96.093µs 969.267µs 1.063618ms 1.066473ms 1.085629ms 1.085629ms 10 100 32.803µs 252.33µs 73.71µs 827.159µs 934.392µs 1.071142ms 100 1000 31.339µs 239.554µs 38.703µs 613.263µs 888.616µs 1.403793ms 1000 10000 30.518µs 195.822µs 41.538µs 101.031µs 360.474µs 4.075932ms 10000 100000 30.478µs 165.346µs 39.054µs 89.981µs 144.544µs 11.35032ms Signed-off-by: Roman Khimov <[email protected]>
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: roman-khimov The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Hi @roman-khimov. Thanks for your PR. I'm waiting for a etcd-io member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
I don't feel safe before we simplify the free list management. Please see #789 (comment) |
This patch works for us and we're using it in production. But I do understand your concerns and as noted above, I did break freelist management at some point of development without noticing it in any tests, so yeah, it's somewhat fragile. Take it as an idea then, I also had a thought that some freelist internals seem to be duplicating and maybe the structure presented here can be reused for other purposes as well. But the main thing for us was breaking this exclusive lock requirement, it changes a lot performance-wise. |
Make a complete (though temporary) fork to solve installation problems. This revision is based on Bolt 1.4.3 with patches applied from main (already merged) and etcd-io/bbolt#998. Fixes #3964. Signed-off-by: Roman Khimov <[email protected]>
1. It updates to BoltDB 1.4.3 effectively with some fixes (but requires additional testing now). 2. We can't solve nspcc-dev/neo-go#3964 without using a module of our own and we want to use the same BoltDB version in all projects. 3. This is supposed to be temporary, but no one knows how long etcd-io/bbolt#998 (or some alternative) will take. Signed-off-by: Roman Khimov <[email protected]>
1. It updates to BoltDB 1.4.3 effectively with some fixes (but requires additional testing now). 2. We can't solve nspcc-dev/neo-go#3964 without using a module of our own and we want to use the same BoltDB version in all projects. 3. This is supposed to be temporary, but no one knows how long etcd-io/bbolt#998 (or some alternative) will take. Signed-off-by: Roman Khimov <[email protected]>
Background
See #967 for motivation and test used. This is the version 2.5 of the patch, it differs significantly from the one presented in #967. The original version worked well in production, but:
db.freelist
access between reader threads anddb.close()
NoFreelistSync
optionSo version 2 was created in 0a70458 that was relying on the old locks only which I think makes it much simpler, but it had to extend freelist API a little. Turned out, this version was flawed even though it was all green on tests. In production it failed to release pages correctly leading to DB bloat and total disaster. Luckily, it wasn't hard to fix and now we have version 2.5 which is presented here. It passes all tests and also has been tested on real DBs, so it should be OK to use. Still, it's the most complex change from #967 series, so please review carefully.
Change itself
metalock is documented to protect meta page access, but its real functionality is broader and less obvious:
So we have a lock that tries to protect meta page (and does it after 249746f), but in fact it helps freelists more than the others. What we're trying to do in freelist is to keep track of open RO transaction IDs, maintaining a slice of them. Transaction IDs in their turn are not exactly IDs in that they do not identify transactions uniquely, rather it's an ID of the latest transaction that changed the state and at any given point all of the new readers use the same latest ID. The other important aspect of these IDs is that any RW transaction always uses a new ID that is incremented from the previous one, IDs only move forward (not considering uint64 overflow). At the very minimum this means that:
Which in turn means that:
Reference counting can be implemented with atomic variables, not requiring locks, the only problem is to always have an appropriate ID to count its users. This can be done via a separate RW-transaction-only API managing a list of ID-ref structures. It requires some protection from readers, but this can be tied to meta page update since RO transactions can't use new ID earlier than it appears there. Then RO transactions can rely on this list always having current ID that they can use, which means they can use RO lock for the list itself while doing changes to the counter.
This removes the final exclusive lock for read-only transactions (with disabled statistics) and this drastically improves concurrent View() test results as well as real application behavior with many readers. Before:
After: