
Conversation

roman-khimov
Contributor

Background

See #967 for motivation and the test used. This is version 2.5 of the patch; it differs significantly from the one presented in #967. The original version worked well in production, but:

  • it had a known minor issue (mentioned in the commit message) with DBs that had just started and had no RW transactions yet
  • it had a race with db.freelist access between reader threads and db.close()
  • it was later discovered to be incompatible with the NoFreelistSync option
  • it added new locks

So version 2 was created in 0a70458, relying on the old locks only, which I think makes it much simpler, although it had to extend the freelist API a little. As it turned out, this version was flawed even though all tests were green: in production it failed to release pages correctly, leading to DB bloat and total disaster. Luckily, it wasn't hard to fix, and the result is version 2.5, presented here. It passes all tests and has also been tested on real DBs, so it should be OK to use. Still, it's the most complex change in the #967 series, so please review carefully.

Change itself

metalock is documented to protect meta page access, but its real role is broader and less obvious (a rough sketch of the read path follows the list below):

  • the meta page itself is accessed in t.init(), but it's not changed, so an exclusive lock is not needed for it
  • db.opened is somewhat related, but it's only changed in close(), which takes all the locks anyway
  • db.data is used close to this lock, but it's protected by mmaplock
  • db.freelist is in fact changed while metalock is held, and it has no synchronization of its own (note that the db.freelist pointer is also changed in close(), so it needs some synchronization anyway)
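
For illustration, the points above roughly correspond to the following simplified, self-contained sketch of the read-only transaction setup. The stand-in types and the trackReaderTXID name are hypothetical, used only to show where each lock matters; this is not the literal bbolt source.

```go
package boltsketch

import (
	"errors"
	"sync"
)

// Minimal stand-ins for the real types, just enough to show the locking
// described above (hypothetical, heavily simplified).
type meta struct{ txid uint64 }

type freelist struct{}

// trackReaderTXID is a hypothetical name for the freelist bookkeeping that
// metalock really guards today.
func (f *freelist) trackReaderTXID(id uint64) {}

type Tx struct{ meta *meta }

// init reads the meta page; it never modifies it.
func (t *Tx) init(db *DB) { t.meta = db.meta }

type DB struct {
	metalock sync.Mutex
	mmaplock sync.RWMutex
	opened   bool
	meta     *meta
	freelist *freelist
}

var ErrDatabaseNotOpen = errors.New("database not open")

func (db *DB) beginTx() (*Tx, error) {
	db.metalock.Lock()  // exclusive, although the meta page is only read here
	db.mmaplock.RLock() // db.data is protected by this lock, not by metalock

	if !db.opened {
		db.mmaplock.RUnlock()
		db.metalock.Unlock()
		return nil, ErrDatabaseNotOpen
	}

	t := &Tx{}
	t.init(db)
	db.freelist.trackReaderTXID(t.meta.txid) // the part metalock actually protects

	db.metalock.Unlock()
	return t, nil
}
```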

So we have a lock that tries to protect the meta page (and does so after 249746f), but in fact it mostly serves the freelist. What we're trying to do in the freelist is to keep track of open RO transaction IDs by maintaining a slice of them. Transaction IDs, in turn, are not exactly IDs in that they do not identify transactions uniquely; rather, each is the ID of the latest transaction that changed the state, and at any given point all new readers use the same latest ID. The other important aspect of these IDs is that any RW transaction always uses a new ID incremented from the previous one, so IDs only move forward (uint64 overflow aside). At the very minimum this means that:

  • the most common case is a list with exactly one unique ID, the latest one
  • occasionally the previous one is present as well
  • only in the rare case of long-running read transactions do we have a set of older IDs
  • once an old ID is gone, it never returns, since new transactions use the latest one

Which in turn means that:

  • keeping a list of IDs wastes memory for nothing; it's usually full of duplicates
  • what we actually need is some sort of reference counting for a limited number of past IDs
  • no new IDs are possible unless there is an RW transaction

Reference counting can be implemented with atomic variables and doesn't require locks; the only problem is to always have an appropriate ID entry to count users against. This can be done via a separate RW-transaction-only API that manages a list of ID-ref structures. The list requires some protection from readers, but this can be tied to the meta page update, since RO transactions can't use a new ID before it appears there. RO transactions can then rely on this list always containing the current ID, which means a read lock on the list itself is enough while they update the counter.
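
To make this concrete, below is a minimal self-contained sketch of such an ID-ref list in Go. It illustrates the approach rather than reproducing the code in this PR: all names (idRefs, acquire, oldestInUse and so on) are made up for the example. The writer-side add() relies on the RW transaction's existing exclusivity, while readers only take the shared lock and touch an atomic counter.

```go
package txref

import (
	"sync"
	"sync/atomic"
)

// txid behaves as described above: it only moves forward, and all new
// readers share the latest value.
type txid uint64

// idRef pairs an ID with an atomic count of read-only transactions using it.
type idRef struct {
	id   txid
	refs atomic.Int64
}

// idRefs is the list of ID-ref structures. Only the single RW transaction
// appends to it, so readers never need the exclusive lock.
type idRefs struct {
	mu  sync.RWMutex
	ids []*idRef // oldest first; normally just one or two entries
}

// add runs on the RW-transaction-only path, before the new meta page becomes
// visible, so readers always find their current ID here.
func (l *idRefs) add(id txid) {
	l.mu.Lock()
	defer l.mu.Unlock()
	l.ids = append(l.ids, &idRef{id: id})
}

// acquire is the read-only path: a shared lock plus one atomic increment.
func (l *idRefs) acquire(id txid) *idRef {
	l.mu.RLock()
	defer l.mu.RUnlock()
	for i := len(l.ids) - 1; i >= 0; i-- {
		if l.ids[i].id == id {
			l.ids[i].refs.Add(1)
			return l.ids[i]
		}
	}
	return nil // unreachable as long as add() precedes meta publication
}

// release marks the end of a read-only transaction.
func (r *idRef) release() { r.refs.Add(-1) }

// oldestInUse returns the oldest ID still referenced (or the latest ID if
// nothing is referenced) and prunes fully released older entries; the
// freelist can use it to decide which pages are safe to reclaim.
func (l *idRefs) oldestInUse() txid {
	l.mu.Lock()
	defer l.mu.Unlock()
	if len(l.ids) == 0 {
		return 0
	}
	kept := l.ids[:0]
	for i, r := range l.ids {
		if r.refs.Load() > 0 || i == len(l.ids)-1 {
			kept = append(kept, r)
		}
	}
	l.ids = kept
	return l.ids[0].id
}
```

The usual reader-side pattern would be to acquire() the current meta txid when a View() starts and release() it when the transaction closes, with no exclusive lock taken anywhere on that path.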

This removes the final exclusive lock for read-only transactions (with statistics disabled), which drastically improves concurrent View() test results as well as real application behavior with many readers. Before:

workers samples min             avg             50%             80%             90%             max
1       10      78.358µs        964.847µs       1.059159ms      1.073256ms      1.07551ms       1.07551ms
10      100     32.802µs        304.922µs       80.924µs        674.54µs        1.069298ms      1.220625ms
100     1000    30.758µs        304.541µs       64.192µs        397.094µs       1.101991ms      2.183302ms
1000    10000   30.558µs        1.05711ms       92.426µs        2.111896ms      3.317894ms      11.790014ms
10000   100000  30.548µs        10.98898ms      90.742µs        21.740659ms     33.020076ms     135.33094ms

After:

workers samples min             avg             50%             80%             90%             max
1       10      96.093µs        969.267µs       1.063618ms      1.066473ms      1.085629ms      1.085629ms
10      100     32.803µs        252.33µs        73.71µs         827.159µs       934.392µs       1.071142ms
100     1000    31.339µs        239.554µs       38.703µs        613.263µs       888.616µs       1.403793ms
1000    10000   30.518µs        195.822µs       41.538µs        101.031µs       360.474µs       4.075932ms
10000   100000  30.478µs        165.346µs       39.054µs        89.981µs        144.544µs       11.35032ms

Signed-off-by: Roman Khimov <[email protected]>
@k8s-ci-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: roman-khimov
Once this PR has been reviewed and has the lgtm label, please assign ptabor for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot

Hi @roman-khimov. Thanks for your PR.

I'm waiting for an etcd-io member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@ahrtr
Member

ahrtr commented Jul 6, 2025

I don't feel safe about this before we simplify the freelist management. Please see #789 (comment)

@roman-khimov
Contributor Author

This patch works for us and we're using it in production. But I do understand your concerns, and as noted above, I did break freelist management at some point during development without any test noticing it, so yes, it's somewhat fragile. Take it as an idea then; I've also had the thought that some freelist internals seem to be duplicated, and maybe the structure presented here could be reused for other purposes as well. But the main thing for us was breaking this exclusive lock requirement; it changes a lot performance-wise.

roman-khimov added a commit to nspcc-dev/neo-go that referenced this pull request Sep 11, 2025
Make a complete (though temporary) fork to solve installation problems.
This revision is based on Bolt 1.4.3 with patches applied from main (already
merged) and etcd-io/bbolt#998.

Fixes #3964.

Signed-off-by: Roman Khimov <[email protected]>
roman-khimov added a commit to nspcc-dev/neofs-node that referenced this pull request Sep 11, 2025
1. It updates to BoltDB 1.4.3 effectively with some fixes (but requires
   additional testing now).
2. We can't solve nspcc-dev/neo-go#3964 without
   using a module of our own and we want to use the same BoltDB version
   in all projects.
3. This is supposed to be temporary, but no one knows how long etcd-io/bbolt#998
   (or some alternative) will take.

Signed-off-by: Roman Khimov <[email protected]>