
Conversation

roman-khimov
Contributor

Background

See #967 for motivation and the test used. This is version 2.5 of the patch; it differs significantly from the one presented in #967. The original version worked well in production, but:

  • it had a known minor issue (mentioned in the commit message) with DBs that had just started and had no RW transactions yet
  • it had a race with db.freelist access between reader threads and db.close()
  • it was later discovered to be incompatible with the NoFreelistSync option
  • it added new locks

So version 2 was created in 0a70458, relying on the old locks only, which I think makes it much simpler, although it had to extend the freelist API a little. As it turned out, this version was flawed even though all tests were green: in production it failed to release pages correctly, leading to DB bloat and total disaster. Luckily, it wasn't hard to fix, and the result is version 2.5, presented here. It passes all tests and has also been tested on real DBs, so it should be OK to use. Still, it's the most complex change in the #967 series, so please review carefully.

Change itself

metalock is documented to protect meta page access, but its real role is broader and less obvious (a rough sketch of the read path follows the list below):

  • the meta page itself is accessed in t.init(), but it's not changed, so an exclusive lock is not needed for it
  • db.opened is somewhat related, but it's only changed in close(), which takes all the locks anyway
  • db.data is used close to this lock, but it's protected by mmaplock
  • db.freelist is in fact changed while metalock is held, and it has no synchronization of its own (note that the db.freelist pointer is also changed in close(), so it needs some synchronization anyway)
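
For illustration, the points above roughly correspond to the following simplified, self-contained sketch of the read-only transaction setup. The stand-in types and the trackReaderTXID name are hypothetical, used only to show where each lock matters; this is not the literal bbolt source.

```go
package boltsketch

import (
	"errors"
	"sync"
)

// Minimal stand-ins for the real types, just enough to show the locking
// described above (hypothetical, heavily simplified).
type meta struct{ txid uint64 }

type freelist struct{}

// trackReaderTXID is a hypothetical name for the freelist bookkeeping that
// metalock really guards today.
func (f *freelist) trackReaderTXID(id uint64) {}

type Tx struct{ meta *meta }

// init reads the meta page; it never modifies it.
func (t *Tx) init(db *DB) { t.meta = db.meta }

type DB struct {
	metalock sync.Mutex
	mmaplock sync.RWMutex
	opened   bool
	meta     *meta
	freelist *freelist
}

var ErrDatabaseNotOpen = errors.New("database not open")

func (db *DB) beginTx() (*Tx, error) {
	db.metalock.Lock()  // exclusive, although the meta page is only read here
	db.mmaplock.RLock() // db.data is protected by this lock, not by metalock

	if !db.opened {
		db.mmaplock.RUnlock()
		db.metalock.Unlock()
		return nil, ErrDatabaseNotOpen
	}

	t := &Tx{}
	t.init(db)
	db.freelist.trackReaderTXID(t.meta.txid) // the part metalock actually protects

	db.metalock.Unlock()
	return t, nil
}
```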

So we have a lock that tries to protect the meta page (and does so after 249746f), but in fact it mostly serves the freelist. What we're trying to do in the freelist is to keep track of open RO transaction IDs by maintaining a slice of them. Transaction IDs, in turn, are not exactly IDs in that they do not identify transactions uniquely; rather, each is the ID of the latest transaction that changed the state, and at any given point all new readers use the same latest ID. The other important aspect of these IDs is that any RW transaction always uses a new ID incremented from the previous one, so IDs only move forward (uint64 overflow aside). At the very minimum this means that:

  • the most common case is a list with exactly one unique ID, the latest one
  • occasionally the previous one is present as well
  • only in the rare case of long-running read transactions do we have a set of older IDs
  • once an old ID is gone, it never returns, since new transactions use the latest one

Which in turn means that:

  • keeping a list of IDs wastes memory for nothing; it's usually full of duplicates
  • what we actually need is some sort of reference counting for a limited number of past IDs
  • no new IDs are possible unless there is an RW transaction

Reference counting can be implemented with atomic variables and doesn't require locks; the only problem is to always have an appropriate ID entry to count users against. This can be done via a separate RW-transaction-only API that manages a list of ID-ref structures. The list requires some protection from readers, but this can be tied to the meta page update, since RO transactions can't use a new ID before it appears there. RO transactions can then rely on this list always containing the current ID, which means a read lock on the list itself is enough while they update the counter.
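
To make this concrete, below is a minimal self-contained sketch of such an ID-ref list in Go. It illustrates the approach rather than reproducing the code in this PR: all names (idRefs, acquire, oldestInUse and so on) are made up for the example. The writer-side add() relies on the RW transaction's existing exclusivity, while readers only take the shared lock and touch an atomic counter.

```go
package txref

import (
	"sync"
	"sync/atomic"
)

// txid behaves as described above: it only moves forward, and all new
// readers share the latest value.
type txid uint64

// idRef pairs an ID with an atomic count of read-only transactions using it.
type idRef struct {
	id   txid
	refs atomic.Int64
}

// idRefs is the list of ID-ref structures. Only the single RW transaction
// appends to it, so readers never need the exclusive lock.
type idRefs struct {
	mu  sync.RWMutex
	ids []*idRef // oldest first; normally just one or two entries
}

// add runs on the RW-transaction-only path, before the new meta page becomes
// visible, so readers always find their current ID here.
func (l *idRefs) add(id txid) {
	l.mu.Lock()
	defer l.mu.Unlock()
	l.ids = append(l.ids, &idRef{id: id})
}

// acquire is the read-only path: a shared lock plus one atomic increment.
func (l *idRefs) acquire(id txid) *idRef {
	l.mu.RLock()
	defer l.mu.RUnlock()
	for i := len(l.ids) - 1; i >= 0; i-- {
		if l.ids[i].id == id {
			l.ids[i].refs.Add(1)
			return l.ids[i]
		}
	}
	return nil // unreachable as long as add() precedes meta publication
}

// release marks the end of a read-only transaction.
func (r *idRef) release() { r.refs.Add(-1) }

// oldestInUse returns the oldest ID still referenced (or the latest ID if
// nothing is referenced) and prunes fully released older entries; the
// freelist can use it to decide which pages are safe to reclaim.
func (l *idRefs) oldestInUse() txid {
	l.mu.Lock()
	defer l.mu.Unlock()
	if len(l.ids) == 0 {
		return 0
	}
	kept := l.ids[:0]
	for i, r := range l.ids {
		if r.refs.Load() > 0 || i == len(l.ids)-1 {
			kept = append(kept, r)
		}
	}
	l.ids = kept
	return l.ids[0].id
}
```

The usual reader-side pattern would be to acquire() the current meta txid when a View() starts and release() it when the transaction closes, with no exclusive lock taken anywhere on that path.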

This removes the final exclusive lock for read-only transactions (with statistics disabled), which drastically improves concurrent View() test results as well as real application behavior with many readers. Before:

workers samples min             avg             50%             80%             90%             max
1       10      78.358µs        964.847µs       1.059159ms      1.073256ms      1.07551ms       1.07551ms
10      100     32.802µs        304.922µs       80.924µs        674.54µs        1.069298ms      1.220625ms
100     1000    30.758µs        304.541µs       64.192µs        397.094µs       1.101991ms      2.183302ms
1000    10000   30.558µs        1.05711ms       92.426µs        2.111896ms      3.317894ms      11.790014ms
10000   100000  30.548µs        10.98898ms      90.742µs        21.740659ms     33.020076ms     135.33094ms

After:

workers samples min             avg             50%             80%             90%             max
1       10      96.093µs        969.267µs       1.063618ms      1.066473ms      1.085629ms      1.085629ms
10      100     32.803µs        252.33µs        73.71µs         827.159µs       934.392µs       1.071142ms
100     1000    31.339µs        239.554µs       38.703µs        613.263µs       888.616µs       1.403793ms
1000    10000   30.518µs        195.822µs       41.538µs        101.031µs       360.474µs       4.075932ms
10000   100000  30.478µs        165.346µs       39.054µs        89.981µs        144.544µs       11.35032ms

Signed-off-by: Roman Khimov <[email protected]>
@k8s-ci-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: roman-khimov
Once this PR has been reviewed and has the lgtm label, please assign ptabor for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot

Hi @roman-khimov. Thanks for your PR.

I'm waiting for an etcd-io member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@ahrtr
Member

ahrtr commented Jul 6, 2025

I don't feel safe about this before we simplify the freelist management. Please see #789 (comment)

@roman-khimov
Contributor Author

This patch works for us and we're using it in production. But I do understand your concerns, and as noted above, I did break freelist management at some point during development without any test noticing it, so yes, it's somewhat fragile. Take it as an idea then; I've also had the thought that some freelist internals seem to be duplicated, and maybe the structure presented here could be reused for other purposes as well. But the main thing for us was breaking this exclusive lock requirement; it changes a lot performance-wise.

roman-khimov added a commit to nspcc-dev/neo-go that referenced this pull request Sep 11, 2025
Make a complete (though temporary) fork to solve installation problems.
This revision is based on Bolt 1.4.3 with patches applied from main (already
merged) and etcd-io/bbolt#998.

Fixes #3964.

Signed-off-by: Roman Khimov <[email protected]>
roman-khimov added a commit to nspcc-dev/neofs-node that referenced this pull request Sep 11, 2025
1. It updates to BoltDB 1.4.3 effectively with some fixes (but requires
   additional testing now).
2. We can't solve nspcc-dev/neo-go#3964 without
   using a module of our own and we want to use the same BoltDB version
   in all projects.
3. This is supposed to be temporary, but no one knows how long etcd-io/bbolt#998
   (or some alternative) will take.

Signed-off-by: Roman Khimov <[email protected]>