Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

drivers/quota.Control: lock a mutex when called #1604

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

nalind
Copy link
Member

@nalind nalind commented May 18, 2023

The "quotas" map and "next project ID" field don't take well to being created and updated concurrently in multiple threads. Throw a lock around the whole object when any of its exported methods are called. Fixes #1394.

@rhatdan
Copy link
Member

rhatdan commented May 18, 2023

LGTM

Copy link
Collaborator

@mtrmac mtrmac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At the local object level LGTM.

(Internal users don’t need this because the internal user is the overlay driver, called under the layer store lock — but CRI-O is calling driver.ReadWriteDiskUsage.)


Oh, actually external callers have direct access to the graph driver via store.GraphDriver??! What does that mean for #1214 ? Do we need to re-add that, and add other synchronization on all the public methods of all drivers?

Does the ReadWriteDiskUsage need to care about concurrent layer updates / removals? I.e. should the actual locking happen at the ReadWriteDiskUsage API level, to synchronize with all other layer updates by the graph driver?

Or is the right decision to pretend that the public access to the graph driver mostly doesn’t exist, it’s the caller’s responsibility not to do that, and we treat ReadWriteDiskUsage as a special case highly-optimized for concurrency? If that’s the intent, I think it would be useful to have widespread comments pointing that special case out.

@mtrmac
Copy link
Collaborator

mtrmac commented May 18, 2023

(I appreciate that this fixes #1394 as is, and it’s better to have this PR than to have nothing. It’s just… a whole new can of worms I was not aware of.

Even if the individual disk usage computation was fine vs. concurrent updates/removals, and it would be fine to return such mostly-invalid data, I’m at least worried about things like Driver.dir not being aware of the need to correctly work outside of the usual locks.)

@nalind
Copy link
Member Author

nalind commented May 22, 2023

It's not exposed through the Store object. I think this is about library consumers calling into the quota package directly, which is something podman has apparently been doing for a while.

The "quotas" map and "next project ID" field don't take well to being
created and updated concurrently in multiple threads.  Throw a lock
around the whole object when any of its exported methods are called.

Signed-off-by: Nalin Dahyabhai <[email protected]>
@nalind nalind force-pushed the lock-quota-control branch from edb07d4 to 8db5cc4 Compare May 22, 2023 12:41
@nalind
Copy link
Member Author

nalind commented May 22, 2023

Rebased.

@mtrmac
Copy link
Collaborator

mtrmac commented May 22, 2023

The CRI-O backtrace in the original report points at https://github.com/cri-o/cri-o/blob/80228a5d6a4b26e9ed2363ca6a936f5fb604c607/internal/lib/stats/stats_server.go#L189-L197 , which uses a published method of Store:

GraphDriver() (drivers.Driver, error)
.

(The Podman user I can find is https://github.com/containers/podman/blob/b15510694b81929dbc107d354fdb0a83d0bfaa2a/libpod/runtime_volume_common.go#L181-L197 , but that’s a private single-use quota.Control, so I can’t see that that one requires locking.)

@nalind
Copy link
Member Author

nalind commented May 22, 2023

Yikes, I thought we'd dropped that somewhere along the way.

@mtrmac
Copy link
Collaborator

mtrmac commented May 22, 2023

WDYT about marking Store.GraphDriver deprecated?

That would require giving CRI-O a new Store API to get the container size, and that could lock via readAllLayerStores. And something to expose Driver.Metadata for Podman’s inspect functions.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Project Quota map is not thread-safe (concurrent map read and map write error)
3 participants