Add registry proxying section #66
See comments & questions.
Why do clients need to know anything about pull-through caching if it's implemented server-side?
The clients should know how the registry host was resolved from a given image reference. The clients don't care how the server is implemented, but they SHOULD provide information to the server indicating what reference is being asked for. Just as an HTTP client connecting through a PROXY server must communicate what the upstream server is, the same is true here. Today the protocol doesn't define any way to communicate what the upstream is, and proxies end up being hardcoded to a single upstream. In a few cases you can see proxies use custom domains per upstream and require users to change the name of their images in order to use them.
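To make the shape of that concrete, here is a minimal sketch (the hostnames, repository, and function name are placeholders of mine, not from the spec) of how a client could attach the proposed `ns` query parameter when pulling through a mirror:

```python
from urllib.parse import urlencode

def manifest_url(mirror: str, repository: str, reference: str, upstream: str) -> str:
    """Build a distribution-spec manifest URL against a mirror host,
    adding the proposed `ns` query parameter so the mirror knows which
    upstream registry the image reference was resolved from."""
    query = urlencode({"ns": upstream})
    return f"https://{mirror}/v2/{repository}/manifests/{reference}?{query}"

# A client that resolved "alpine:latest" to docker.io but is configured
# to pull through mirror.example.com (a placeholder host) would request:
print(manifest_url("mirror.example.com", "library/alpine", "latest", "docker.io"))
# https://mirror.example.com/v2/library/alpine/manifests/latest?ns=docker.io
```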
Right... isn't that the point? If I encode that "myregistry/mynamespace/myrepo" goes to "upstream/foo/bar", that's a detail for the maintainer of the registry. If the goal is to allow the client to specify "upstream/foo/bar", then I'd say that the target is not really a repository anymore, but simply a working proxy, and thus a different protocol parameter might be useful, but registries should therefore have the option to not support said parameter.
That is one use case that will still work. In the example you mentioned, when a repository is proxied in that fashion, the puller often does know of this detail as they must explicitly provide myregistry with the intent of getting some upstream content. The use case where myregistry is some sort of blessed version of upstream is reasonable, but not the intent of the namespace parameter here.
This is the use case here, and proxy may be better terminology, but that is really a detail of the registry. The registry may act as a proxy, proxy-cache, or active mirror; that is out of scope for definition here. This parameter just enables all of those features to work across multiple namespaces. For example, if you want public images from both
Or configure two repositories, one for each? (especially since combining them could lead to merge conflicts). I'm concerned we're adding quite a bit of complexity to address a use case that has simpler solutions when configured on the registry side.
Can you elaborate here? Configuring a repository for each mirror is non-trivial. Configuring a domain for each upstream and routing to the upstream based on the domain is not easier; that would still require the same routing on the server side that an implementation of this would require. The client-side implementation to support per-registry configuration is not simple and inherently requires catch-all conditions when trying to enforce proxying through a gateway. I did do a client-side implementation of this to demonstrate the feature and give server-side implementations a client to test against. On the client side, it is not complex at all, since clients should already know how to handle 404s when multiple registry endpoints are configured. On the server side, the complexity to support this isn't much more than existing proxy-cache support.
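For reference, per-registry client configuration in containerd looks roughly like the hosts.toml fragment below (the path and mirror hostname are examples); as I understand containerd's behavior, when it pulls through a host configured this way it appends the upstream registry as an `ns` query parameter, which is the client-side behavior being discussed:

```toml
# /etc/containerd/certs.d/docker.io/hosts.toml  (example path)
server = "https://registry-1.docker.io"

[host."https://mirror.example.com"]
  capabilities = ["pull", "resolve"]
```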
It's non-trivial, but it's not that difficult either :) My concern remains around complexity: the document as outlined, for example, says that the If we feel that pass-through proxying of other registries is, in and of itself, a feature of the protocol (rather than something configured on the registry side), then I suspect we need to give significantly more thought to the end-to-end user experience. For example, I could imagine some paths supporting proxying and others not.
The clients have the most context, and this really does not need to be defined here, only that a client SHOULD make that distinction to avoid sending unnecessary redundant information. The clients themselves have both the configuration and endpoint resolution logic, so they have multiple options for determining this. In the implementation I sent, I simply did this by checking whether the endpoint was configured without push support, as this could indicate the registry being communicated with may not be the upstream source. However, I will probably add a check there for
No, the registry can simply ignore it. This is like asking a registry today which was configured to mirror docker.io to return an error if the client actually meant quay.io; the registry just isn't expected to have the same amount of context as a client in regards to the intent of the entire pull process. If the registry chooses to handle the
They aren't expected to check for it, but rather be explicitly configured for it. A client will know if it is configured to always use a specific mirror or a mirror for multiple namespaces. I think what you are suggesting here, though, is the idea of registry discovery. That is a much larger topic that I would still love to see happen; with that feature a client could start with zero knowledge (except of course the domain quay.io, docker.io, etc.) and discover registry capabilities and endpoints.
Discussion ensues on the call today.
Pinging @thomasmckay and @kurtismullins who are implementing mirroring on Quay -- they probably have feedback and want to track this thread
Is `ns` already used, so best to continue with that mnemonic? Or could it be something that does not collide with the outdated concept that images would only be named "transport/namespace/name:tag"?
@vbatts I use
LGTM
I think this is a great addition. It might be a good idea to add a few combinatoric examples of how `ns` and the repo name are combined to calculate the upstream and local mirroring location.
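As a hedged illustration of what such combinatoric examples could look like (the function name, defaults, and hostnames are mine, not from the spec): a mirror receiving a repository path plus an `ns` value could derive both the upstream reference and a namespaced local cache location like this:

```python
from typing import Optional

def resolve(repository: str, ns: Optional[str], default_upstream: str = "docker.io") -> dict:
    """Illustrative only: combine the requested repository path with the
    `ns` query parameter to compute the upstream source and a local cache
    key. Namespacing the cache key by upstream avoids collisions when the
    same repository path exists on two different registries."""
    upstream = ns or default_upstream  # no ns: fall back to the mirror's configured default
    return {
        "upstream": f"{upstream}/{repository}",
        "cache_key": f"{upstream}/{repository}",
    }

print(resolve("library/alpine", "docker.io"))  # upstream docker.io/library/alpine
print(resolve("coreos/etcd", "quay.io"))       # upstream quay.io/coreos/etcd
print(resolve("library/alpine", None))         # falls back to the default upstream
```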
@dmcgowan one thing I'm unclear on here is: can I have a single registry mirror that will be usable for more than one remote registry (i.e. remote of
Notes from today's call:
Please let's find a way to classify this language (whether client or server side) so we can close out or merge this.
Reiterating our convo from the meeting: I actually see a lot of value in adding this query parameter, but removing any connotation that it is the blessed solution for repository mirroring. I think that by including this value, a proxy could implement lots of different behavior for the client that need not be directly related to repository mirroring.
I note that harbor uses the term "replication" rather than proxy-cache or mirroring, which I quite like: https://goharbor.io/docs/1.10/administration/configuring-replication/
+1
Operating https://registry.k8s.io we see a TON of requests that are obviously from pull-through caches/mirrors like Artifactory for images that we don't host. If this API were standardized, we could eventually start requesting that these tools switch to ns-aware pulls and stop spamming us with requests for images we don't provide. While not at the top of our concerns, this feels wasteful, generates a lot of log noise, and isn't free. The current state, where pull-through mirroring pretty much has to request all upstreams, feels bad. What concerns remain with this proposal?
I'd guess this is from users thinking they have set up a proxy, and Docker automatically falling back when the proxy pull fails. This is made worse by Docker only supporting a proxy to Docker Hub. The result is the proxy does nothing: never used to pull k8s images (since the client would pull directly) and never finding a match (since the proxy isn't pointing to Hub) for the Hub pulls.
Are there proxies doing this today? The few I've used are configured for a single upstream registry, at least for a given namespace or port. I worry we'll all agree this is a good change, but the key players in the space won't implement it. It would be good to see this being actively used by both clients and proxies. We also need to consider existing clients and proxies in various combinations. E.g. if a proxy for a single registry is working with older clients and gets upgraded to use the headers, those older clients still need to go to the original registry. Similarly, newer clients configured with an older proxy should not pull content from the wrong upstream registry, since that would expose them to attacks by dependency confusion and namespace squatters.
The issue where Docker only supports configuring a mirror for Docker Hub is also pretty unfortunate, but not what I'm referring to; the headers are for agents like Artifactory (and other similar tools). My understanding is that users set up a central catch-all pull-through cache and have no way to disambiguate which upstream the pulls were meant for, so we get pulls for ... a lot of images clearly not associated with us. This seems to be common with self-hosted options, versus something like ECR/AR remote repositories. It's understandable that adding a distinct IP / hostname for every proxied host is heavy-handed, and I don't think most users even think to use a port other than 5000.
AFAICT, yes, for example: https://jfrog.com/help/r/jfrog-artifactory-documentation/virtual-docker-repositories "This allows you to access images that are hosted locally on local Docker repositories, as well as remote images that are proxied by remote Docker repositories, and access all of them from a single URL defined for the virtual repository. Using virtual repositories can be very useful since users will continue to work with the virtual repository while the admin can manage the included repositories, replace the default deployment target and those changes will be transparent to the users. [...]"
This seems a bit circular? If we don't document the spec for this optional feature then most projects are not going to enable support (e.g. distribution/distribution#3864 (comment))
That's true, but a proxy can behave more intelligently when the namespace parameter is present, and over time more clients will have it; there are various references to waiting for this to be standard. The proxies can fall back to existing behavior, such as querying all upstreams, when the header is missing. Containerd alone represents a lot of clients that already have this header in place. Were it to become standard and proxies began supporting it, there's actually a pretty large client base sending it already, but there's resistance to supporting it more widely, referring back to waiting for this PR.
I'm not sure how you'd differentiate the scenarios from the k8s registry. It would just look like a bunch of failed image pulls to that registry with either scenario.
Ugh, someone should open a dependency confusion CVE against them if that's really what's happening.
We've always standardized things that were already implemented somewhere to show the feasibility first. You can't untag a standard, and implementations are effectively the CI for a standard. If the distribution project doesn't want to add it, or maintain a feature branch, then we'd need another project.
Registries falling back when the header isn't seen is probably easy enough. That needs to be to a single registry to avoid a dependency confusion attack. The fallback for new clients talking to older registries is less clear. The API needs to ensure new clients either know to fallback when talking to older proxies, or that the API doesn't exist on old proxies to avoid the security risk of sending requests to the wrong upstream registry.
With Docker updating that integration, the single proxy setting to Hub will hopefully stop being an issue in the future.
If you need an example of the

The other solution that I have seen used out in the wild is to add a header to the mirror configuration so that the receiving registry is aware of the original registry the image originates from. This is how Dragonfly solves the problem, for example, which means that they would not be able to support wildcard mirror configuration. https://d7y.io/docs/next/operations/integrations/container-runtime/containerd/#multiple-registries Right now every registry out there seems to come up with its own standard for how to deal with mirroring multiple namespaces. A common practice is to join the original image name reference with the mirror registry. For example
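That ad-hoc path-joining convention can be sketched like so (the hostname and function name are placeholders; this is the workaround being described, not the `ns` proposal):

```python
def joined_reference(mirror: str, upstream: str, repository: str, tag: str) -> str:
    """Illustrative only: embed the upstream registry host into the
    repository path under the mirror's own hostname, a common workaround
    when no `ns` parameter is available."""
    return f"{mirror}/{upstream}/{repository}:{tag}"

print(joined_reference("mirror.example.com", "docker.io", "library/alpine", "3.19"))
# mirror.example.com/docker.io/library/alpine:3.19
```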
Can this be used more generically? Not just for image pulls from a proxy, but also for tag listings, referrers, possibly even a push-through proxy? Is there a way users can directly reference content on a proxy with a namespace parameter, e.g. an added field in the image reference like
As a real-world example, CIRC, introduced in this KubeCon session, also uses the
Yeah, I think it's safe to say that ns has become a de facto standard despite this PR stalling out.
Define repository namespace query parameter for proxying. Signed-off-by: Derek McGowan <[email protected]>
If the registry ignores a
This is not a problem that needs to be solved here. Tag mutability, and figuring out what a tag points at when pulling from different registries or mirrors, is a known issue. If this matters to you, the correct solution is to use digests instead of tags. Beyond that, it doesn't matter where the content comes from. Allowing the client to make decisions based on where a registry reports that it is getting its content from violates separation of concerns. Pull a digest instead of a tag, and you're guaranteed to get what you want. Any hacks around "trusting" content because the server tells you it got it from a specific location are insecure by design.
I completely agree on using digests and signing for security, even when no proxies are involved. But we have a scenario where it's less the client trusting the proxy, and more the proxy telling the client "you asked for X and I'm choosing to give you Y instead". Or we could say proxies should not do that. Or we can keep the current language that says a proxy is within the spec to return different content than what the client requested, without any notification to the client. Of the three options, another header to give the client some feedback seems the least intrusive and most flexible for implementations.
I think making the client at all aware of where the request is being served from is problematic. Say I have local registry A acting as a pull-through cache that runs alongside my cluster. That is backed by an organization-level registry B with only approved images, which are populated from Docker Hub via a manual sync process that requires approval. If I ask for an image from Docker Hub from either of these registries, with

And why should the client care at all, if the only right way to ensure you get what you want is to use the content digest? The addressable content digest is the integrity check. Anything else is theater.
@sudo-bmitch I understand your point, but what is the logic you are considering there? From a client perspective: header is included and matches, good; header is not included, also good? Header is included and wrong, fail with what error? From the registry perspective: if it is supported and the registry has the content, return the header and content; if it is supported and the registry does not have the content, then 404. Either way, 404 is the way for a registry to indicate it understands the request and does not have the content. With proxies, it is more important for clients to understand the proxies they are communicating with and ensure that proxy is trusted. A trusted proxy, signed content, or content by digest are the only ways we should encourage these use cases.
I think different tools may have their own logic, but the version I'm considering is:
This could be user-configurable behavior, not unlike how TLS verification is configurable in most tools. The value add to me is that there are registries that merge content from multiple upstream sources into a single global namespace. If that registry is used as a proxy, a header would make it possible to detect that content is being returned from a potentially malicious squatter on a repository path that happens to be a different mirror than the expected upstream source. Either the registry wouldn't return a header, and tooling should assume it only proxies a single registry, or it should return a header indicating that the content came from a different upstream than expected. Having many different types of proxies, from the pull-through cache of a single registry, to a manual mirror of approved content, to a mash-up from multiple upstream sources, and proxies that understand and use the new
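A minimal sketch of the client-side policy being described, assuming a hypothetical OCI-Namespace response header (the exact header name and semantics are still under discussion in this thread; the function and its behavior are mine, not spec text):

```python
def verify_namespace(expected_upstream: str, response_headers: dict, strict: bool = False) -> bool:
    """Illustrative only: decide whether to accept proxied content based
    on a hypothetical OCI-Namespace response header.
      - header absent: assume a single-upstream proxy and accept,
        unless strict mode requires the header
      - header matches the requested upstream: accept
      - header differs: reject, the content came from another namespace"""
    returned = response_headers.get("OCI-Namespace")
    if returned is None:
        return not strict
    return returned == expected_upstream

print(verify_namespace("docker.io", {"OCI-Namespace": "docker.io"}))  # True
print(verify_namespace("docker.io", {"OCI-Namespace": "quay.io"}))    # False
print(verify_namespace("docker.io", {}, strict=True))                 # False
```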
I can only speak for how Spegel implements resolving tags with the ns parameter. The mirror registry should include the registry as part of the tag resolve process to avoid any name squatting. If a mirror merges multiple upstream registries, it should only resolve the tag if the full registry, repository, and tag match. That is at least how Spegel implements this. As for the trust aspect, I think the same rules apply as they do today. The responsibility is on the end user to use a registry that they trust. There is nothing stopping a bad-acting registry from returning whatever digest it likes to the client. It does not really matter if it is a mirror or not.
Signed-off-by: Brandon Mitchell <[email protected]>
Following up on the Thursday call discussion, I added a commit with the OCI-Namespace header. If that's blocking other maintainers from approving, I can split that into a separate PR.
Changes were addressed, followed by an LGTM comment.
Define repository namespace query parameter for proxying.
Closes #12
Giving time for registry operators to weigh in
Maintainer approval