
feat: support custom AWS S3 Endpoint #895

Open: wants to merge 4 commits into master
Conversation

aminya (Contributor) commented Oct 8, 2024

This PR allows setting the AWS S3 endpoint URL, which enables using S3-compatible services like Tigris or self-hosting the storage.

https://www.tigrisdata.com/docs/sdks/s3/aws-ruby-sdk/

ezekg (Member) commented Oct 8, 2024

Can you add some background here? What's the use case? Why is this valuable to you?

aminya (Contributor, Author) commented Oct 9, 2024

> Can you add some background here? What's the use case? Why is this valuable to you?

This allows using S3-compatible services like Tigris or self-hosting the storage.

https://www.tigrisdata.com/docs/sdks/s3/aws-ruby-sdk/

hrueger commented Nov 22, 2024

Hi all,
we're also very interested in this. We've tried this PR; however, we have the problem that it automatically prepends the bucket name as a subdomain: `storage.ourserver.com`, where we host a Minio instance, becomes `keygen-bucket.storage.ourserver.com`. For Minio it would have to be `storage.ourserver.com/keygen-bucket`, though. Any idea why that happens?

aminya (Contributor, Author) commented Nov 22, 2024

> We've tried this PR; however, we have the problem that it automatically prepends the bucket name as a subdomain: `storage.ourserver.com`, where we host a Minio instance, becomes `keygen-bucket.storage.ourserver.com`. For Minio it would have to be `storage.ourserver.com/keygen-bucket`, though.

It seems Minio doesn't follow the standard virtual-hosted-style addressing convention. It works with Tigris.

lukas-runge commented

> We've tried this PR; however, we have the problem that it automatically prepends the bucket name as a subdomain: `storage.ourserver.com`, where we host a Minio instance, becomes `keygen-bucket.storage.ourserver.com`. For Minio it would have to be `storage.ourserver.com/keygen-bucket`, though.

Using the flag `force_path_style: true` in the client config fixes this. @aminya, maybe add the option to set this flag via an env variable. 🙌
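For reference, here's a minimal sketch of what that flag does in an aws-sdk-s3 client config; the endpoint URL and bucket name are placeholders taken from the example above, not code from this PR:

```ruby
require 'aws-sdk-s3'

# With force_path_style: true the client addresses buckets as
#   https://storage.ourserver.com/keygen-bucket   (path-style)
# instead of
#   https://keygen-bucket.storage.ourserver.com   (virtual-hosted-style),
# which is what a self-hosted Minio expects.
s3 = Aws::S3::Client.new(
  access_key_id:     ENV['AWS_ACCESS_KEY_ID'],
  secret_access_key: ENV['AWS_SECRET_ACCESS_KEY'],
  region:            ENV['AWS_REGION'],
  endpoint:          'https://storage.ourserver.com',
  force_path_style:  true,
)

s3.list_objects_v2(bucket: 'keygen-bucket') # resolves to a path-style URL
```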

aminya (Contributor, Author) commented Nov 24, 2024

@lukas-runge Great, does this cover your use case?
f9d387c (#895)

lukas-runge commented

> @lukas-runge Great, does this cover your use case? f9d387c (#895)

Yes - just tested it. 👍

Ty @aminya! 🙏

ezekg (Member) left a comment

First off, I appreciate the PR, and the discussion from everybody here.

As-is this PR works, but the implementation is not ideal, because it doesn't account for existing users wanting to migrate from one storage provider to another, e.g. from S3 to Tigris. What I mean is that if an existing Keygen CE user with artifacts already stored in S3 wanted to move from S3 to Tigris, they couldn't do so safely without replicating their entire S3 bucket into Tigris, which can be very costly for large objects. Without that replication, existing artifacts would no longer be accessible: they would exist in S3 but not in Tigris, while the S3 backend now points at Tigris instead of S3 (which is confusing in itself, given there is already an R2 backend that is also S3-compatible). This also doesn't support scenarios where some accounts want to store artifacts in S3 while others are fine with Tigris; with this PR, S3 points to Tigris, breaking accounts that want S3.

In the case of Keygen Cloud, when we moved from S3 to R2, we did so seamlessly, without replicating TBs of data from S3 to R2. We were able to do this because existing artifacts remained in S3, while new artifacts were stored in R2. And this is still the case to this day. Some accounts are still on S3, per their internal requirements, while most are on R2. I want the same story for Keygen CE/EE users moving from S3 or R2 to Tigris or Minio.

So with that said —

Rather than introduce customization/overloads for the S3 storage backend, e.g. `endpoint` and `force_path_style` (a config that the user has to understand), I'd rather we introduce dedicated Tigris and Minio backends. The Minio client can set `force_path_style: true`, so that it works out of the box without the user needing to know about that config.

Both new backends could be configured via `TIGRIS_` and `MINIO_` env vars, just like `AWS_` and `CF_` are used to configure S3 and R2, respectively. The code is already designed to be extensible, so TIGRIS and MINIO, and even more storage providers, can be added in the future. Each backend would have its own specific environment variables for configuring the client so that it "Just Works."
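To make that concrete, here is a hypothetical sketch of what a dedicated Minio initializer might look like, mirroring the shape of the existing S3 config; every `MINIO_*` name below is illustrative, not a settled interface:

```ruby
# config/initializers/minio.rb (hypothetical sketch)
MINIO_ACCESS_KEY_ID     = ENV['MINIO_ACCESS_KEY_ID']
MINIO_SECRET_ACCESS_KEY = ENV['MINIO_SECRET_ACCESS_KEY']
MINIO_BUCKET            = ENV['MINIO_BUCKET']
MINIO_REGION            = ENV['MINIO_REGION']
MINIO_ENDPOINT          = ENV['MINIO_ENDPOINT']

MINIO_CLIENT_OPTIONS = {
  access_key_id:     MINIO_ACCESS_KEY_ID,
  secret_access_key: MINIO_SECRET_ACCESS_KEY,
  region:            MINIO_REGION,
  endpoint:          MINIO_ENDPOINT,
  force_path_style:  true, # baked in, so users never see this config
}.compact_blank
 .freeze
```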

In addition to the new Tigris and Minio backends, we should add a `KEYGEN_STORAGE_BACKEND` environment variable to set the default storage backend for new accounts, since there would be more than just the current S3 and R2 backends.

Migrating an existing account would be as simple as running:

```ruby
account = Account.sole

account.update!(backend: 'TIGRIS')
```

Existing artifacts, e.g. in R2, would still be accessible, since artifacts have their own backend, and new artifacts would go to Tigris. No replication needed!

Would you be open to adjusting the PR?

aminya (Contributor, Author) commented Nov 25, 2024

The idea of having separate classes for each provider sounds good at first glance; however, it doesn't match the original promise of these services. As you can see, Tigris simply needs the endpoint provided, and it uses the same AWS environment variables. When you set up a Fly.io Tigris machine, the exact environment variables I used in this code are set automatically by Fly. So if we change these names, it's going to make deployment harder, not easier.

https://www.tigrisdata.com/docs/sdks/s3/aws-ruby-sdk/

Another thing is that we can't simply find every S3-compatible provider in the world and provide wrappers for them. For example, this is my first time hearing about Minio.

ezekg left a comment that was later marked as outdated.

ezekg (Member) commented Nov 25, 2024

I understand that's how S3-compatible services work, but what I want to avoid here is env vars like `AWS_FORCE_PATH_STYLE_S3` and others like it. It's easier to add support for backends like Minio (which I'm also unfamiliar with) that have `force_path_style` baked in, so that the end-user doesn't have to think about it. Like I said, a proper backend is best, because although something may be 'S3-compatible', that doesn't mean it's compatible with the current client options we have for the S3 backend.

I'm fine with introducing a single `AWS_ENDPOINT` env var (you can try `AWS_ENDPOINT` first and then fall back to `AWS_ENDPOINT_URL_S3`), for general compatibility with S3-compatible object stores in the simplest case, but I don't want to introduce anything else configurable via the environment. And like I said, I'd love to see a PR adding a proper Tigris backend that users can migrate to and from, in a backwards-compatible way, just like the current S3 and R2 backends.

#895 (comment)

Quick follow up. I changed my mind.

After looking at what other open source projects do, I'm fine with introducing `AWS_ENDPOINT` (which can fall back to `AWS_ENDPOINT_URL_S3`) as well as an `AWS_FORCE_PATH_STYLE` environment variable:

```ruby
AWS_ACCESS_KEY_ID     = ENV['AWS_ACCESS_KEY_ID']
AWS_SECRET_ACCESS_KEY = ENV['AWS_SECRET_ACCESS_KEY']
AWS_BUCKET            = ENV['AWS_BUCKET']
AWS_REGION            = ENV['AWS_REGION']
AWS_ENDPOINT          = ENV['AWS_ENDPOINT'] || ENV['AWS_ENDPOINT_URL_S3']
AWS_FORCE_PATH_STYLE  = ENV['AWS_FORCE_PATH_STYLE'].in?(%w[true t 1])

AWS_CLIENT_OPTIONS = {
  access_key_id: AWS_ACCESS_KEY_ID,
  secret_access_key: AWS_SECRET_ACCESS_KEY,
  region: AWS_REGION,
  endpoint: AWS_ENDPOINT,
  force_path_style: AWS_FORCE_PATH_STYLE,
}.compact_blank
 .freeze
```

These seem to be the two most commonly used environment variables for configuring S3-compatible object stores. For users that want proper support for a Tigris/Minio/etc. backend, they can implement support in a separate PR.
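For illustration, constructing the client from that frozen options hash could look like the following sketch; the usage lines are assumptions, not code from this PR:

```ruby
require 'aws-sdk-s3'

# Because the options hash is compact_blank'd, unset env vars simply
# fall through to the SDK's defaults (e.g. the real AWS endpoint), and
# force_path_style: false is dropped as blank as well.
s3 = Aws::S3::Client.new(**AWS_CLIENT_OPTIONS)

# e.g. AWS_ENDPOINT=https://storage.example.com AWS_FORCE_PATH_STYLE=true
# yields object URLs like https://storage.example.com/<bucket>/<key>.
```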

Thanks for the discussion here, and for your patience. Would you be open to updating the PR, @aminya?
