config: support for multiple S3 locations #14

Open
ppanero opened this issue Jan 23, 2020 · 7 comments

@ppanero
Member

ppanero commented Jan 23, 2020

Currently only one bucket and one endpoint are supported. Some use cases require multiple buckets or endpoint URLs, so the configuration should be changed to allow that.

@wgresshoff wgresshoff self-assigned this Jan 24, 2020
@wgresshoff
Contributor

I'll start by looking into how the configuration works. Then I can estimate the time I need to commit.

@wgresshoff
Contributor

After looking into how invenio-s3 and invenio-files-rest are actually implemented, I'm coming to the conclusion that this is not that easy to implement. I see two targets to achieve:

  • it should be possible to configure one S3 endpoint per defined S3 location
  • there should be a default configuration for S3 locations without a configuration of their own

The problem I see is that there is no obvious relation between a location and the files configuration at all. That's surely not a problem if the location is just a directory somewhere in the file system, but to define an S3 endpoint you need at least the URL of the S3 server and a secret. The storage factory creates the storage class from the fileurl alone.

I think that should be refactored by adding at least the name of the location, so there is a key to distinguish the S3 configuration properties. An S3 endpoint URL per location would then be configured by the property invenio_s3.config.S3_ENDPOINT_URL.location_name (with the fallback invenio_s3.config.S3_ENDPOINT_URL if there is just one S3 endpoint). The other configuration options would work similarly.

@egabancho
Member

@ppanero @wgresshoff what about moving the endpoint to the location URI? (I probably thought about this at some point)
With something like s3://myserver.com/b1, the only open question I have is the default base URL: if one uses AWS S3 you don't actually need to specify the URL of the server, since it's already set internally by the boto3 library.

I don't really like the idea of adding a configuration variable to solve this because it'll add complexity to the location creation, which we want to avoid. Plus, I see this as the kind of thing that I would forget and then wonder for a day or two why my files are not in the right place 😂
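
A minimal sketch of the location-URI idea above, to make the open question concrete. The function name `split_location_uri`, the `https` scheme, and the rule "a URI with a path carries a host, a URI without one is a bare bucket" are all assumptions for illustration, not anything invenio-s3 defines. Note that this rule cannot tell `s3://myserver.com/b1` apart from a bucket-plus-key-prefix URI such as `s3://b1/prefix`, which is part of why the default base URL is an open question.

```python
from urllib.parse import urlsplit


def split_location_uri(uri):
    """Split a location URI into (endpoint_url, bucket).

    Hypothetical convention, assumed for this sketch only:
      s3://myserver.com/b1 -> endpoint "https://myserver.com", bucket "b1"
      s3://b1              -> no endpoint (library default), bucket "b1"
    """
    parts = urlsplit(uri)
    if parts.scheme != "s3":
        raise ValueError(f"not an s3 URI: {uri!r}")

    path = parts.path.strip("/")
    if path:
        # Host plus path: treat the first path segment as the bucket.
        bucket = path.split("/", 1)[0]
        return f"https://{parts.netloc}", bucket

    # Only a netloc: treat it as the bucket and leave the endpoint unset,
    # so the client library's built-in default would apply.
    return None, parts.netloc
```

Even with such a helper, the URI carries no credentials, so it could only cover the endpoint part of the problem.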

@wgresshoff
Contributor

@egabancho @ppanero I need to clarify my idea: I would like to add the location name to the parameter list of the storage factory, so the storage knows which configuration to use. The location itself would not be changed. The base URL without credentials won't help (there could be different credentials for the same server URL, with different storage prices).

@egabancho
Member

You are definitely right, a URL without credentials is not going to work.
You have probably already discussed this with @ppanero, but could you post an example here of what this configuration would look like? It'd really help me understand what you have in mind ☺️

@ppanero
Member Author

ppanero commented Feb 10, 2020

@egabancho we did not really sketch how it would look at that point. I thought of having something like what is done for ES, where you can specify multiple hosts, each with its own credentials.

However, I think we would end up with the same issue @wgresshoff is mentioning, and would need to change the files factory.
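
As a rough illustration of the "multiple hosts, each with its own credentials" shape, here is a sketch under stated assumptions. The setting names `S3_HOSTS` and `S3_DEFAULT_HOST`, the location names `loc_a` and `loc_b`, and all values are hypothetical. The inner keys are chosen to mirror the keyword arguments that `boto3.client` accepts (`endpoint_url`, `aws_access_key_id`, `aws_secret_access_key`).

```python
# Hypothetical settings: one entry per location name.
# loc_a and loc_b deliberately share an endpoint but differ in credentials.
S3_HOSTS = {
    "loc_a": {
        "endpoint_url": "https://s3.example.org",
        "aws_access_key_id": "key-a",
        "aws_secret_access_key": "secret-a",
    },
    "loc_b": {
        "endpoint_url": "https://s3.example.org",
        "aws_access_key_id": "key-b",
        "aws_secret_access_key": "secret-b",
    },
}

# Hypothetical fallback for locations without an entry of their own.
S3_DEFAULT_HOST = {
    "endpoint_url": "https://default.example.org",
    "aws_access_key_id": "key-default",
    "aws_secret_access_key": "secret-default",
}


def host_for_location(location_name, hosts=S3_HOSTS, default=S3_DEFAULT_HOST):
    """Return a copy of the settings for a location, or the default entry."""
    return dict(hosts.get(location_name, default))
```

The lookup still needs the location name, which is exactly the piece of information the files factory does not pass along today.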

@wgresshoff
Contributor

wgresshoff commented Mar 20, 2020

Sorry, I needed some time, but finally...
OK, the example: if there are two locations defined, say with the names amazon_aws and cephfs, the configuration would look like this:
S3_ENDPOINT_URL.amazon_aws = https://amazon.com
S3_ENDPOINT_URL.cephfs = https://ceph.com
S3_ACCESS_KEY_ID.amazon_aws = xyz
S3_ACCESS_KEY_ID.cephfs = abc
S3_SECRET_ACCESS_KEY.amazon_aws = sdsdsdsds
S3_SECRET_ACCESS_KEY.cephfs = abcabdafah

And finally there might be a default configuration as a fallback (forgetting it would surely lead to some nice errors):
S3_ENDPOINT_URL = https://default.com
S3_ACCESS_KEY_ID = ghz
S3_SECRET_ACCESS_KEY = lkhjsdafkjhfdskjh

So everywhere the config is consulted, the location_name needs to be known. This leads to some more code in invenio-s3, but a function to read the configuration is rather simple to implement (and is only needed in invenio-s3).
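
A minimal sketch of what such a read function could look like, assuming the configuration behaves like a plain mapping and that the per-location entries are stored under string keys of the form `NAME.location_name`. The function name `read_s3_config` is hypothetical and not part of invenio-s3.

```python
def read_s3_config(config, name, location_name=None):
    """Look up `name` for a location, falling back to the plain `name` key.

    `config` is any mapping; keys such as "S3_ENDPOINT_URL.cephfs" are an
    assumed storage format for the per-location values.
    """
    if location_name is not None:
        key = f"{name}.{location_name}"
        if key in config:
            return config[key]
    if name in config:
        return config[name]
    # Neither a per-location value nor a default is configured.
    raise KeyError(
        f"{name} is not configured for location {location_name!r} "
        "and no default is set"
    )


# Example values taken from the configuration listed above.
example_config = {
    "S3_ENDPOINT_URL.amazon_aws": "https://amazon.com",
    "S3_ENDPOINT_URL.cephfs": "https://ceph.com",
    "S3_ACCESS_KEY_ID.cephfs": "abc",
    "S3_ENDPOINT_URL": "https://default.com",
    "S3_ACCESS_KEY_ID": "ghz",
}
```

With this, `read_s3_config(example_config, "S3_ENDPOINT_URL", "cephfs")` resolves to the cephfs entry, while a location with no entry of its own gets the default. Raising an explicit error when both are missing makes the "forgotten default" case fail loudly instead of silently.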


3 participants