Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support HTTP, HTTPS schemes as well as whatever Go CDK supports for blob storage #1280

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

a7g4
Copy link

@a7g4 a7g4 commented Nov 29, 2024

Changelog

Support HTTP, HTTPS schemes as well as whatever Go CDK supports for blob storage (s3, gcs, azure).

Docs

https://mcap.dev/guides/cli#remote-file-support was updated to reflect that s3, azure and http(s) are supported.

Description

Currently the mcap CLI only supports remote files stored in Google Cloud Storage. This change allows users to operate on files stored in S3 (and S3 compatible services), Azure as well as plain HTTP(S).

I manually tested with some mcap files stored on S3, B2 (an S3 compatible service) and GCS.

This is basically a resurection of #957

I also removed some duplicated code between utils.GetReader and utils.WithReader.

Go isn't a language I am super familiar with so let me know if anything un-idiomatic.

@a7g4 a7g4 requested a review from james-rms as a code owner November 29, 2024 05:09
@a7g4 a7g4 force-pushed the blob_storage branch 2 times, most recently from 8c6a52a to b2ab8ad Compare November 30, 2024 04:32
@a7g4
Copy link
Author

a7g4 commented Nov 30, 2024

I realised the docs were part of this repo so I updated website/docs/guides/cli.md but it seems to have made one of the CI checks fail 🫤

@james-rms
Copy link
Collaborator

I tested this out on S3, it works with the auth credentials i have on my desktop.

It's a little sad that this doubles the binary size from 18-> 36MB.

@wkalt do you have any thoughts on this?

@wkalt
Copy link
Contributor

wkalt commented Dec 4, 2024

The binary size increase doesn't seem like a big issue to me. I haven't looked at the patch in depth but it would be useful to understand exactly what the UX is on Azure, S3, and GCS. Specifically, do they all work with CLI flags, or do some of them require environment vars and others not, etc. Looking at the previous patch, the Azure solution would require environment vars. I agree with the author of that patch that this is kind of confusing.

I don't know if that's a dealbreaker or not, but if it is also the case in this patch I think we'd need to,
a) decide if that seems OK
b) add documentation describing exactly how this can be used with Azure and others - more than just asserting support.

I'm +1 on expanding the remote file options but would be wary of inheriting Go CDK's default behavior, unless it aligns with what makes sense for the tool.

@a7g4
Copy link
Author

a7g4 commented Dec 4, 2024

I don't know if that's a dealbreaker or not, but if it is also the case in this patch I think we'd need to,
a) decide if that seems OK
b) add documentation describing exactly how this can be used with Azure and others - more than just asserting support.

Let me set up some buckets in S3, GCS and Azure to compare the CLI invocations.

I'm +1 on expanding the remote file options but would be wary of inheriting Go CDK's default behavior, unless it aligns with what makes sense for the tool.

Do you have a sense of what makes sense for the cli tools? As in, should I err on the side of automatically pull config from well known places (like ~/.aws/config) or explicitly include all config parameters?

FWIW, I only really care about HTTP access, so if adding just HTTP is an easier thing for this PR I can remove the S3/Azure stuff and go back to supporting just GCS (+ HTTP).

On the topic of HTTP, the dependency that I brought in to support HTTP range requests has some behaviour I don't like. It makes a series of 8 byte range requests as mcap.Reader reads the magic, then header etc. which is wasteful in terms of round trips so I am going to re-implement that with a minimum HTTP request size. Should I add that to this PR or a follow up?

@a7g4
Copy link
Author

a7g4 commented Dec 5, 2024

I think there's going to need to be some compromise between consistency in the mcap CLI invocation and consistency with other CLI tools for Azure (if we want to support Azure).

This is what I found doing some testing of various storage providers

GCS

gcloud init
gcloud auth login # needed for: gcloud storage ...
gcloud auth application-default login  # needed for mcap ... gs://...

Then CLI tools and mcap consistently use the credentials stored in ~/.config/gcloud (with the caveat that the gcloud CLI uses the default credentials while mcap uses the Go CDK's default of application-default - https://stackoverflow.com/a/53307505 )

# Authentication stuff is now stored in ~/.config/gcloud
mcap info gs://a7g4/test.mcap # Accessible to all authenticated accounts

This is consistent with other CLI tools which all pull the active config & credentials from ~/.config/gcloud. I don't think there's any way to use env variables to override either the gcloud commands or the CDK (which will use the application-default credentials).

S3/B2/R2/S3 compatible storage

Set up config and credentials with in ~/.aws/config and ~/.aws/credentials. eg.:

$ cat ~/.aws/config
[default]
region = us-east-005
output = jsony
services = b2

[services b2]
s3 =
    endpoint_url = https://s3.us-east-005.backblazeb2.com

Then CLI tools and mcap all consistently obey the [default] config:

aws s3 cp s3://a7g4-test/test.mcap .
mcap info s3://a7g4-test/test.mcap

The CLI tools and mcap also consistently obey ENV overrides:

AWS_ENDPOINT_URL_S3=https://s3.us-east-005.backblazeb2.com aws s3 cp s3://a7g4-test/test.mcap .
AWS_ENDPOINT_URL_S3=https://s3.us-east-005.backblazeb2.com mcap info s3://a7g4-test/test.mcap

This is consistent with other CLI tools which use s3:// schemas which all pull active config & credentials from ~/.aws/config and override them with environment variables

Azure

Other CLI tools which support azblob seem to support at least one of 3 different configuration methods:

  • Named parameters (eg. az storage blob download --account-name a7g4 --container-name test --name test.mcap)
  • Environment variables (eg. AZURE_STORAGE_ACCOUNT=a7g4 az storage blob download --container-name test --name test.mcap)
  • Query params (eg. pulumi login azblob://<container-path>?storage_account=account_name )
    Of these, only environment variables are consistent with other cloud providers so it makes sense to me to leave that as the way to specify Azure config to the mcap CLI. Alternatively, the query parameters would make the second most sense to me:
az login
AZURE_STORAGE_ACCOUNT=a7g4 mcap info azblob://test/test.mcap # This currently works
mcap info azblob://test/test.mcap?storage_account=a7g4 # This is an option

TL;DR

Interactions with Azure storage are very different to S3 and GCS. I propose either:

  1. Leaving the change as-is which would require users to supply Azure configuration in environment variables which matches some Azure CLI tools and is consistent with the way S3 handles configuration overrides
  2. Add the ability to pass through query parameters to the CDK URLOpener which also matches some Azure CLI tools

@wkalt
Copy link
Contributor

wkalt commented Dec 13, 2024

@a7g4 thanks for all that detail. I haven't tested anything but, from your description GCP and AWS both seem good to me.

I like option 1 between the two options in the TLDR, which agrees with your analysis. I think some docs updates are still needed to convey what you've described in the patch.

I also want to clarify I'm not a blocker for this PR, I was just pinged based on previous involvement.

}
default:
return closeReader, nil, fmt.Errorf("unsupported remote file scheme: %s", scheme)
func GetReader(ctx context.Context, uri string) (io.ReadSeekCloser, bool, error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it would be useful to add a docstring here describing the behavior of the new bool returned

@a7g4
Copy link
Author

a7g4 commented Dec 16, 2024

@a7g4 thanks for all that detail. I haven't tested anything but, from your description GCP and AWS both seem good to me.

I like option 1 between the two options in the TLDR, which agrees with your analysis. I think some docs updates are still needed to convey what you've described in the patch.

I also want to clarify I'm not a blocker for this PR, I was just pinged based on previous involvement.

I'll update the docs (probably with some variation of the test examples I have above) and I'll make the HTTP change I described above.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

3 participants