The S3 Input Source provides a seamless way to utilize data stored in S3 or any S3-compatible storage service as input for Bacalhau jobs. Users can specify files or entire prefixes stored in S3 buckets to be fetched and mounted directly into the task's execution environment. This capability ensures that your tasks have immediate access to the necessary data.
Here are the parameters that you can define for an S3 input source:
- Bucket
(string: <required>)
: The name of the S3 bucket where the data is stored. - Key
(string: <optional>)
: The object key or prefix within the bucket. Supports trailing wildcard for fetching multiple objects with matching prefixes. - Filter
(string: <optional>)
: A regex pattern to filter the objects to be fetched. If a Key is also provided as a prefix, the filter pattern will be applied to object keys after the prefix. - Region
(string: <optional>)
: The AWS region where the S3 bucket is hosted. - Endpoint
(string: <optional>)
: The endpoint URL of the S3 or S3-compatible service. - VersionID
(string: <optional>)
: The specific version of the object if versioning is enabled on the bucket. Only applicable when fetching a single object, and not a prefix or a pattern of objects. - ChecksumSHA256
(string: <optional>)
: The SHA-256 checksum of the object to ensure data integrity. Only applicable when fetching a single object, and not a prefix or a pattern of objects.
- Single Object: If the key points to a single object, that object is fetched and made available to the task. e.g.
s3://myBucket/dir/file-001.txt
- Prefix Matching: If the key ends with a slash (/), it's interpreted as a prefix, and all objects with keys that start with that prefix are fetched, mimicking the behavior of fetching all objects in a "directory". e.g.
s3://myBucket/dir/
- Wildcard: Supports a trailing wildcard (
*
). All objects with keys matching the prefix are fetched, facilitating batch processing or analysis of multiple files. e.g.s3://myBucket/dir/log-2023-09-*
When using the Bacalhau YAML configuration to define the S3 input source, you can employ the following declarative approach.
Below is an example of how to define an S3 input source in YAML format.
InputSources:
- Source:
Type: "s3"
Params:
Bucket: "my-bucket"
Key: "data/"
Endpoint: "https://s3.us-west-2.amazonaws.com"
ChecksumSHA256: "e3b0c44b542b..."
- Target: "/data"
When using the Bacalhau CLI to define the S3 input source, you can employ the following imperative approach. Below are example commands demonstrating how to define the S3 input source with various configurations:
-
Mount an S3 object to a specific path:
bacalhau docker run -i src=s3://bucket/key,dst=/my/input/path ubuntu ...
-
Mount an S3 object with a specific endpoint and region:
bacalhau docker run -i src=s3://bucket/key,dst=/my/input/path,opt=endpoint=http://s3.example.com,opt=region=us-east-1 ubuntu ...
-
Mount an S3 object using long flag names:
bacalhau docker run --input source=s3://bucket/key,destination=/my/input/path ubuntu ...
With these commands, you can seamlessly fetch and mount data from S3 into your task's execution environment directly through the CLI.
To support this storage provider, no extra dependencies are necessary. However, valid AWS credentials are essential to sign the requests. The storage provider employs the default credentials chain to retrieve credentials, primarily sourcing them from:
- Environment variables: AWS credentials can be specified using
AWS_ACCESS_KEY_ID
andAWS_SECRET_ACCESS_KEY
environment variables. - Credentials file: The credentials file typically located at
~/.aws/credentials
can also be used to fetch the necessary AWS credentials. - IAM Roles for Amazon EC2 Instances: If you're running your tasks within an Amazon EC2 instance, IAM roles can be utilized to provide the necessary permissions and credentials.
For a more detailed overview on AWS credential management and other ways to provide these credentials, please refer to the AWS official documentation on standardized credentials.
Compute nodes must run with the following policies to support S3 input source:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": "s3:ListBucket",
"Resource": "arn:aws:s3:::BUCKET_NAME"
},
{
"Effect": "Allow",
"Action": [
"s3:GetObject",
"s3:GetObjectVersion"
],
"Resource": "arn:aws:s3:::BUCKET_NAME/*"
}
]
}
- ListBucket Permission: The
s3:ListBucket
permission is necessary to list the objects within the specified S3 bucket, allowing prefixes and wildcard expressions as the S3 Key for fetching. - GetObject and GetObjectVersion Permissions: The
s3:GetObject
ands3:GetObjectVersion
permissions enable the fetching of object data and its versions, respectively. - Resource: The
Resource
field in the policy specifies the Amazon Resource Name (ARN) of the S3 bucket. The/*
suffix is necessary to allow fetching of all objects within the bucket or can be replaced with a prefix to limit the scope of the policy. You can also specify multiple resources in the policy to allow fetching from multiple buckets, or*
to allow fetching from all buckets in the account.
For more information on IAM policies specific to Amazon S3 buckets and users, please refer to the AWS documentation on Using IAM Policies with Amazon S3.
This feature isn't limited to AWS S3 - it supports all S3-compatible storage services. It means you can pull data from the likes of Google Cloud Storage and open-source solutions like MinIO, giving you the flexibility to utilize a diverse range of data sources.
To seamlessly integrate Google Cloud Storage with Bacalhau, follow these steps:
- Obtain HMAC Keys: To access Google Cloud Storage, you'll need HMAC (Hash-based Message Authentication Code) keys. Refer to the Google Cloud documentation for detailed instructions on creating a service account and generating HMAC keys.
- Provide HMAC Keys to Bacalhau: You can provide the HMAC keys to Bacalhau using the same options as AWS credentials, as documented in the Credential Requirements section.
- Configure the S3 Input Source: In your S3 input source configuration, set the endpoint for Google Cloud Storage to
https://storage.googleapis.com
, as shown in the example below:
InputSources:
- Source:
Type: "s3"
Params:
Bucket: "my-bucket"
Key: "data/"
Endpoint: "https://storage.googleapis.com"
- Target: "/data"