Initial implementation: allow sharing of items from buckets with versioning enabled #22

achtsnits · 2024-09-13T20:05:00Z

Bucket versioning tracks multiple versions of an item over time and retains deleted items, flagging them with a delete marker. The feature increases storage usage and does not fully address reproducibility challenges, because it operates only at the single-object level.

With bucket versioning enabled:

- Requesting a single object together with a version identifier retrieves that specific version.
- The standard bucket list operation with a prefix shows only the latest version of each object.
- An extended list operation shows all versions of all objects, including deleted ones. This makes it possible to go back in time, but doing so is cumbersome and unwieldy.
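To illustrate why going back in time through the extended listing is cumbersome, here is a minimal sketch that rebuilds the bucket state at a given moment. It assumes the input lists are shaped like the "Versions" and "DeleteMarkers" entries of an S3 ListObjectVersions response, and that the LastModified timestamps order the entries unambiguously. The function name `snapshot_at` is made up for this example.

```python
from datetime import datetime, timezone


def snapshot_at(versions, delete_markers, when):
    """Return {key: version_id} for the objects that existed at time `when`.

    Only the "Key", "VersionId" and "LastModified" fields of each entry are used.
    """
    # Merge both lists into one timeline, remembering which entries are markers.
    events = [(v["LastModified"], v["Key"], v["VersionId"], False) for v in versions]
    events += [(m["LastModified"], m["Key"], m["VersionId"], True) for m in delete_markers]

    state = {}
    for modified, key, version_id, is_marker in sorted(events, key=lambda e: e[0]):
        if modified > when:
            break  # everything after this point happened later than `when`
        if is_marker:
            state.pop(key, None)  # the object was deleted at this point
        else:
            state[key] = version_id  # this version became the current one
    return state
```

With boto3, the two input lists could be collected from the pages of the `list_object_versions` paginator. Every consumer of a shared prefix would have to repeat this replay, which is part of the motivation for an explicit index.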

Signoff:

- Document and explain the problem to be solved, along with the benefits and shortcomings.
- Implement a first prototype that allows sharing of versioned item(s).
- Outline, align on, and agree how the workspace should evolve, also with regard to "Stable listing of shared items" #27.

achtsnits · 2024-11-12T06:54:06Z

Bucket versioning, available in many object storage solutions such as AWS S3 and MinIO, enables users to track multiple versions of an item over time and flags deleted items with a delete marker. For example:

Imagine a file called 1.tif is uploaded to an object storage bucket. The first version of the file is saved with the name 1.tif and a version identifier v1. Later, the file is updated, and the new version is saved as 1.tif with a version identifier v2. If the file is deleted, it is flagged with a delete marker. If versioning is enabled, all these versions are retained, and users can access or restore any previous version of 1.tif (v1, v2, or the delete marker) using its respective version identifier.

A user can retrieve the latest version of 1.tif (v2, in this case) by simply requesting 1.tif. To get a specific earlier version, such as v1, they pass the version identifier v1 together with 1.tif in the same S3 GetObject request.
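The 1.tif walk-through above can be modelled in a few lines. This is a toy in-memory sketch, not an S3 client: the class `VersionedBucket` and its sequential `v1`, `v2` identifiers are assumptions made for illustration, and real object stores hand out opaque version IDs.

```python
from typing import Dict, List, Optional, Tuple


class VersionedBucket:
    """Toy in-memory model of a versioning-enabled bucket (not a real S3 client)."""

    def __init__(self) -> None:
        # key -> ordered list of (version_id, data); data is None for a delete marker
        self._history: Dict[str, List[Tuple[str, Optional[bytes]]]] = {}

    def put(self, key: str, data: Optional[bytes]) -> str:
        versions = self._history.setdefault(key, [])
        version_id = f"v{len(versions) + 1}"  # real stores hand out opaque IDs
        versions.append((version_id, data))
        return version_id

    def delete(self, key: str) -> str:
        # Deleting only appends a delete marker; older versions are retained.
        return self.put(key, None)

    def get(self, key: str, version_id: Optional[str] = None) -> bytes:
        versions = self._history.get(key, [])
        if version_id is None:
            candidates = versions[-1:]  # latest entry only
        else:
            candidates = [v for v in versions if v[0] == version_id]
        if not candidates or candidates[0][1] is None:
            raise KeyError(f"{key} (version {version_id or 'latest'}) not found")
        return candidates[0][1]


bucket = VersionedBucket()
bucket.put("1.tif", b"first upload")   # stored as v1
bucket.put("1.tif", b"updated file")   # stored as v2
latest = bucket.get("1.tif")           # latest version (v2)
pinned = bucket.get("1.tif", "v1")     # explicit earlier version
bucket.delete("1.tif")                 # adds a delete marker, keeps v1 and v2
```

After the delete, a plain `bucket.get("1.tif")` fails, while `bucket.get("1.tif", "v1")` still returns the first upload. That is the property the sharing prototype relies on.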

However, versioning presents several challenges for reproducibility and storage management:

- Higher costs with large data sets: versioning adds new versions without automatically removing old ones, so older versions accumulate and quickly increase storage costs. Manual cleanup is required to prevent storage bloat and control costs.
- No automatic expiration: old versions do not expire by default. Object storage solutions often provide "lifecycle policies" for retention and automated deletion, such as the following, but these are proprietary and work on a single-object level (see the next item for more details):
  - transition: move older versions to lower-cost storage (e.g., S3 Glacier or Infrequent Access) after a set period
  - expiration: automatically delete older versions after a defined time to manage storage bloat
  - delete markers: automatically remove delete markers after a set period, fully clearing deleted items
  - max number of object versions: set a limit on the number of versions retained for each object, deleting older versions once the limit is reached
- Complexity with filtering: while AWS offers prefix- and tag-based filtering to apply policies to groups of objects, this functionality is proprietary, complex, and limited to a single level of hierarchy. It does not support custom user-defined groupings (e.g., a specific user-curated set of shared items).
- Complex listing options: standard bucket listing commands only show the latest version of each object, making it difficult to access historical versions without special requests. The extended listing for versions is more complex and slower to process.
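As a rough illustration of the lifecycle options listed above, the following sketch builds a single prefix-filtered rule in the shape AWS S3 expects for a bucket lifecycle configuration. The helper name `build_lifecycle_configuration`, the rule ID, and the default numbers are assumptions made for this example. Other object stores may support only a subset of these fields.

```python
def build_lifecycle_configuration(prefix, transition_days=30, expire_days=365, keep_versions=3):
    """Build a lifecycle configuration dict with one rule scoped to `prefix`."""
    rule = {
        "ID": "noncurrent-version-cleanup",  # hypothetical rule name
        "Status": "Enabled",
        "Filter": {"Prefix": prefix},  # grouping is limited to prefix (or tags)
        # transition: move noncurrent versions to cheaper storage
        "NoncurrentVersionTransitions": [
            {"NoncurrentDays": transition_days, "StorageClass": "GLACIER"}
        ],
        # expiration + max versions: keep up to `keep_versions` noncurrent
        # versions, delete the rest once noncurrent for `expire_days` days
        "NoncurrentVersionExpiration": {
            "NoncurrentDays": expire_days,
            "NewerNoncurrentVersions": keep_versions,
        },
        # delete markers: remove markers that no longer have older versions
        "Expiration": {"ExpiredObjectDeleteMarker": True},
    }
    return {"Rules": [rule]}


config = build_lifecycle_configuration("shared/")
```

With boto3, such a dict would be passed as `LifecycleConfiguration=config` to `put_bucket_lifecycle_configuration`. Note that the rule can only be scoped by prefix or tags, so a user-curated set of shared items that spans several prefixes cannot be expressed this way.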

Despite these challenges, reproducibility requires not only an explicit index of all shared items for consistent object listings (see issue #27), but also a guarantee that shared objects remain unchanged. While checksums can detect changes on the client side, object versioning allows for proper management on the server side. Without versioning, it is the user's responsibility to prevent accidental overwriting of data stored under the same object key.

Initial Implementation concept

To ensure consistent object listings, we have already identified the need to create an index (represented through symbolic links on the workspace side) for a user-curated set of shared items. This set can be explicitly defined by the user, or it can automatically include all sub-items if the user points to a prefix (i.e., a folder). When generating the index, we check whether versioning is enabled on the "source" bucket. If it is, the version identifier is included in the index. This allows the specific object version to be retrieved when the index is resolved.
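The concept could be sketched roughly as follows. This is not the prototype itself: the index is shown as a list of plain dicts rather than symbolic links, the names `build_index` and `resolve` are hypothetical, and `client` is assumed to behave like a boto3 S3 client (`get_bucket_versioning`, `head_object`, `get_object`).

```python
def build_index(client, bucket, keys):
    """Return one index entry per key, pinning the version ID when available."""
    # "Status" is absent when versioning was never enabled on the bucket.
    status = client.get_bucket_versioning(Bucket=bucket).get("Status")
    versioned = status == "Enabled"

    index = []
    for key in keys:
        entry = {"bucket": bucket, "key": key}
        if versioned:
            # HEAD on the key reports the version ID of the current version.
            version_id = client.head_object(Bucket=bucket, Key=key).get("VersionId")
            if version_id:
                entry["version_id"] = version_id
        index.append(entry)
    return index


def resolve(client, entry):
    """Fetch the bytes an index entry points to, honouring a pinned version."""
    params = {"Bucket": entry["bucket"], "Key": entry["key"]}
    if "version_id" in entry:
        params["VersionId"] = entry["version_id"]
    return client.get_object(**params)["Body"].read()
```

Because the version identifier is recorded when the index is generated, a later overwrite of the same key in the source bucket does not change what `resolve` returns for an existing entry. For unversioned buckets the entry falls back to the plain key, and protecting the object remains the user's responsibility.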

achtsnits added BR73 Storage Management BR74 Storage Buckets Provisioning and Association BR80 Storage Identifier labels Sep 13, 2024

achtsnits mentioned this issue Sep 29, 2024

Stable listing of shared items #27

Open

achtsnits self-assigned this Nov 11, 2024
