proposal: have a list of live blocks in object storage #7710

Open
MichaHoffmann wants to merge 1 commit into base: main

Conversation

MichaHoffmann (Contributor):

  • I added CHANGELOG entry for this change.
  • Change is not relevant to the end user.

Changes

Verification

@MichaHoffmann changed the title from [DRAFt] proposal: have a list of live blocks in object storage to [DRAFT] proposal: have a list of live blocks in object storage on Sep 9, 2024
@MichaHoffmann force-pushed the mhoffmann/draft-for-live-blocks-list-in-object-storage branch 4 times, most recently from a73ba7c to c93235d on September 9, 2024 14:22

@saswatamcode (Member) left a comment:

Thanks, some comments!

@MichaHoffmann force-pushed the mhoffmann/draft-for-live-blocks-list-in-object-storage branch from c93235d to a884a0b on September 10, 2024 12:30
@MichaHoffmann changed the title from [DRAFT] proposal: have a list of live blocks in object storage to proposal: have a list of live blocks in object storage on Sep 10, 2024
@MichaHoffmann force-pushed the mhoffmann/draft-for-live-blocks-list-in-object-storage branch from a884a0b to 3329161 on September 11, 2024 16:40

## Why

Accessing blocks from object storage is generally orders of magnitude slower than accessing the same data from other sources that can serve it from disk or from memory. If the source that originally uploaded a block is still alive and still has access to it, the expensively obtained data from object storage gets thrown away during deduplication anyway. Note that this does not put additional pressure on the source components, since they get queried during fan-out regardless. As an example, imagine a Sidecar next to a Prometheus server with 3 months of retention and a Storage Gateway that is configured to serve the whole range. Right now we don't have a great way to deal with this dynamically; this proposal aims to address that.

Member:

There is somewhat of a way to deal with this through the --min-time and --max-time flags, but it is not ideal:

  • What if some Sidecar/Receiver fails to upload blocks? Receiver doesn't delete blocks until they are uploaded, so data might accidentally fall outside the configured min/max time range.
  • It almost requires all components (Receive/Sidecar/Ruler) to have equal retention for --max-time to work properly; otherwise, you might need multiple Thanos Store replicas with different --max-time values and selector labels.

We would like to improve this user experience. I would add this to the Why section.


### Solution

Each source utilizes the `shipper` component to upload blocks into object storage, so the `shipper` also has a complete picture of the blocks that the source owns. We extend the `shipper` to additionally maintain a register in object storage named `thanos/sources/<uuid>/live_blocks.json` that contains a plain list of the live blocks this source owns. The time it was last updated can be deduced by checking the object's attributes in object storage. When the storage gateway syncs its list of block metas, it can also iterate the `thanos/sources` directory and see which `live_blocks.json` files have been updated recently enough to assume that their sources are still alive. It can then build an index of live block IDs and prune them when handling Store API requests (Series, LabelValues, LabelNames RPCs). In theory this should not lead to gaps, since the pruned blocks are still owned by other live sources. Care needs to be taken to make sure that the blocks are within the `--min-time`/`--max-time` bounds of the source. The UUID of the source should be propagated in the `Info` gRPC call to the querier and through the `Series` gRPC call into the storage gateway; this enables us to prune only blocks whose sources are still alive and registered with the querier. Note that this is not a breaking change - it should only be an opt-in optimization.
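
To make the register concrete, here is a minimal sketch of the file shape and the gateway-side freshness check. Everything in it is illustrative rather than part of the proposal text: the `LiveBlocks` struct, `registerPath`, `sourceIsLive`, and the configurable liveness window are my assumptions.

```go
package liveblocks

import (
	"encoding/json"
	"fmt"
	"time"
)

// LiveBlocks is a hypothetical shape for thanos/sources/<uuid>/live_blocks.json:
// a plain list of the blocks the source currently owns.
type LiveBlocks struct {
	Blocks []string `json:"blocks"` // block ULIDs, as strings
}

// registerPath builds the object-storage key for a source's register.
func registerPath(sourceUUID string) string {
	return fmt.Sprintf("thanos/sources/%s/live_blocks.json", sourceUUID)
}

// encode serializes the register for upload by the shipper (or by a separate
// subsystem, as suggested in review below).
func encode(lb LiveBlocks) ([]byte, error) {
	return json.Marshal(lb)
}

// sourceIsLive decides whether a register is fresh enough to treat its source
// as still alive, based on the object's last-modified attribute and an assumed
// liveness window (for example, a few multiples of the shipper sync interval).
func sourceIsLive(lastModified, now time.Time, window time.Duration) bool {
	return now.Sub(lastModified) <= window
}
```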

Member:

I would suggest creating a separate subsystem/entity that would use the Shipper to upload thanos/sources/<uuid>/live_blocks.json instead of hooking the logic inside of the Shipper. The reason is that we have external projects using the Shipper.


1. How do we obtain a stable UUID to claim ownership of `thanos/sources/<uuid>/live_blocks.json`?

I propose that we extend Sidecar, Ruler and Receiver to read the file `./thanos-id` on startup. This file should contain a UUID that identifies the instance. If the file does not exist, we generate a random UUID and write it out, which gives us a reasonably stable UUID for the lifetime of this service.
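
As a rough illustration of that read-or-create behaviour (the function name `loadOrCreateID` and the use of github.com/google/uuid are my assumptions, not part of the proposal):

```go
package identity

import (
	"os"
	"strings"

	"github.com/google/uuid"
)

// loadOrCreateID returns a stable instance UUID from the given path
// (e.g. "./thanos-id"). If the file does not exist yet, it generates a random
// UUID, persists it, and returns it, so restarts keep the same identity as
// long as the file survives.
func loadOrCreateID(path string) (string, error) {
	if b, err := os.ReadFile(path); err == nil {
		return strings.TrimSpace(string(b)), nil
	} else if !os.IsNotExist(err) {
		return "", err
	}
	id := uuid.New().String()
	if err := os.WriteFile(path, []byte(id+"\n"), 0o644); err != nil {
		return "", err
	}
	return id, nil
}
```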

Member:

👍 this would also help us solve #6939. I propose adding this to Thanos Store too.

Contributor Author:

Yeah, let's just add it to all components!


3. What happens if a block was deleted due to retention in Prometheus but shipper has not uploaded a new `live_blocks.json` file yet?

We shall only filter blocks from the `live_blocks.json` list with a small buffer that depends on the last-updated timestamp. Since this list is essentially a snapshot of the blocks on disk, any block deleted because of retention is deleted after this timestamp. Any block whose range overlaps `[oldest_live_block_start, oldest_live_block_start + (now - last_updated)]` could have been deleted because of retention in the meantime, so it should not be pruned. Example: if the list was updated 1 hour ago we should not filter the oldest block from the list; if it was updated 3 hours ago we should not filter the oldest 2 blocks, and so on.
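
A minimal sketch of that pruning rule, assuming block time ranges in milliseconds as in Thanos block metadata; `blockMeta` and `prunable` are hypothetical names and this is not code from the proposal:

```go
package liveblocks

import "time"

// blockMeta is a minimal stand-in for a block's time range (milliseconds).
type blockMeta struct {
	ULID    string
	MinTime int64
	MaxTime int64
}

// prunable returns the subset of the live-blocks snapshot that is safe to
// prune from Store API responses. Blocks overlapping the first
// (now - lastUpdated) of the snapshot might already have been deleted by
// Prometheus retention since the snapshot was written, so they are kept.
func prunable(snapshot []blockMeta, lastUpdated, now time.Time) []blockMeta {
	if len(snapshot) == 0 {
		return nil
	}
	// Find the start of the oldest live block in the snapshot.
	oldestStart := snapshot[0].MinTime
	for _, b := range snapshot {
		if b.MinTime < oldestStart {
			oldestStart = b.MinTime
		}
	}
	// Extend the buffer by however stale the snapshot is.
	cutoff := oldestStart + now.Sub(lastUpdated).Milliseconds()

	var safe []blockMeta
	for _, b := range snapshot {
		// Do not prune blocks that overlap [oldestStart, cutoff].
		if b.MinTime <= cutoff && b.MaxTime >= oldestStart {
			continue
		}
		safe = append(safe, b)
	}
	return safe
}
```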

Member:

Should we include the timestamp inside of that file to avoid having to do an extra HEAD call? AFAICT that would be needed.

Contributor Author:

I thought maybe HEAD would be more accurate because of possible clock skew, but let's just write the timestamp into the file.
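
If the timestamp moves into the file as agreed, the register sketch from above could simply carry it as a field (field names remain illustrative, not from the proposal):

```go
package liveblocks

import "time"

// LiveBlocks with the update time embedded, so the storage gateway does not
// need an extra HEAD/attributes call per source.
type LiveBlocks struct {
	UpdatedAt time.Time `json:"updated_at"` // written with the source's clock
	Blocks    []string  `json:"blocks"`     // block ULIDs owned by the source
}
```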


* We tend to do tons of `Series` calls, and a Bloom filter for a decently sized bucket of 10k blocks could be enormous. Additionally, the live blocks don't really change often, so rebuilding the filter on every `Series` call seems unnecessarily expensive.

2. Shared Redis/Memcached instance

Member:

Probably also worth mentioning why we don't use a filter here and opt for a plain JSON format with no compression: risk of false positives, and UUIDs are essentially random so they don't compress well.
