@yuchen-db yuchen-db commented Mar 29, 2025

private PR: https://github.com/databricks-eng/universe/pull/999928
Add a birthstone to mark that a block has been completely uploaded to the cloud bucket. The bucket then looks like this:
[image: 429619281-df0f5366-6671-4109-9091-aee6364002a8]
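
As a rough illustration of the idea (not the code in this PR), a lister could tell complete blocks from partial ones using only the top-level entries returned by a delimited LIST. The `.birthstone` suffix, the trailing-slash directory convention, and the `classifyBlocks` helper are all hypothetical; the real marker name and layout are not stated here.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// birthstoneSuffix is a hypothetical naming convention for the marker object.
// The actual birthstone name used by the PR is not shown in this conversation.
const birthstoneSuffix = ".birthstone"

// classifyBlocks splits top-level bucket entries into active blocks
// (block directory plus birthstone) and partial blocks (directory only).
// Entries ending in "/" are assumed to be block directories.
func classifyBlocks(entries []string) (active, partial []string) {
	dirs := map[string]bool{}
	born := map[string]bool{}
	for _, e := range entries {
		switch {
		case strings.HasSuffix(e, birthstoneSuffix):
			born[strings.TrimSuffix(e, birthstoneSuffix)] = true
		case strings.HasSuffix(e, "/"):
			dirs[strings.TrimSuffix(e, "/")] = true
		}
	}
	for id := range dirs {
		if born[id] {
			active = append(active, id)
		} else {
			partial = append(partial, id)
		}
	}
	// Map iteration order is random, so sort for stable output.
	sort.Strings(active)
	sort.Strings(partial)
	return active, partial
}

func main() {
	entries := []string{
		"BLOCK_1/", "BLOCK_1.birthstone",
		"BLOCK_2/", // upload still in progress: no birthstone yet
		"BLOCK_3/", "BLOCK_3.birthstone",
	}
	active, partial := classifyBlocks(entries)
	fmt.Println("active:", active)
	fmt.Println("partial:", partial)
}
```

In this sketch, no per-block HEAD or recursive listing is needed: one flat listing of the bucket root carries both the block IDs and their completeness markers.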

The birthstone could help with metaSync, which runs every minute.
Comparing the number of API calls made by the different listers during syncMeta:

  1. RecursiveLister: total_number_of_blocks*files_per_block/item_per_page LIST
  2. ConcurrentLister: total_number_of_blocks/item_per_page LIST and total_number_of_blocks HEAD
  3. BirthstoneLister: 2*total_number_of_blocks/item_per_page LIST

item_per_page is a constant set by the cloud provider: the maximum number of items returned by a single LIST, usually 5000.

If we have 5000 blocks and 10 files per block (chunks, meta, index), the three numbers will be

  1. RecursiveLister: 10 LIST
  2. ConcurrentLister: 1 LIST and 5000 HEAD
  3. BirthstoneLister: 2 LIST
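
The arithmetic above can be checked with a small Go sketch. The function names and the `apiCalls` type are hypothetical and not part of the PR. The sketch also rounds up to whole pages, which is an assumption the formulas above do not spell out.

```go
package main

import "fmt"

// apiCalls holds the estimated number of LIST and HEAD requests
// for one sync pass.
type apiCalls struct {
	List int
	Head int
}

// ceilDiv returns ceil(a/b) for positive integers, since a partially
// filled page still costs one LIST request (assumption).
func ceilDiv(a, b int) int {
	return (a + b - 1) / b
}

// recursiveCalls: every file of every block shows up in the listing.
func recursiveCalls(blocks, filesPerBlock, itemsPerPage int) apiCalls {
	return apiCalls{List: ceilDiv(blocks*filesPerBlock, itemsPerPage)}
}

// concurrentCalls: one top-level listing plus one HEAD per block.
func concurrentCalls(blocks, itemsPerPage int) apiCalls {
	return apiCalls{List: ceilDiv(blocks, itemsPerPage), Head: blocks}
}

// birthstoneCalls: the top-level listing holds two entries per block
// (the block itself and its birthstone).
func birthstoneCalls(blocks, itemsPerPage int) apiCalls {
	return apiCalls{List: ceilDiv(2*blocks, itemsPerPage)}
}

func main() {
	const blocks, filesPerBlock, itemsPerPage = 5000, 10, 5000
	fmt.Printf("RecursiveLister:  %+v\n", recursiveCalls(blocks, filesPerBlock, itemsPerPage))
	fmt.Printf("ConcurrentLister: %+v\n", concurrentCalls(blocks, itemsPerPage))
	fmt.Printf("BirthstoneLister: %+v\n", birthstoneCalls(blocks, itemsPerPage))
}
```

With 5000 blocks, 10 files per block, and 5000 items per page, the three functions reproduce the counts listed above.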

@@ -356,6 +356,129 @@ func (f *ConcurrentLister) GetActiveAndPartialBlockIDs(ctx context.Context, ch c
return partialBlocks, nil
}

// ShadowLister lists block IDs cheaply and quickly by relying on shadow meta files.
type ShadowLister struct {
Collaborator

might enable this as a new strategy similar to recursive and concurrent

@@ -40,6 +40,8 @@ const (

// DebugMetas is a directory for debug meta files that happen in the past. Useful for debugging.
DebugMetas = "debug/metas"
// ShadowMetaDirname is the directory name for shadow meta files.
ShadowMetaDirname = "shadow-meta"
Collaborator

the other files are under 1 block folder, but my understanding is you have

/BLOCK_1/meta.json
/BLOCK_2/meta.json
/BLOCK_3/meta.json
/shadow-meta/ some meta?

@yuchen-db yuchen-db changed the title from "add shadow lister in fetcher package" to "Introduce birthstone and BirthstoneLister" Mar 31, 2025
@yuchen-db yuchen-db requested review from jnyi and hczhu-db April 2, 2025 18:14
@yuchen-db yuchen-db force-pushed the db_main branch 3 times, most recently from ca8c010 to 28db9f6 Compare June 5, 2025 01:18
@yuchen-db yuchen-db force-pushed the db_main branch 5 times, most recently from 64fbb3a to 5276dd1 Compare June 5, 2025 09:40