feat(chunkv5): Chunk V5 structure, encoding and decoding #14674

Draft
wants to merge 57 commits into
base: main
57 commits
1c1e6f8
Add a new chunk and block format
shantanualsi Oct 25, 2024
b436be9
Refactor to fetch CompressedBlock from source
shantanualsi Oct 25, 2024
9ad2d51
Add implementation for organised head format
shantanualsi Oct 25, 2024
d2b8993
Add iterator placeholders
shantanualsi Oct 28, 2024
4c5dee6
implement sample and entry iterators
shantanualsi Oct 28, 2024
402da98
Fix read timestamps
shantanualsi Oct 28, 2024
278d2ea
Fix test block
shantanualsi Oct 29, 2024
ad5cdff
Fix writing and reading chunks with the new block format
shantanualsi Oct 30, 2024
0b99c97
Add temporary skip for corrupt chunk test
shantanualsi Oct 30, 2024
dffc234
Skip failing test for now
shantanualsi Oct 30, 2024
14fbae3
format
shantanualsi Nov 4, 2024
865fe99
Skip failing checkpointing test for now
shantanualsi Nov 4, 2024
33b3f19
Add an evaluator mode to skip lines
shantanualsi Nov 7, 2024
1582785
Fix existing tests
shantanualsi Nov 7, 2024
bc2c21d
Support both v4 and v5 chunk formats in storage hack (data gen)
shantanualsi Nov 13, 2024
6428e9a
Add missing schema entries
shantanualsi Nov 13, 2024
8ba5765
A bit of additional optimization
shantanualsi Nov 13, 2024
43986b3
Add a new chunk and block format
shantanualsi Oct 25, 2024
840e0fb
Refactor to fetch CompressedBlock from source
shantanualsi Oct 25, 2024
3aa3d48
Add implementation for organised head format
shantanualsi Oct 25, 2024
93f6b39
Add iterator placeholders
shantanualsi Oct 28, 2024
0159953
implement sample and entry iterators
shantanualsi Oct 28, 2024
e76c2d1
Fix read timestamps
shantanualsi Oct 28, 2024
6bf6c34
Fix test block
shantanualsi Oct 29, 2024
d51082a
Fix writing and reading chunks with the new block format
shantanualsi Oct 30, 2024
e7b1d0e
Add temporary skip for corrupt chunk test
shantanualsi Oct 30, 2024
a552edd
Skip failing test for now
shantanualsi Oct 30, 2024
2249c44
format
shantanualsi Nov 4, 2024
ceebc86
Skip failing checkpointing test for now
shantanualsi Nov 4, 2024
e2c57c8
Fix lint
shantanualsi Nov 5, 2024
897d899
Add documentation
shantanualsi Nov 5, 2024
02e651d
Merge branch 'chunkv5-query' into chunkv5
shantanualsi Nov 13, 2024
21b4826
Populate stats while decoding
shantanualsi Nov 13, 2024
7459ab0
Add test to benchmark storage
shantanualsi Nov 13, 2024
0e97be3
Fix documentation
shantanualsi Nov 13, 2024
8309fff
Add initial benchmark results to doc
shantanualsi Nov 13, 2024
c3ee869
Improve doc
shantanualsi Nov 13, 2024
955e67c
Merge branch 'main' into chunkv5
shantanualsi Nov 13, 2024
a7d36c5
Add a micro benchmark
cyriltovena Nov 14, 2024
a72455a
Merge branch 'main' into chunkv5
shantanualsi Nov 18, 2024
42f33d8
Merge branch 'main' into chunkv5
shantanualsi Nov 22, 2024
eae46c0
Use no compression for structured metadata and timestamps
shantanualsi Nov 22, 2024
487691c
Merge branch 'main' into chunkv5
shantanualsi Nov 25, 2024
a7ce939
Simplify buffer usage in chunk v5
shantanualsi Nov 25, 2024
bcf3608
Merge branch 'main' into chunkv5
shantanualsi Nov 26, 2024
4da2348
Fix failing querier test
shantanualsi Nov 26, 2024
b4f438d
Move logic to query metrics to extractor
shantanualsi Nov 26, 2024
97af5f0
Remove unused functions
shantanualsi Nov 26, 2024
45764fd
Fix tests
shantanualsi Nov 26, 2024
5c14ef0
Simplify serializing block
shantanualsi Nov 26, 2024
2f43079
Merge branch 'main' into chunkv5
shantanualsi Nov 28, 2024
e8cd919
Merge branch 'main' into chunkv5
shantanualsi Dec 20, 2024
d266a7c
Merge branch 'main' into chunkv5
shantanualsi Jan 6, 2025
0c9949f
Merge branch 'main' into chunkv5
shantanualsi Jan 15, 2025
8ec0281
Merge branch 'main' into chunkv5
shantanualsi Jan 20, 2025
85e3eac
Merge branch 'main' into chunkv5
shantanualsi Jan 29, 2025
b0e73cf
Merge branch 'main' into chunkv5
shantanualsi Jan 30, 2025
144 changes: 144 additions & 0 deletions pkg/chunkenc/chunk-v5.md
@@ -0,0 +1,144 @@
# Organized Chunk Format Documentation (WIP)
## Overview

The organized head format (represented by the format version V5/ChunkFormatV5) is a new storage format that separates log lines, timestamps, and structured metadata into distinct sections within a compressed block to enable more efficient querying. This format aims to improve performance by organizing data in a way that minimizes unnecessary decompression when only specific fields are needed.

## Block Structure Diagram

```
┌─────────────────────────────────────────┐
│     Compressed Block (for Chunk V5)     │
├─────────────────────────────────────────┤
│ Log Lines Section                       │
│ ┌─────────────────────────────────────┐ │
│ │ Length                              │ │
│ │ Compressed Log Lines                │ │
│ │ Checksum                            │ │
│ └─────────────────────────────────────┘ │
│ Structured Metadata Section             │
│ ┌─────────────────────────────────────┐ │
│ │ Length                              │ │
│ │ Compressed Metadata Symbols         │ │
│ │ Checksum                            │ │
│ └─────────────────────────────────────┘ │
│ Timestamps Section                      │
│ ┌─────────────────────────────────────┐ │
│ │ Length                              │ │
│ │ Compressed Timestamps               │ │
│ │ Checksum                            │ │
│ └─────────────────────────────────────┘ │
│                                         │
│ Block Metadata Section                  │
│ ┌─────────────────────────────────────┐ │
│ │ Number of Blocks                    │ │
│ │ Block Entry Count                   │ │
│ │ Min/Max Timestamps                  │ │
│ │ Offsets & Sizes                     │ │
│ │ Checksum                            │ │
│ └─────────────────────────────────────┘ │
│                                         │
│ Section Offsets & Lengths               │
└─────────────────────────────────────────┘
```
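
The diagram shows each section as a length, a payload and a checksum. A minimal sketch of that framing follows. It assumes a uvarint length prefix and a big-endian CRC32 (Castagnoli) checksum over the payload; the document does not state either detail, and `appendSection` and `readSection` are hypothetical names, not functions from this PR.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
	"hash/crc32"
)

// castagnoli is the CRC32 table assumed for this sketch; the real format may use another checksum.
var castagnoli = crc32.MakeTable(crc32.Castagnoli)

// appendSection appends one framed section to dst: uvarint length | payload | 4-byte checksum.
func appendSection(dst, payload []byte) []byte {
	dst = binary.AppendUvarint(dst, uint64(len(payload)))
	dst = append(dst, payload...)
	return binary.BigEndian.AppendUint32(dst, crc32.Checksum(payload, castagnoli))
}

// readSection parses one framed section from the front of b, verifies its
// checksum, and returns the payload together with the bytes that follow it.
func readSection(b []byte) (payload, rest []byte, err error) {
	length, n := binary.Uvarint(b)
	if n <= 0 {
		return nil, nil, errors.New("invalid section length")
	}
	b = b[n:]
	if length > uint64(len(b)) || len(b)-int(length) < 4 {
		return nil, nil, errors.New("section truncated")
	}
	end := int(length)
	payload = b[:end]
	want := binary.BigEndian.Uint32(b[end : end+4])
	if crc32.Checksum(payload, castagnoli) != want {
		return nil, nil, errors.New("checksum mismatch")
	}
	return payload, b[end+4:], nil
}

func main() {
	block := appendSection(nil, []byte("log lines"))
	block = appendSection(block, []byte("structured metadata"))
	payload, rest, err := readSection(block)
	fmt.Printf("payload=%q remaining=%d err=%v\n", payload, len(rest), err)
}
```

Because each section carries its own length, a reader can skip a section it does not need without touching the payload.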

## Section Details

1. **Log Lines Section**
   - Contains the actual log message content
   - Each entry is prefixed with its length (varint encoded)
   - Compressed using the configured compression algorithm
   - Format: `len(line1) | line1 | len(line2) | line2 | ...`

2. **Structured Metadata Section**
   - Stores label key-value pairs using a symbol table
   - Each entry contains the count of symbol pairs followed by the pairs
   - Symbol pairs are stored as integer references to the symbol table
   - Format: `section_len | num_symbols | (symbol_ref_name, symbol_ref_value)*`

3. **Timestamps Section**
   - Contains entry timestamps in chronological order
   - Timestamps are varint encoded
   - Compressed independently of other sections
   - Format: `timestamp1 | timestamp2 | ...`
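
The log line and timestamp layouts above can be sketched as follows, before any compression is applied. This is an illustration under assumptions: line lengths use unsigned varints, timestamps are stored as absolute signed varints (the document does not say whether deltas are used), and all function names are hypothetical.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// encodeLines produces len(line1) | line1 | len(line2) | line2 | ...
func encodeLines(lines []string) []byte {
	var out []byte
	for _, l := range lines {
		out = binary.AppendUvarint(out, uint64(len(l)))
		out = append(out, l...)
	}
	return out
}

// decodeLines reverses encodeLines and rejects truncated input.
func decodeLines(b []byte) ([]string, error) {
	var lines []string
	for len(b) > 0 {
		n, w := binary.Uvarint(b)
		if w <= 0 || n > uint64(len(b)-w) {
			return nil, errors.New("corrupt log lines section")
		}
		b = b[w:]
		lines = append(lines, string(b[:int(n)]))
		b = b[int(n):]
	}
	return lines, nil
}

// encodeTimestamps produces timestamp1 | timestamp2 | ... as signed varints.
func encodeTimestamps(ts []int64) []byte {
	var out []byte
	for _, t := range ts {
		out = binary.AppendVarint(out, t)
	}
	return out
}

// decodeTimestamps reverses encodeTimestamps.
func decodeTimestamps(b []byte) ([]int64, error) {
	var ts []int64
	for len(b) > 0 {
		t, w := binary.Varint(b)
		if w <= 0 {
			return nil, errors.New("corrupt timestamps section")
		}
		ts = append(ts, t)
		b = b[w:]
	}
	return ts, nil
}

func main() {
	lines, err := decodeLines(encodeLines([]string{"GET /", "POST /login"}))
	fmt.Println(lines, err)
	ts, err := decodeTimestamps(encodeTimestamps([]int64{1730000000000000000, 1730000000000000001}))
	fmt.Println(ts, err)
}
```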

## Implementation Components

### Key Structures

```go
type organisedHeadBlock struct {
	unorderedHeadBlock
}
```

Extends the unordered head block with organized storage capabilities.

### Main Methods

1. **Serialization Methods**
```go
// Serializes log lines section
func (b *organisedHeadBlock) Serialise(pool compression.WriterPool) ([]byte, error)

// Serializes structured metadata section
func (b *organisedHeadBlock) serialiseStructuredMetadata(pool compression.WriterPool) ([]byte, error)

// Serializes timestamps section
func (b *organisedHeadBlock) serialiseTimestamps(pool compression.WriterPool) ([]byte, error)
```

2. **Iterator Implementation**
```go
type organizedBufferedIterator struct {
	// Separate readers for each section
	reader   io.Reader // for log lines
	smReader io.Reader // for structured metadata
	tsReader io.Reader // for timestamps
	// ... other fields
}
```
```
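
To illustrate how an iterator with one reader per section can advance them in lockstep, here is a minimal sketch. It is not the PR's `organizedBufferedIterator`: the `lockstepIterator` type and its helpers are hypothetical, it reads uncompressed sections, and it leaves out the structured metadata reader for brevity.

```go
package main

import (
	"bufio"
	"bytes"
	"encoding/binary"
	"errors"
	"fmt"
	"io"
)

// entry is one decoded log entry.
type entry struct {
	ts   int64
	line string
}

// lockstepIterator reads one timestamp and one line per step from separate readers.
type lockstepIterator struct {
	lines *bufio.Reader
	ts    *bufio.Reader
	cur   entry
	err   error
}

func newLockstepIterator(lines, ts io.Reader) *lockstepIterator {
	return &lockstepIterator{lines: bufio.NewReader(lines), ts: bufio.NewReader(ts)}
}

// Next advances both readers; it returns false at the end of the timestamps or on error.
func (it *lockstepIterator) Next() bool {
	if it.err != nil {
		return false
	}
	t, err := binary.ReadVarint(it.ts)
	if err != nil {
		if !errors.Is(err, io.EOF) {
			it.err = fmt.Errorf("reading timestamp: %w", err)
		}
		return false
	}
	n, err := binary.ReadUvarint(it.lines)
	if err != nil {
		it.err = fmt.Errorf("reading line length: %w", err)
		return false
	}
	buf := make([]byte, n) // a real decoder would bound n before allocating
	if _, err := io.ReadFull(it.lines, buf); err != nil {
		it.err = fmt.Errorf("reading line: %w", err)
		return false
	}
	it.cur = entry{ts: t, line: string(buf)}
	return true
}

func (it *lockstepIterator) At() entry  { return it.cur }
func (it *lockstepIterator) Err() error { return it.err }

// iterateAll drains an iterator built over two in-memory sections.
func iterateAll(lines, ts []byte) ([]entry, error) {
	it := newLockstepIterator(bytes.NewReader(lines), bytes.NewReader(ts))
	var out []entry
	for it.Next() {
		out = append(out, it.At())
	}
	return out, it.Err()
}

func main() {
	lines := []byte{1, 'a', 2, 'b', 'c'} // uvarint length followed by the line bytes
	ts := binary.AppendVarint(binary.AppendVarint(nil, 10), 20)
	entries, err := iterateAll(lines, ts)
	fmt.Println(entries, err)
}
```

Keeping the readers separate means a consumer that needs only one field could, in principle, leave the other readers unopened.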

## Query Plan Considerations

The organized format enables a potential optimization for queries that perform vector aggregation on structured metadata: such a query can read only the structured metadata section and avoid decompressing log lines and timestamps.
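
A minimal sketch of this selective decompression follows. It makes several simplifying assumptions: gzip from the standard library stands in for the configured compression, the metadata section is a newline-separated list with one value per entry rather than symbol references, and the `block` type with its `decompressedBytes` counter is hypothetical.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
	"strings"
)

// block holds independently compressed sections (timestamps omitted for brevity).
type block struct {
	lines             []byte
	metadata          []byte
	decompressedBytes int // total bytes produced by decompression so far
}

// compress gzips data; writes to a bytes.Buffer do not fail, so errors are ignored here.
func compress(data []byte) []byte {
	var buf bytes.Buffer
	w := gzip.NewWriter(&buf)
	w.Write(data)
	w.Close()
	return buf.Bytes()
}

func newBlock(lines, metadata []string) *block {
	return &block{
		lines:    compress([]byte(strings.Join(lines, "\n"))),
		metadata: compress([]byte(strings.Join(metadata, "\n"))),
	}
}

// read decompresses a single section and records how many bytes that produced.
func (b *block) read(section []byte) ([]byte, error) {
	r, err := gzip.NewReader(bytes.NewReader(section))
	if err != nil {
		return nil, err
	}
	defer r.Close()
	out, err := io.ReadAll(r)
	b.decompressedBytes += len(out)
	return out, err
}

// countByMetadata aggregates over metadata values without touching the lines section.
func (b *block) countByMetadata() (map[string]int, error) {
	raw, err := b.read(b.metadata)
	if err != nil {
		return nil, err
	}
	counts := map[string]int{}
	for _, v := range strings.Split(string(raw), "\n") {
		if v != "" {
			counts[v]++
		}
	}
	return counts, nil
}

func main() {
	b := newBlock(
		[]string{"a long log line", "another long log line"},
		[]string{"level=info", "level=error"},
	)
	counts, err := b.countByMetadata()
	fmt.Println(counts, b.decompressedBytes, err)
}
```

In this sketch the decompressed byte count grows only by the size of the metadata section, which mirrors the reduction in decompressed bytes the format is aiming for.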

## Implementation Notes

- The format maintains backwards compatibility with existing unordered head blocks
- Each section is independently compressed, allowing for section-specific optimization
- The symbol table approach in structured metadata reduces memory usage for repeated labels
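
The symbol table idea can be sketched as below: each distinct string is stored once, and entries hold integer references to it. The `symbolTable` and `symbolPair` types are hypothetical and written for illustration; they are not the types used in this PR.

```go
package main

import "fmt"

// symbolPair references a label name and value by their index in the symbol table.
type symbolPair struct {
	name, value uint32
}

// symbolTable stores each distinct string once.
type symbolTable struct {
	index   map[string]uint32
	symbols []string
}

func newSymbolTable() *symbolTable {
	return &symbolTable{index: map[string]uint32{}}
}

// add returns the reference for s, inserting it on first use.
func (st *symbolTable) add(s string) uint32 {
	if ref, ok := st.index[s]; ok {
		return ref
	}
	ref := uint32(len(st.symbols))
	st.symbols = append(st.symbols, s)
	st.index[s] = ref
	return ref
}

// lookup resolves a reference back to its string.
func (st *symbolTable) lookup(ref uint32) string {
	return st.symbols[ref]
}

// encode converts the name/value pairs of one entry into symbol references.
func (st *symbolTable) encode(labels [][2]string) []symbolPair {
	pairs := make([]symbolPair, 0, len(labels))
	for _, kv := range labels {
		pairs = append(pairs, symbolPair{name: st.add(kv[0]), value: st.add(kv[1])})
	}
	return pairs
}

func main() {
	st := newSymbolTable()
	first := st.encode([][2]string{{"level", "info"}, {"user", "alice"}})
	second := st.encode([][2]string{{"level", "info"}})
	// The second entry reuses the references created by the first one.
	fmt.Println(first, second, len(st.symbols))
}
```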

-----------------

# Iteration 1: Initial Benchmark Results and Analysis

## Key Findings So Far
1. **Positive Results**
   - Significant reduction in total decompressed bytes (as lines are not decompressed at all)
   - Successful selective decompression (no lines decompressed for sample queries)

2. **Areas for Improvement**
   - Execution time is 12% slower with Chunk V5. This definitely needs further optimization.
   - Higher memory usage (~1.6GB increase)
   - Increased allocation count (possibly due to the use of multiple buffers for timestamps, lines and metadata)

## Next Steps

- [ ] **Performance Optimization**
  - Profile + compare memory usage to identify causes of increased allocation
  - Optimize metadata section compression/decompression
  - Investigate potential buffer reuse strategies
  - Consider adding memory pools for common operations (?)

- [ ] **Query Path Enhancements**
  - Do not read lines when the vector aggregation is only on structured metadata. Still evaluating how to do this in code.

- [ ] **Testing**
  - Verify if st.stats are reported properly
  - Add more comprehensive benchmark scenarios
  - Test with various query patterns and load on dev environment
  - Measure impact of different compression settings
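
For the buffer reuse and memory pool items under Performance Optimization, one common Go approach is a `sync.Pool` of `bytes.Buffer` values. The sketch below only illustrates that general technique; `bufPool`, `withBuffer` and `copySection` are hypothetical names, and whether pooling helps here would need to be confirmed by profiling.

```go
package main

import (
	"bytes"
	"fmt"
	"sync"
)

// bufPool hands out reusable buffers so each section read does not allocate a new one.
var bufPool = sync.Pool{
	New: func() any { return new(bytes.Buffer) },
}

// withBuffer borrows a buffer, resets it, and returns it to the pool when fn is done.
func withBuffer(fn func(buf *bytes.Buffer)) {
	buf := bufPool.Get().(*bytes.Buffer)
	buf.Reset()
	defer bufPool.Put(buf)
	fn(buf)
}

// copySection stages a section in a pooled buffer and copies the result out
// before the buffer goes back to the pool.
func copySection(raw []byte) string {
	var out string
	withBuffer(func(buf *bytes.Buffer) {
		buf.Write(raw)
		out = buf.String() // String copies, so out stays valid after Put
	})
	return out
}

func main() {
	fmt.Println(copySection([]byte("timestamps")))
	fmt.Println(copySection([]byte("metadata")))
}
```

The important design point is that nothing may keep a reference to the pooled buffer's memory after it is returned, which is why the sketch copies the data out first.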

The initial results show promise in terms of data organization and selective access, but substantial further optimization is needed to address the performance overhead.