
Conversation

@corylanou
Collaborator

Summary

  • Adds PageHeaderFlagCompressedSize flag to indicate compressed size prefix follows page header
  • Encoder now writes 4-byte compressed size after each page header
  • Decoder handles both old (flag=0) and new formats for backward compatibility
  • Updates to lz4 v4.1.23

Problem

The lz4 library v4.1.23 added frame concatenation support per the LZ4 spec. After reading an LZ4 frame, the library now peeks ahead to check for another concatenated frame. This broke LTX because each page is an independent LZ4 frame with a PageHeader in between: when lz4 peeks, it reads the next PageHeader's bytes, sees an invalid LZ4 signature, and errors out.

Solution

Add a compressed size prefix to each page, allowing the decoder to create an exact LimitedReader that prevents lz4 from peeking beyond the frame boundary.

New format:

[PageHeader:6][CompressedSize:4][LZ4 Frame]

Old format (still supported for reading):

[PageHeader:6][LZ4 Frame]

The flag is in PageHeader.Flags (not Header.Flags) because pages are read individually in the VFS without easy access to the file header.

Test plan

  • All existing tests pass
  • Tests pass with lz4 v4.1.23
  • Backward compatibility: decoder handles both old and new formats

Fixes #70

🤖 Generated with Claude Code

The lz4 library v4.1.23 added frame concatenation support, which peeks ahead
after reading a frame to check for another concatenated frame. This broke LTX
because each page is an independent LZ4 frame with a PageHeader in between.

This change adds a new PageHeaderFlagCompressedSize flag and writes a 4-byte
compressed size prefix after each page header. The decoder uses this size to
create an exact LimitedReader, preventing lz4 from peeking into the next page.

For backward compatibility, the decoder handles both formats:
- New format (flag set): reads compressed size, uses exact LimitedReader
- Old format (flag=0): uses LimitedReader workaround with lz4 frame footer size

Fixes #70

Co-Authored-By: Claude Opus 4.5 <[email protected]>
@ncruces

ncruces commented Jan 12, 2026

This is fine, I guess, but I wonder at this point what is the frame format buying you except overhead?

@corylanou
Collaborator Author

Response to @ncruces's question

"This is fine, I guess, but I wonder at this point what is the frame format buying you except overhead?"

This is a valid point worth discussing. Here's the analysis:

What Frame Format Provides (at ~15-23 bytes overhead per page):

  1. Per-page content checksum (~4 bytes) - validates each page's integrity
  2. Magic/descriptor (~7-15 bytes) - self-describing format
  3. EndMark (4 bytes) - signals end of frame

What We Actually Need Now:

Since we're storing the compressed size prefix, we don't need:

  • EndMark (we know the exact size)
  • Magic bytes (we know it's LZ4 from context)

We're really just keeping frame format for the per-page checksum.

The Alternative: LZ4 Block Format

Current:  [PageHeader:6][Size:4][LZ4 Frame with ~15-27 byte overhead]
Block:    [PageHeader:6][Size:4][Raw compressed data]

Block format would:

  • Save ~15-23 bytes per page
  • Use simpler API (CompressBlock/UncompressBlock)
  • Have no frame concatenation issues at all
  • Lose per-page checksums (rely on file-level checksum instead)

The Tradeoff

Aspect              PR #72 (Frame + Size)   Block Format
Overhead per page   ~19-31 bytes            4 bytes
Per-page checksum   Yes (LZ4)               No
Code complexity     Medium                  Low

For a 4KB page, frame overhead is ~0.5%. For 1KB pages, ~2%.

Question for @benbjohnson

Is the per-page LZ4 checksum valuable enough to keep? LTX already has:

  • File checksum (CRC64 of entire file)
  • Post-apply checksum (rolling checksum of database state)

If file-level checksums are sufficient, we could switch to block format and eliminate the frame overhead entirely. PR #72's compressed size prefix would make that migration straightforward.

Options:

  1. Keep PR #72 ("Add compressed size prefix to page headers for lz4 v4.1.23 compatibility") as-is - Conservative, preserves per-page checksums, can migrate later
  2. Switch to block format now - Cleaner long-term, more code changes, loses per-page checksums

@ncruces

ncruces commented Jan 12, 2026

A page checksum might be useful because the VFS reads single pages.

OTOH, uncompressed pages won't have a checksum either (unless you put them inside an uncompressed lz4 frame), so the concern is kinda orthogonal.

@benbjohnson
Collaborator

I don't think a checksum buys us much. Are we able to drop the LZ4 frame? I assumed it was needed by the LZ4 library when it reads.

@ncruces

ncruces commented Jan 12, 2026

Most of these compression algorithms are layered. There's a block compression layer, then a frame layer.

The block compression works for smallish data of known size. To de/compress one block, you must know how many bytes go in, and more or less provide enough buffer for how many bytes come out. You can think of it as working on arrays/slices/buffers.

The frame layer works on top: it can support much more data, of a priori unknown length, and adds headers and trailers with checksums, mostly to support streaming. It works by buffering and chunking data and passing it to the block compression layer.

lz4 uses buffers starting at 64K, so compressing single pages independently can be easily achieved with just the block layer.

But this is a file format change, so you should make an informed decision, and not go with “random guy on the internet.”

@benbjohnson
Collaborator

@ncruces The explanation makes sense, thanks. I don't dig into the low level parts of compression libraries but using a single block makes sense given that SQLite pages can't be more than 64KB.

@corylanou
Collaborator Author

Superseded by PR #73 which uses LZ4 block format instead of frame format

@corylanou corylanou closed this Jan 15, 2026
Linked issue: pierrec/lz4 v4.1.23 breaks ltx