Add dictionary-in-stream format in contrib #2349

yotann · 2020-10-11T21:36:08Z

Adds a format in contrib/dict_in_stream that stores a dictionary, optionally compressed, in a skippable frame at the beginning of a stream. This format is useful to reduce the size of streams that have many frames. For instance, Wget-AT is using this format to compress WARC files using one frame per record and allow random access.

cc @JustAnotherArchivist

facebook-github-bot · 2020-10-11T21:36:19Z

Hi @yotann!

Thank you for your pull request and welcome to our community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file.

In order for us to review and merge your code, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

If you have received this in error or have any questions, please contact us at [email protected]. Thanks!

facebook-github-bot · 2020-10-11T22:18:57Z

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Facebook open source project. Thanks!

Cyan4973 · 2020-10-13T05:31:59Z

Thanks @yotann ,
this looks like a very interesting contribution.
It's also great to learn that this format is already implemented and in use.

Adding dictionary support to the seekable format is something we are definitely interested in.
This is probably the place where we will focus most of our attention,
to ensure it can become a base for future developments.

All in all, a very interesting idea worth reviewing.

klauspost · 2021-06-21T08:42:36Z

contrib/dict_in_stream/zstd_dict_in_stream_format.md

+a normal Zstandard stream. All compressed frames in the stream must have been
+compressed using the dictionary.


No dictionary should also be a valid option for frames.

I'm not sure about this. Part of me wants to keep it simple, as an implementation that requires every frame to use the same dictionary may be easier and cleaner than potentially switching between dict and no-dict. On the other hand, other specifications (such as the one for the .warc.zst files produced by Wget-AT) can always impose additional restrictions. And I don't see any obvious harm from allowing dictionary-less frames other than that it might lead to the immediate follow-up 'why not allow multiple dictionaries in a stream?', which – while there may well be realistic applications of it – seems like feature creep to me at this point.

@JustAnotherArchivist Dictionaries already have an ID and Frames should use that ID if they want to. It doesn't really extend any features, it just registers the dictionaries and frames with their ID can pick it up.

That way you can add several dictionaries, for example for different file types and the encoder can use them as it sees fit. While the improvement will be marginal it gives some flexibility.

@JustAnotherArchivist

Following compressed frames in the stream may have been
compressed using the dictionary.

A bit clunky wording. But allows for more than one dictionary per stream. Maybe add:

If a dictionary with the same ID has previously been registered, this replaces it for the following frames.

klauspost · 2021-06-21T09:14:37Z

This is a good start, but IMO it needs a bit more to be widely useful.

For me this would be an index that contains information about the start of each frame and where it is located in the stream.

This will have the stream seekable without having to read all data, and allow reading an arbitrary frame with only a few disk seeks.

What I could see is something like this...

[seekable stream indication with zero or more dictionaries]
[zstd frames]
[frame index]
For each non-skippable frame:
	* Absolute offset of beginning of frame
	* Compressed size of frame
	* Uncompressed Size
[index pointer]
	* Absolute offset of frame index, 8 bytes
[EOF]

Instead of adding only one dictionary, we instead upfront state that this stream is seekable. This could be a skippable frame as proposed, with some additional magic in the start to indicate that this is a seekable stream.

We allow specifying 0 or even several dictionaries up front. We keep them here so they can be read without seeking.

At the end a mapping of uncompressed position -> file offset is created for each frame. The frame must start at the absolute offset and the content must match "Uncompressed Size" - this effectively acts as "Frame Content Size" of the frame.

The frame must be able to decode using "Uncompressed Size" bytes from the specified offset. This can be used for HTTP range requests, etc.

Finally a frame with the absolute offset of the index is added. With a skippable frame this should be a fixed size, so it can always be read by reading X bytes back from the EOF.

This will allow appending data, by simply adding the new index with the added frames, and the existing index will simply be skipped.

Decoding a "frame" can then be done by 1) reading from the start, parse the dictionaries 2) Seek to the end, read last 64KB for example 3) check we have read enough by checking the index pointer, decode it 4) Seek to the frame we want.

So at most 4, likely 3 seeks to decode at specific content offset, while maintaining fully stream-able writes. When it is impossible to seek, you will have to decode all frames just like the proposal, but you still get one or more dictionaries.

I am tempted to allow some custom content for each frame, though that brings us rather close to a ZIP functionality. Maybe adding the option for custom, skippable data per frame would be nice.

yotann · 2021-09-15T00:35:44Z

@klauspost There's actually a seekable format defined already: https://github.com/facebook/zstd/blob/dev/contrib/seekable_format/zstd_seekable_compression_format.md

I think this PR's dictionary-in-stream format is already compatible with that seekable format, although I'd need to check to make sure. (And it's certainly very important to make sure both formats are compatible.)

In general, I think it should be possible to use the dictionary-in-stream format independently from the seekable format, because there are a lot of variations on the seekable format that could be useful for different situations:

A linear lookup table in the Zstandard stream (like the existing format linked above)
A Btree-based lookup table in the Zstandard stream, to reduce the number of blocks read from disk
A custom table in a separate file that includes other information about the entries in the stream (like Wget-AT uses)
Lookup tables stored in some sort of separate database (small pieces of data go directly in the database; large pieces go in a Zstandard stream with only byte offsets stored in the database)

All these situations can still use the same dictionary-in-stream format. And a tool that understands the dictionary-in-stream format can read all the data in the stream even if it doesn't understand the seek table; it just won't be able to seek efficiently.

JustAnotherArchivist · 2021-09-15T02:25:37Z

I fully agree with that. They're both useful concepts, and there may be cases where both at the same time may be desired, but they're ultimately solving unrelated problems, and I feel it's best to keep them separate.

Regarding the compatibility of the two formats: I don't see any reason why they wouldn't be compatible. The two frames are essentially opaque to each other, and they're both standard skippable frames. The seek table would have to include the dict-in-stream frame as it would any other frame (or rather as any other skippable frame, with Decompressed_Size = 0), and the dictionary would not be relevant for the seek table frame since that frame isn't compressed. They're on the opposite ends of a stream, so there's no confusion over which would come first, and their magic numbers even differ as well.

The 'which comes first' bit made me think about future-proofing though. Perhaps this format should be amended to permit any number of other skippable frames before the dict-in-stream frame. Since skippable frames aren't compressed (unless their User_Data is explicitly defined as such, as with the potentially compressed dict in here), the dict wouldn't be relevant for them. And it would avoid conflict with future formats that also define skippable frames at the beginning of a stream.

klauspost · 2021-09-15T09:12:07Z

@yotann @JustAnotherArchivist Yeah, forgot about the already defined seekable format..

The dictionary data frames are relevant for future frames. First of all the index should show if there are embedded dictionaries. That can be done using one of the 'Reserved' bits. This will also make it clear to previous versions that it cannot decompress the output.

The main problem then is that it doesn't show if a skippable frame is or is not a dictionary. I would hate to force the reader to read all skippable frames, potentially skipping all over the file.

As I see it, we can do it in, without major breakage:

A) Add magic values for CRC, e.g. 0xffffffff. This will never be true since that is not the size 0 CRC. @Cyan4973 should be able to confirm.

B) Force dictionaries to be before any non-skippable frame.

C) Add a bit field byte to each Seek_Table_Entries in the definition.

I am in favor of A) since it will allow concatenating streams. It is on the limit of being too hacky, but the "Reserved Bit" can make it clear that it is incompatible.

C) is the "correct" solution, but requires parsing code to be rewritten significantly and adds too much complexity to the format IMO. I mainly just put it in here for completion.

tomberek · 2021-09-24T05:58:35Z

I'm a fan of (B) and (C)

(A): I'm looking at a use-case where we can do frame deduplication: https://github.com/facebook/zstd/pull/2737/files#diff-50e7772e96927d183da7ec7cac6a9f4e34f7e33ed00ca518611d0c52e0af4350R103
and magic CRC values would be awkward.

(B): Retain the invariant that a stream processor can pre-load everything as needed. So the dictionary would only be needed before the frames that require it. This should still allow concatenation.

(C): If the seek table index exposed the Dictionary_ID then a client that already had that dictionary wouldn't need to fetch or process it.

yotann · 2021-09-30T03:20:05Z

A) Add magic values for CRC, e.g. 0xffffffff. This will never be true since that is not the size 0 CRC.

We wouldn't need a magic value; all skippable frames will have Decompressed_Size=0, so we can repurpose the CRC field just for those frames. But then the new field would only be present when the CRC field is enabled, and there wouldn't be any room to distinguish between dictionary frames and other skippable frames, which is too awkward IMHO.

(C): If the seek table index exposed the Dictionary_ID then a client that already had that dictionary wouldn't need to fetch or process it.

That's a good idea. If a file has lots of dictionaries, searching for the right one could be expensive. 128 dictionaries of 8MiB each would mean seeking to 128 different places in the first 1GiB of the file. (Not sure if HDD seek time is relevant anymore, but HTTP request latency certainly can be.) In that case, it would be important to include Dictionary_IDs in the seek table.

This is what I think the most general design could look like:

Dictionary frames can be anywhere in the file
Each dictionary frame must precede any non-skippable frames that use it
If multiple dictionaries with the same Dictionary_ID are seen while reading the file sequentially, the most recent one takes precedence.
Dictionary frames are included in the seek table with Decompressed_Size=0, just like other skippable frames.
The seek table includes additional information about dictionary frames, using one of these ideas:
1. Each seek table entry has an additional 4-byte field giving its Dictionary ID. If Decompressed_Size=0, it marks a dictionary frame with the given ID; otherwise, this is a non-skippable frame and the ID indicates which dictionary is needed to decompress it.
2. Each seek table entry has an additional 4-byte field, which is the index (in the seek table) of the correct dictionary frame needed to decompress the current frame. -1 if no dictionary is used.

This is definitely overkill for our original use case. OTOH, it covers every use case I can think of with a reasonable amount of complexity.

tomberek · 2021-09-30T13:03:55Z

@yotann for (C), a slight modification would be to utilize the Seek_Table_Descriptor bitfield to indicate the presence or lack thereof of the dictionary entries. I suspect dictionary-in-stream will not be used all the time (or even most of the time), so it should be opt-in. But dictionaries are central enough to zstd to devote a reserved bit (or two, the same way the dict ID length is encoded in the Dictionary_ID_flag in the Frame_Header_Descriptor ) in that field to indicate their presence.

klauspost · 2021-10-05T13:15:25Z

I propose the following change to the seek table information:

Seek_Table_Descriptor

A bitfield describing the format of the seek table.

Bit number	Field name
7	`Checksum_Flag`
6	`Stream_Dictionaries`
5-2	`Reserved_Bits`
1-0	`Unused_Bits`

While only Checksum_Flag and Stream_Dictionaries currently exists, there are 6 other bits in this field that can be used for future changes to the format.

[...]

Stream_Dictionaries

When the stream dictionaries is set one or more dictionaries are present in the stream.

Dictionaries can be required to decode frames following the dictionary.

Enabling this adds 4 bytes to every seek table entry.

`Seek_Table_Entries`

Seek_Table_Entries consists of Number_Of_Frames (one for each frame in the data, not including the seek table frame) entries of the following form, in sequence:

`Compressed_Size`	`Decompressed_Size`	`[Checksum]`	`[Dictionary_ID]`
4 bytes	4 bytes	4 bytes	4 bytes

[...]

Dictionary_ID

Only present if Stream_Dictionaries is set in the Seek_Table_Descriptor.

This will contain the dictionary ID contained or used in the frame.

If this is a skippable frame (Decompressed_Size == 0) and Dictionary_ID is non-zero the skippable frame will contain the dictionary with the specified ID. If there are several entries with the same Dictionary_ID, the last entry before the frame that is to be decoded should be used.

If this is a non-skippable frame, this indicates the dictionary id required to decode the frame. If value is 0 no dictionary is required.

This will allow to maintain backwards compatibility and allow decompressing with the least number of seeks.

klauspost · 2021-10-07T08:44:36Z

contrib/dict_in_stream/zstd_dict_in_stream_format.md

+This document defines a format for including a Zstandard dictionary inside a
+compressed Zstandard stream. When combined with the


Suggested change

This document defines a format for including a Zstandard dictionary inside a

compressed Zstandard stream. When combined with the

This document defines a format for including Zstandard dictionaries inside a

compressed Zstandard stream. When combined with the

klauspost · 2021-10-07T08:51:29Z

contrib/dict_in_stream/zstd_dict_in_stream_format.md

+
+__`Compressed_or_Uncompressed_Dictionary`__
+
+The dictionary data, which may optionally be compressed.


Just to make it crystal clear:

Suggested change

The dictionary data, which may optionally be compressed.

The dictionary data, which may optionally be compressed.

This __must__ start with either little-endian 0xEC30A437 or 0xFD2FB528, otherwise is should be considered a regular skippable block.

yotann added 4 commits October 11, 2020 13:59

Describe Dictionary-in-Stream format

01cfd80

Add dictionary-in-stream header

5f01132

Add dictionary-in-stream compression code

f524167

Add dictionary-in-stream decompression code

c6121a5

yotann2 mentioned this pull request Oct 11, 2020

Amending annex D to promote zstd instead of gzip iipc/warc-specifications#53

Closed

facebook-github-bot added the CLA Signed label Oct 11, 2020

yotann added 2 commits October 11, 2020 19:19

Fix dictionary-in-stream compression code

9134b48

dict_in_stream: dict frames must not use dicts

0d62e19

yotann marked this pull request as ready for review November 4, 2020 16:23

klauspost reviewed Jun 21, 2021

View reviewed changes

klauspost reviewed Oct 7, 2021

View reviewed changes

JustAnotherArchivist mentioned this pull request Nov 23, 2021

CLI should fail if the dictionary file provided in -D does not contain a valid dictionary #2873

Closed

JustAnotherArchivist mentioned this pull request Jun 11, 2023

Dictionary mismatch #3655

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add dictionary-in-stream format in contrib #2349

Add dictionary-in-stream format in contrib #2349

yotann commented Oct 11, 2020

facebook-github-bot commented Oct 11, 2020

facebook-github-bot commented Oct 11, 2020

Cyan4973 commented Oct 13, 2020

klauspost Jun 21, 2021

JustAnotherArchivist Sep 15, 2021

klauspost Sep 15, 2021

klauspost Oct 7, 2021

klauspost commented Jun 21, 2021 •

edited

Loading

yotann commented Sep 15, 2021

JustAnotherArchivist commented Sep 15, 2021 •

edited

Loading

klauspost commented Sep 15, 2021

tomberek commented Sep 24, 2021

yotann commented Sep 30, 2021

tomberek commented Sep 30, 2021

klauspost commented Oct 5, 2021 •

edited

Loading

klauspost Oct 7, 2021

klauspost Oct 7, 2021 •

edited

Loading

		a normal Zstandard stream. All compressed frames in the stream must have been
		compressed using the dictionary.

		This document defines a format for including a Zstandard dictionary inside a
		compressed Zstandard stream. When combined with the


		__`Compressed_or_Uncompressed_Dictionary`__

		The dictionary data, which may optionally be compressed.

	The dictionary data, which may optionally be compressed.
	The dictionary data, which may optionally be compressed.
	This __must__ start with either little-endian 0xEC30A437 or 0xFD2FB528, otherwise is should be considered a regular skippable block.

Add dictionary-in-stream format in contrib #2349

Are you sure you want to change the base?

Add dictionary-in-stream format in contrib #2349

Conversation

yotann commented Oct 11, 2020

facebook-github-bot commented Oct 11, 2020

facebook-github-bot commented Oct 11, 2020

Cyan4973 commented Oct 13, 2020

klauspost Jun 21, 2021

Choose a reason for hiding this comment

JustAnotherArchivist Sep 15, 2021

Choose a reason for hiding this comment

klauspost Sep 15, 2021

Choose a reason for hiding this comment

klauspost Oct 7, 2021

Choose a reason for hiding this comment

klauspost commented Jun 21, 2021 • edited Loading

yotann commented Sep 15, 2021

JustAnotherArchivist commented Sep 15, 2021 • edited Loading

klauspost commented Sep 15, 2021

tomberek commented Sep 24, 2021

yotann commented Sep 30, 2021

tomberek commented Sep 30, 2021

klauspost commented Oct 5, 2021 • edited Loading

Seek_Table_Entries

klauspost Oct 7, 2021

Choose a reason for hiding this comment

klauspost Oct 7, 2021 • edited Loading

Choose a reason for hiding this comment

klauspost commented Jun 21, 2021 •

edited

Loading

JustAnotherArchivist commented Sep 15, 2021 •

edited

Loading

klauspost commented Oct 5, 2021 •

edited

Loading

`Seek_Table_Entries`

klauspost Oct 7, 2021 •

edited

Loading