Block splitter #4136
Conversation
mmmh,
I suspect the problem is related to the macro; I'm not completely sure what the wanted behavior is here...
edit: confirmed that, when I change the default block size to anything other than 128 KB, it breaks this test.
Who knew adding a single source file could require so much work: currently blocked trying to get the single-file library builder to work, and then each and every build system also requires updating its own list of files, in its own format and location.
Weird stuff:
It only happens during compilation of one specific unit, and the failure seems to correspond to one specific location.
And of course, it happens all the time on GitHub CI, but not on any other system where I can test the same code and build rule.
Force-pushed from a4c653c to 904fa69 (compare)
All tests passed, ready for review
Reviewed everything except zstd_preSplit.[hc]
CC=clang CFLAGS="-O3 -m32" FUZZER_FLAGS="--long-tests" make asan-fuzztest

# The following test seems to have issues on github CI specifically,
# it does not provide the `__mulodi4` instruction emulation
Does it have issues before this PR, or only after? If it is the latter, this might also have issues in the kernel, and we would need to do something similar to what we do with `ZSTD_div64()` (line 82 in b880f20):

#define ZSTD_div64(dividend, divisor) ((dividend) / (divisor))
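For illustration, a hedged sketch of what "something similar" might look like: a multiplication wrapper by analogy with `ZSTD_div64()`. The name `ZSTD_mul64` is hypothetical, not an existing zstd macro; a kernel build could override it with a helper that avoids the instrumented 64-bit multiply:

/* hypothetical analog of ZSTD_div64(): defaults to plain multiplication,
 * overridable by environments lacking the __mulodi4 emulation */
#define ZSTD_mul64(a, b) ((a) * (b))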
It looks like this might be a compiler bug.
Yes, the issue doesn't reproduce on my local Linux workstation.
I suspect a bug in the version of `ubsan` for `clang` currently deployed within GitHub CI.
It's likely that this issue will be fixed at some unspecified time in the future, but we can't wait for that: the CI needs to continue running. We'll re-enable this test when it can work properly on GitHub CI again.
Also, temporarily dropping the `ubsan` test for `clang` on 32-bit x86 targets seems like a minor inconvenience, given that:
- we mostly care about 64-bit `x64`
- `ubsan` 32-bit is still running in CI with `gcc`
- we keep the more important `asan` test for 32-bit on `clang`
lib/common/compiler.h
Outdated
@@ -278,6 +278,8 @@
 * Alignment check
 *****************************************************************/

+#define ZSTD_IS_POWER_2(a) (((a) & ((a)-1)) == 0)
nit: We don't need this for compile-time constants like array bounds; could we make this a function `ZSTD_isPow2()`?
done
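A minimal sketch of the requested helper (the exact name and placement in the merged code may differ):

/* function form of the power-of-2 check; same result as the macro,
 * including for 0 (returns 1) */
MEM_STATIC int ZSTD_isPow2(size_t v) { return (v & (v - 1)) == 0; }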
lib/compress/zstd_compress.c
Outdated
 * heuristic, tested as being "generally better".
 * do not split incompressible data though: respect the 3 bytes per block overhead limit.
 */
return savings ? 92 KB : 128 KB;
This looks like it is splitting if the savings is non-zero, so it will split if the savings is negative. Is that correct? If so, why do we want to do that?
Nope, `savings` will only be `> 0` if there were some observed savings (aka, a smaller `cSize` than `srcSize`) from previous blocks.
It also means that, for the first block, it will never split.
But this currently will also split if the savings is negative, which doesn't seem like the desired behavior. Is this what you intended?
Suggested change:
- return savings ? 92 KB : 128 KB;
+ return savings > 0 ? 92 KB : 128 KB;
Additionally, we need to compute how much overhead this can possibly add, and ensure that this block split can never grow us beyond the compress bound. E.g.
Suggested change:
- return savings ? 92 KB : 128 KB;
+ return savings >= 3 ? 92 KB : 128 KB;
done
> But this currently will also split if the savings is negative, which doesn't seem like the desired behavior. Is this what you intended?

That's a very good point.
In my mind, `savings` was boxed to remain positive, which it is as long as it remains within `compress_frameChunk()`. But as soon as we leave that function and return (streaming scenario), `savings` is regenerated from actual bytes consumed and written, which can be negative if data was so far incompressible.
This is fixed in the newer version, thanks to the `savings >= 3` test.
Great catch @terrelln !
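As a hedged illustration of the streaming scenario described above (field names borrowed from `ZSTD_CCtx`'s progression counters; the actual regeneration code may differ):

/* in streaming mode, savings is recomputed from running totals,
 * so a history of incompressible data makes it negative */
S64 const savings = (S64)cctx->consumedSrcSize - (S64)cctx->producedCSize;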
lib/compress/zstd_compress.c
Outdated
/* note: conservatively only split full blocks (128 KB) currently,
 * and even then only if there is more than 128 KB input remaining.
 */
if (srcSize <= 128 KB || blockSizeMax < 128 KB)
Does this mean that the block splitter won't work in streaming mode? What if the user passes in exactly 128 KB at a time?
Indeed, if the user passes exactly 128 KB at a time, block splitting will not be triggered.
I don't think that is a good idea. In my model of the streaming API, how you chunk your data doesn't impact the compression, with the exception of the final block. This seems like it could introduce confusing behavior. Can we get rid of this restriction?
OK, I'll look into it.
All full blocks are now sent to the splitter function, irrespective of the presence of additional data beyond the full block.
lib/compress/zstd_compress.c
Outdated
@@ -4556,7 +4584,7 @@ static size_t ZSTD_compress_frameChunk(ZSTD_CCtx* cctx,
        }
    }   /* if (ZSTD_useTargetCBlockSize(&cctx->appliedParams))*/

    if (cSize < blockSize) savings += (blockSize - cSize);
Do we also need to subtract savings if we grow the block? If we aren't precisely tracking the `savings`, I can see cases where an adversarial pattern could cause compression to fail.
OK, we can add some malus when a block is not compressed.
This will be a conservative over-estimate, but that's not important; we just want to make sure there is no possible scenario that could expand data by more than 3 bytes per full block.
> OK, we can add some malus when a block is not compressed.

I don't like this solution because it is a patch that hides the problem, and I don't see how it guarantees that the issue can never occur. I'd prefer to precisely calculate the `savings`, so we can know exactly if we are allowed to split or not.
I've added a paragraph which explains why a 1-byte malus to `savings` for each incompressible block is enough to guarantee that the 3-bytes-per-128 KB expansion limit cannot be breached.
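A hedged sketch of the accounting described above (simplified; the merged code later switched to precise tracking, shown further down in this thread):

if (cSize < blockSize) {
    savings += (S64)(blockSize - cSize);  /* compressed block: add observed savings */
} else {
    savings -= 1;  /* incompressible block: 1-byte malus, conservatively bounding expansion */
}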
 * with alignment control
 * Note : should happen only once, at workspace first initialization
 */
MEM_STATIC void* ZSTD_cwksp_reserve_object_aligned(ZSTD_cwksp* ws, size_t byteSize, size_t alignment)
Why do we need alignment smaller than 64? It seems it might be simpler to just keep 64-byte alignment on all allocations.
I guess it could also be done this way.
As align-8 is sufficient, align-64 felt like overkill.
Agreed. Even if this is a desirable change (why?), this seems like it could be done in a different PR.
The new `TMP_WORKSPACE` requires an 8-byte alignment.
The original `ZSTD_cwksp_reserve_object()` that was used here aligns on `sizeof(void*)`, which was breaking tests on 32-bit systems.
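A minimal usage sketch of the new allocator, assuming the sizes and names mentioned in this thread:

/* reserve the temporary workspace with explicit 8-byte alignment,
 * instead of the sizeof(void*) alignment of ZSTD_cwksp_reserve_object() */
void* const tmpWs = ZSTD_cwksp_reserve_object_aligned(&cctx->workspace, TMP_WORKSPACE_SIZE, 8);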
lib/compress/zstd_preSplit.c
Outdated
typedef struct {
    int events[HASHTABLESIZE];
    S64 nbEvents;
} FingerPrint;
nit: Just one word
Suggested change:
- } FingerPrint;
+ } Fingerprint;
done
lib/compress/zstd_preSplit.c
Outdated
#define CHUNKSIZE (8 << 10)

/* Note: technically, we use CHUNKSIZE, so that's 8 KB */
size_t ZSTD_splitBlock_4k(const void* src, size_t srcSize,
nit: Can we pick a better name, since as noted we're splitting by 8 KB currently? Maybe `ZSTD_splitBlock_byFixedChunks()`?
Yes, this name is fixed in the next patch.
I went ahead and merged the next patch (`sample5`) into this feature branch, so that it also updates the naming convention in the process.
lib/compress/zstd_preSplit.c
Outdated
initStats(fpstats);
recordFingerprint(&fpstats->pastEvents, p, CHUNKSIZE);
for (pos = CHUNKSIZE; pos < blockSizeMax; pos += CHUNKSIZE) {
    assert(pos <= blockSizeMax - CHUNKSIZE);
I know this holds because `blockSizeMax == (128 << 10)`, but can we fix the looping logic to guarantee this instead?
It would be cleaner to have this function support any `blockSizeMax`. Given that the check in the outer loop isn't super performance critical, I'd prefer clarity and explicit checks rather than assumptions.
done
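A sketch of what the fixed looping logic could look like (assumed shape; the merged code may differ): bound the loop so that a full CHUNKSIZE always remains, making the assert hold for any blockSizeMax:

for (pos = CHUNKSIZE; pos + CHUNKSIZE <= blockSizeMax; pos += CHUNKSIZE) {
    assert(pos <= blockSizeMax - CHUNKSIZE);  /* now guaranteed by the loop condition */
    recordFingerprint(&fpstats->newEvents, p + pos, CHUNKSIZE);
    /* ... split decision on fpstats ... */
}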
lib/compress/zstd_preSplit.c
Outdated
} else {
    mergeEvents(&fpstats->pastEvents, &fpstats->newEvents);
    ZSTD_memset(&fpstats->newEvents, 0, sizeof(fpstats->newEvents));
    penalty = penalty - 1 + (penalty == 0);
nit: Let's rewrite for clarity
Suggested change:
- penalty = penalty - 1 + (penalty == 0);
+ if (penalty > 0) {
+     --penalty;
+ }
lib/compress/zstd_cwksp.h
Outdated
@@ -272,7 +276,7 @@ MEM_STATIC size_t ZSTD_cwksp_bytes_to_align_ptr(void* ptr, const size_t alignByt
 * which we can allocate from the end of the workspace.
 */
MEM_STATIC void* ZSTD_cwksp_initialAllocStart(ZSTD_cwksp* ws) {
-    return (void*)((size_t)ws->workspaceEnd & ~(ZSTD_CWKSP_ALIGNMENT_BYTES-1));
+    return (void*)((size_t)ws->workspaceEnd & (size_t)~(ZSTD_CWKSP_ALIGNMENT_BYTES-1));
I don't think this cast is actually correct. You need to cast inside the not-expression, if your goal is to ensure that the not doesn't happen on a narrower type that then gets zero-extended during promotion to `size_t`.
Also, should we be casting to `intptr_t`, not `size_t`?
> Also, should we be casting to `intptr_t`, not `size_t`?

Unfortunately, `intptr_t` is not guaranteed to exist, while `size_t` is.
Also, since this is just for alignment checks, we only care about the last bits (realistically, the last 6 bits at most, for cache-line alignment). So we could even use a smaller type, but `size_t` is good enough already.
rewrote `ZSTD_cwksp_initialAllocStart()` for clarity
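For illustration, a sketch of one unambiguous form (not necessarily the exact merged rewrite): computing the complement directly in `size_t` removes any dependence on the width or signedness of the intermediate type:

/* if ZSTD_CWKSP_ALIGNMENT_BYTES-1 is an int, ~ operates on int;
 * casting the operand first makes the complement happen in size_t,
 * so all high bits of the mask are guaranteed to be set */
size_t const mask = ~(size_t)(ZSTD_CWKSP_ALIGNMENT_BYTES - 1);
return (void*)((size_t)ws->workspaceEnd & mask);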
lib/compress/zstd_compress.c
Outdated
@@ -137,11 +137,12 @@ ZSTD_CCtx* ZSTD_initStaticCCtx(void* workspace, size_t workspaceSize)
    ZSTD_cwksp_move(&cctx->workspace, &ws);
    cctx->staticSize = workspaceSize;

-   /* statically sized space. entropyWorkspace never moves (but prev/next block swap places) */
+   /* statically sized space. tmpWorkspace never moves (but prev/next block swap places) */
    if (!ZSTD_cwksp_check_available(&cctx->workspace, ENTROPY_WORKSPACE_SIZE + 2 * sizeof(ZSTD_compressedBlockState_t))) return NULL;
Don't we basically have to replace every usage of `ENTROPY_WORKSPACE_SIZE` with `TMP_WORKSPACE_SIZE`? Including here?
Not every instance, since in some cases it's really "just" about entropy, but quite possibly here. This might be a leftover.
edit: yes, it was a leftover, to be fixed.
Great catch @felixhandte !
 * with alignment control
 * Note : should happen only once, at workspace first initialization
 */
MEM_STATIC void* ZSTD_cwksp_reserve_object_aligned(ZSTD_cwksp* ws, size_t byteSize, size_t alignment)
Agreed. Even if this is a desirable change (why?), this seems like it could be done in a different PR.
Yeah, I think you should be able to avoid all of these cwksp changes if you just switch the entropy workspace allocation from being in the objects group (and allocation phase) to being in the aligned group.
This should just mean moving the allocation down into `ZSTD_reset_matchState()`, where we do the other aligned allocations, as well as changing (zstd/lib/compress/zstd_compress.c, line 1709 in b880f20):

size_t const entropySpace = ZSTD_cwksp_alloc_size(ENTROPY_WORKSPACE_SIZE);

to use `ZSTD_cwksp_aligned_alloc_size()`.
The `ZSTD_SLIPBLOCK_WORKSPACESIZE` is 8208, which is not an even multiple of 64. So it does mean we'll grow the workspace to a multiple of 64 and throw away 48 bytes. We could look at improving this in a follow-up PR.
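Concretely, the suggested one-line change would look something like this (a sketch based on the quoted line):

/* account for the entropy/tmp workspace with aligned-group sizing,
 * so it can be allocated alongside the other aligned allocations */
size_t const entropySpace = ZSTD_cwksp_aligned_alloc_size(ENTROPY_WORKSPACE_SIZE);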
Addressed the last comment by removing the need for 64-bit members in the workspace structure, so that it no longer needs 8-byte alignment. Note that signed 64-bit multiplications are still needed for the distance function, so this doesn't address the problem of the bug in the GitHub CI `ubsan`.
All comments addressed
Commit notes:
- instead of ingesting only full blocks, make an analysis of the data, and infer where to split.
- for better portability on the Linux kernel
- though I really wonder if this is a property worth maintaining.
- for non-64-bit systems
- let's fill the initial stats directly into the target fingerprint
- that samples 1 in 5 positions. This variant is fast enough for lazy2 and btlazy2, but it's less good in combination with the post-splitter at higher levels (>= btopt).
- reported by @terrelln
- …loss ensure data can never be expanded by more than 3 bytes per full block.
- suggested by @terrelln
- suggested by @terrelln
- following a discussion with @felixhandte
- detected by @felixhandte
- due to integration of `sample5` strategy, leading to better compression ratios on a range of levels
- strict C90 compliance test
- so that it can be stored using the standard alignment requirement (sizeof(void*)). The distance function still requires 64-bit signed multiplication though, so it won't change the issue regarding the bug in ubsan for clang 32-bit on github ci.
The logic in `ZSTD_compress_frameChunk()` doesn't look like it guarantees that it respects `ZSTD_compressBound()`. It looks like it is very likely to respect the bound, but I'm positive that the fuzzers will eventually find a case where it doesn't.
this helps make the streaming behavior more consistent, since it no longer depends on having more data presented on the input. Suggested by @terrelln.
I updated the blind-split strategy with the recommended updates from the code review.
That being said, the blind-split strategy was proposed initially as a way to leverage "something" from the capability to ingest partial blocks at lower levels, for which the initial split heuristic was not fast enough. Now that the execution plan seems able to offer block splitting all the way down to level 1, the blind-split strategy is less important, essentially reserved to negative compression levels.
Issue is, for negative compression levels, the blind-split strategy isn't effective. This is likely because negative compression levels drop the huffman compression of literals, which is where it had the most impact.
So, if it's not useful anywhere, it's debatable if this split strategy deserves to stay in the code. An alternative could be to just drop it.
instead of just for blind split. This is in anticipation of adversarial input that would intentionally target the sampling pattern of the split detector. Note that, even without this protection, splitting can never expand beyond ZSTD_COMPRESSBOUND(), because this upper limit uses a 1 KB block size worst-case scenario, and splitting never creates blocks that small. The protection is more to ensure that data is not expanded by more than 3 bytes per 128 KB full block, which is a much stricter limit.
the first block is no longer split since adding the `savings` over-split protection
I updated the policy around `savings`, applying the over-split protection in all cases instead of just for blind split. This is in anticipation of some adversarial input that would take advantage of the sampling pattern of faster split detectors. Note that, even without this protection, the `ZSTD_COMPRESSBOUND()` limit cannot be breached. The problem is more about the stricter limit of 3 bytes of expansion per 128 KB full block.
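As a rough worked illustration of that stricter limit (a hypothetical helper, using only the numbers from this thread):

/* upper bound on growth from splitting: at most 3 bytes of block-header
 * overhead per full 128 KB block of input */
static size_t maxSplitGrowth(size_t srcSize)
{
    size_t const nbFullBlocks = (srcSize + (128 << 10) - 1) / (128 << 10);
    return 3 * nbFullBlocks;
}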
 * otherwise only full blocks are used.
 * But being conservative is fine,
 * since splitting barely compressible blocks is not fruitful anyway */
savings += (S64)blockSize - (S64)cSize;
Thanks for fixing this! IMO this method is also clearer, because it explicitly tracks exactly the savings from compression.
Instead of ingesting full blocks only (128 KB), make an a-priori analysis of the data, and infer a position to split a block at a more appropriate boundary. This can notably happen in an archive scenario, at the boundary between 2 files of different nature within the archive. This leads to some non-trivial compression gains, for a correspondingly acceptable speed cost.
The benefit is higher when there isn't already a post-splitter (which exists at `btopt` levels and above (16+)), but even when a post-splitter is active, there is still some small compression ratio benefit, making this strategy desirable even for higher compression modes.
However, this input analysis is not free. The initial variant is currently reserved for higher compression strategies (>= `btopt`), where it combines well with the post-splitter operating on known sequences. A second, faster variant is applied to the middle strategies `btlazy2` and `lazy2`, where it improves compressed size by ~300 KB. It achieves its faster speed by sampling the input, instead of scanning it sequentially. For faster modes, the analysis is skipped and replaced by a static split size (no longer limited to 128 KB only). Through tests, it appears that a static 92 KB block size brings a higher compression ratio, at a small-to-negligible compression speed loss (mostly due to an increased number of blocks, hence of block headers).
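To make the a-priori analysis concrete, here is a simplified, hedged sketch of the chunk-fingerprint idea (illustrative only; the actual `zstd_preSplit.c` differs in hashing, sampling, and thresholds — note the signed 64-bit products, matching the distance-function remark above):

#include <stddef.h>

#define CHUNKSIZE (8 << 10)

typedef struct { unsigned events[256]; unsigned nbEvents; } Fingerprint;

/* histogram the byte values of one chunk */
static void recordFp(Fingerprint* fp, const unsigned char* p, size_t size)
{
    size_t n;
    for (n = 0; n < size; n++) fp->events[p[n]]++;
    fp->nbEvents += (unsigned)size;
}

/* cross-normalized L1 distance between two histograms:
 * a large value suggests a change of data nature => candidate split point */
static unsigned long long fpDistance(const Fingerprint* a, const Fingerprint* b)
{
    unsigned long long total = 0;
    int i;
    for (i = 0; i < 256; i++) {
        long long const d = (long long)a->events[i] * b->nbEvents
                          - (long long)b->events[i] * a->nbEvents;
        total += (unsigned long long)(d < 0 ? -d : d);
    }
    return total;
}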
Benchmarks focusing on compression savings were run on `silesia.tar` and `calgary.tar`, comparing `dev` against this PR.
Follow-up: make the `splitting` strategy selectable via a compression parameter.