Determine threat models to call data compression #234

trianglesphere · 2022-01-28T18:39:55Z

trianglesphere
Jan 28, 2022

Prior discussion here: There is prior discussion here: #10

Things to consider

Gas metering
Uncompressed input that balloons the compressed inpu
Compressed input that serves as a zip bomb
Other weird things - multiple compressed outputs map to 1 decompression

trianglesphere · 2022-02-02T22:44:41Z

trianglesphere
Feb 2, 2022
Author

Summary of slack conversation:

Using a complex compression algorithms, particularly over batches of transactions makes it hard to figure out how to charge users.

Some ideas that were floated:

Charge for uncompressed size, not compressed size - easiest, but users don't get savings
Approximate the marginal increase in compressed size - for some algorithms this is trivial, for others it is incredibly hard. If the approximation is wrong users will be able to cost us more in fees than they pay
It is possible to refund fees at the end of the batch to avoid brittle logic in the execution engine, but this may not be desirable from a charging fees point of view.
It might be possible to construct a relatively simple algorithm that estimates entropy. This entropy could be paired with a block weighted compression efficacy metric.

Notes / drawbacks

I think determining the marginal increase in compressed size is not an easy task and is not something that can easily or efficiently be done in online processing with most compression algorithms
Including a "compression oracle" - i.e. charge users for compressed size could be a little brittle as the L2 execution engine intrinsic gas function must be changed.
Estimating intrinsic gas usage will definitely be harder
"Economic Attack": Users create a transaction that the fees paid as determined by the compression oracle do not match up against the effect that the user's transaction had on the compressed size. This is most likely to occur when estimating the effect with batch level compression.

My proposed options:

zstd with dictionary per tx. Gets about a 2x compression factor. Easy “compression oracle”
zstd with dict per batch: Gets about 2.68x factor (no dict is 2.5x). I could look into this more, but a potentially usable compression oracle is to copy Ori’s idea about keeping a scalar of how effective the compression is.
zero-byte per tx: Also easy to compute. Gets about a 1.6x factor. It’s the easiest to deal with, but also I think it’s worth the hassle to use a better algorithms

0 replies

trianglesphere · 2022-02-02T23:04:00Z

trianglesphere
Feb 2, 2022
Author

With batch level compression, it's clear that it's impossible to accurately determine the fees until "runtime" (transaction inclusion time). I suspect even then it's close to impossible to perfectly determine the fees.

I guess then the big decision for product are:

Per transaction compression
Per batch compression with exact gas fees
Per batch compression with estimated fees

Per Transaction
Pros

Very easy to compute what the user will pay in Gas
Not vulnerable to economic attacks (mismatch between what was charged and the batch submitter's cost)

Cons

Only get 1.7 to 2x compression factor (depends on algorithm)
May have to update multiple places for this

Per Batch
Pros

Better compression: ~2.5x compression factor

Cons of exact fees

Users either do not know fee until transaction execution time
This might actually reduce the compression ratio

Cons of estimated fees

this method is vulnerable to an economic attack. I believe it can be bounded, but some amount of slippage is possible.

Please let me know if there's another approach that you think I'm missing or if the pros/cons don't feel right.

0 replies

ben-chain · 2022-02-02T23:47:39Z

ben-chain
Feb 2, 2022

@trianglesphere this is great, I agree that these are probably the "big bucket" decisions. I can identify another tangential decision, which is how fee payment occurs. I can see worlds where:

Transaction fees are calculated up front to include the compression savings, and this is reflected when estimating fees e.g. using the OP js sdk
Transaction fees are charged up front as if uncompressed, and then refunded via system transactions.

The second option has advantages for implementation (can contain everything in the block derivation logic where the compression actually happens), but may be a worse product.

Separately, a couple of questions:

When you wrote out "My proposed options" in this comment, do those correspond to the 3 options in the next comment? If so, a question: are these options all generating a dict on-the-fly, at compression time, that gets submitted as part of the compressed data? Is there a way to have a static dict generated from a corpus of old transactions, and would that get the per-tx compression multiplier any closer to the per-batch compression?
When you say you believe that economic attack can be bounded, is the bound fixed even over time (e.g. system notices compression is failing, dials back params) or is it a bound % of fee volume?
How tightly coupled would the "compression estimator" function be to the compression algo? Do these have to be updated together to work well?

0 replies

trianglesphere · 2022-02-03T00:02:20Z

trianglesphere
Feb 3, 2022
Author

Separately, a couple of questions:

When you wrote out "My proposed options" in this comment, do those correspond to the 3 options in the next comment? If so, a question: are these options all generating a dict on-the-fly, at compression time, that gets submitted as part of the compressed data? Is there a way to have a static dict generated from a corpus of old transactions, and would that get the per-tx compression multiplier any closer to the per-batch compression?

Not quite. The second three are a refinement of the first three in terms of how to ask product (sorry for the confusion).
zstd/zero byte per tx (1&3 in the original) correspond to the first option to product. The zstd per batch (2 in the first) corresponds to the 2&3 for product (the difference is that the first recommendation didn't have as much about how to actually decide what fees would be).

When you say you believe that economic attack can be bounded, is the bound fixed even over time (e.g. system notices compression is failing, dials back params) or is it a bound % of fee volume?
Both? I think there are two components: a time weighted parameter that adjusts under attack. That's critical for a long term attack, but even when the attack is starting, I think it'd be hard for there to be more than a 2.5x difference in what we charge users vs what gets posted (and this delta is likely to be smaller in practice).

I'm not sure on the exact loss that we can limit it to, but we'll never get in s state where we are undercharging by a factor of 10 or something like that.

How tightly coupled would the "compression estimator" function be to the compression algo? Do these have to be updated together to work well?
It depends how complex in the estimator is. In practice if we update the algorithm, we would at least want to check if not upgrade the estimator. It's also possible that we would want to upgrade the estimator more than the compression algorithm.

0 replies

trianglesphere · 2022-02-03T00:04:20Z

trianglesphere
Feb 3, 2022
Author

I agree that these are probably the "big bucket" decisions. I can identify another tangential decision, which is how fee payment occurs. I can see worlds where:

Transaction fees are calculated up front to include the compression savings, and this is reflected when estimating fees e.g. using the OP js sdk

Transaction fees are charged up front as if uncompressed, and then refunded via system transactions.

Yep. It seems like we are shying away from the refund, but it's definitely an option worth exploring. I don't think it actually make "perfect" compressor estimation possible (without a huge overhead), but it does mean that the protocol itself cannot be exploited.

0 replies

norswap · 2022-02-03T00:38:13Z

norswap
Feb 3, 2022

I think determining the marginal increase in compressed size is not an easy task and is not something that can easily or efficiently be done in online processing with most compression algorithms

I'd like to nuance this somewhat, I think it's feasible under a few conditions.

Two scenarios:

the transaction is sent to be executed.

Then in a streaming algorithm (the algorithm lets you supply the data bit by bit and not only all at once), it should be feasible.
- You need a way to retrieve the size of the compressed stream (and of the dictionary, if that's not included).
  - It's probably (?) feasible to modify an implementation to add this.

the transaction is sent to be simulated.
- Here the difficulty is that we want to "pretend" to add the transaction to the compressed data, but not actual commit the modification.
  - It's feasible, but could be a a lot more hairy, depending on the implementation.

Note that a property of this scheme is that transaction tend to become cheaper as the length of the batch increases. The first transaction effectively gets the same compression as if it was compressed on its own, but each successive entry benefits from previous dictionary entries.

Including a "compression oracle" - i.e. charge users for compressed size could be a little brittle as the L2 execution engine intrinsic gas function must be changed.

I'm not exactly sure what you mean here. Intrinsic gas in L1 is 21k + calldata cost. What is it currently on L2 (where we do charge the "L1 fee" for calldata posted on L1) and how will this proposition impact that? (My impression being that it's already changed, so we'd just be changing differently.) What does it mean to be "brittle" in this context?

0 replies

trianglesphere · 2022-02-03T00:48:21Z

trianglesphere
Feb 3, 2022
Author

I'm not exactly sure what you mean here. Intrinsic gas in L1 is 21k + calldata cost. What is it currently on L2 (where we do charge the "L1 fee" for calldata posted on L1) and how will this proposition impact that? (My impression being that it's already changed, so we'd just be changing differently.) What does it mean to be "brittle" in this context?

This was rephrasing Ben's brittle comment. If we change how charge for L2 calldata it means having to keep multiple different sources in sync.

0 replies

trianglesphere · 2022-02-03T00:56:39Z

trianglesphere
Feb 3, 2022
Author

@norswap So both zlib and zstd support streaming compression algorithms with flushing. I'm a little worried that excessive flushing would cause problems, but that is something that I can easily benchmark so I should just stop speculating.

For estimating the fees I'm more comfortable guessing based on how the transaction itself compresses and what the current compression ratio is on mainnet. This is something also worth benchmarking with the dictionary versions (a big hope is that they help with the successive tx problem).

0 replies

norswap · 2022-02-03T01:08:28Z

norswap
Feb 3, 2022

Indeed! I like the idea of charging the per-tx compression and compressing per batch anyway.

We could refund the difference but it's tricky to know how to split it, and frankly I doubt it will affect price perception much, so we might as well keep it and do something more impactful with it. Who knows, maybe long term this compression advantage becomes the entirety of the sequencer's profits?

0 replies

trianglesphere · 2022-02-03T01:14:31Z

trianglesphere
Feb 3, 2022
Author

Yea, compressing the tx in isolation is one way of getting a decent estimation of it's weight. Maybe you charge based on the fraction tx_compressed_size/sum_individual_compressed_size - that's a little better than using uncompressed sizes for the fraction but still close enough

0 replies

trianglesphere · 2022-02-03T01:15:32Z

trianglesphere
Feb 3, 2022
Author

And flushing does slightly reduce compressed size, but not as much as I thought

zlib, total                     299967775       2.49x
zlib, batches/flush             327750890       2.28x
zlib dict, total                296083645       2.52x
zlib dict, batches/flush        323635975       2.31x
zstd, total                     299898133       2.49x
zstd, batches/flush             326573794       2.29x
zstd dict2, total               278905351       2.68x   
zstd dict2, batches/flush       306103201       2.44x

0 replies

esaulpaugh · 2022-07-04T15:03:59Z

esaulpaugh
Jul 4, 2022

My library deserialises ABI arguments with RLP by using the ABI schema to differentiate between signed and unsigned values. Could that not work?

The RLP argument in:

java -jar headlong-cli-1.0.jar -me "(function[2][][],bytes24,string[1][1],address[],uint72,(uint8),(int16)[2][][1],(int32)[],uint40,(int48)[],(uint),bool,string,bool[2],int24[],uint40[1])" "f4f3f298191c766e29a65787b7155dd05f41292438467db93420cade98191c766e29a65787b7155dd05f41292438467db93420cade98191c766e29a65787b7155dd05f41292438467db93420cadec2c17ad594ff00ee01dd02cc03cafebabe990688077708660989fdfffffffffffffe04c107c8c7c6c109c382fff5c8c111c584ffffffed85fca527923bcac17ec786ffffffffff82c10a01866661726f7574c20101c6031483fffffac584fffffffe"

Unpacks to ABI:

0000000000000000000000000000000000000000000000000000000000000220191c766e29a65787b7155dd05f41292438467db93420cade000000000000000000000000000000000000000000000000000000000000000000000000000002c000000000000000000000000000000000000000000000000000000000000003400000000000000000000000000000000000000000000000fdfffffffffffffe04000000000000000000000000000000000000000000000000000000000000000700000000000000000000000000000000000000000000000000000000000003800000000000000000000000000000000000000000000000000000000000000400000000000000000000000000000000000000000000000000000000fca527923b0000000000000000000000000000000000000000000000000000000000000460000000000000000000000000000000000000000000000000000000000000000a000000000000000000000000000000000000000000000000000000000000000100000000000000000000000000000000000000000000000000000000000004c000000000000000000000000000000000000000000000000000000000000000010000000000000000000000000000000000000000000000000000000000000001000000000000000000000000000000000000000000000000000000000000050000000000000000000000000000000000000000000000000000000000fffffffe000000000000000000000000000000000000000000000000000000000000000100000000000000000000000000000000000000000000000000000000000000200000000000000000000000000000000000000000000000000000000000000001191c766e29a65787b7155dd05f41292438467db93420cade0000000000000000191c766e29a65787b7155dd05f41292438467db93420cade00000000000000000000000000000000000000000000000000000000000000000000000000000020000000000000000000000000000000000000000000000000000000000000002000000000000000000000000000000000000000000000000000000000000000017a000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001000000000000000000000000ff00ee01dd02cc03cafebabe9906880777086609000000000000000000000000000000000000000000000000000000000000002000000000000000000000000000000000000000000000000000000000000000010000000000000000000000000000000000000000000000000000000000000009fffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff500000000000000000000000000000000000000000000000000000000000000020000000000000000000000000000000000000000000000000000000000000011ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffed0000000000000000000000000000000000000000000000000000000000000002000000000000000000000000000000000000000000000000000000000000007effffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff8200000000000000000000000000000000000000000000000000000000000000066661726f75740000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000300000000000000000000000000000000000000000000000000000000000000030000000000000000000000000000000000000000000000000000000000000014fffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffa

https://github.com/esaulpaugh/headlong/blob/master/src/main/java/com/esaulpaugh/headlong/util/SuperSerial.java

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Determine threat models to call data compression #234

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 12 comments

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Select a reply

Determine threat models to call data compression #234

trianglesphere Jan 28, 2022

Replies: 12 comments

trianglesphere Feb 2, 2022 Author

trianglesphere Feb 2, 2022 Author

ben-chain Feb 2, 2022

trianglesphere Feb 3, 2022 Author

trianglesphere Feb 3, 2022 Author

norswap Feb 3, 2022

trianglesphere Feb 3, 2022 Author

trianglesphere Feb 3, 2022 Author

norswap Feb 3, 2022

trianglesphere Feb 3, 2022 Author

trianglesphere Feb 3, 2022 Author

esaulpaugh Jul 4, 2022

trianglesphere
Jan 28, 2022

trianglesphere
Feb 2, 2022
Author

trianglesphere
Feb 2, 2022
Author

ben-chain
Feb 2, 2022

trianglesphere
Feb 3, 2022
Author

trianglesphere
Feb 3, 2022
Author

norswap
Feb 3, 2022

trianglesphere
Feb 3, 2022
Author

trianglesphere
Feb 3, 2022
Author

norswap
Feb 3, 2022

trianglesphere
Feb 3, 2022
Author

trianglesphere
Feb 3, 2022
Author

esaulpaugh
Jul 4, 2022