feat: more production ready gateway batcher design #227
Conversation
Pull request overview
This pull request significantly refactors the World ID Gateway's batching architecture to make it more production-ready. The changes introduce an event-driven system with a global EventMultiplexer, adaptive gas-based batch sizing, operation retry logic, and improved monitoring capabilities.
Changes:
- Introduces EventMultiplexer pattern for decoupled event handling with metrics, logging, and status tracking
- Implements adaptive batch sizing based on chain conditions (gas policy and chain monitor)
- Adds retry logic for operations that fail due to transient RPC errors
- Refactors configuration to use a shared `common` crate for provider management
- Adds comprehensive load testing infrastructure
Reviewed changes
Copilot reviewed 37 out of 38 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| services/gateway/src/batcher/ | New unified batching system with gas policy, chain monitoring, and event-driven architecture |
| services/gateway/src/routes/middleware.rs | New validation and simulation middleware for operation routes |
| services/gateway/src/types.rs | Refactored app state to use EventMultiplexer and registry instance |
| services/gateway/src/routes.rs | Updated to build event bus and ops batcher with new architecture |
| services/common/ | New shared crate for provider configuration and management |
| services/gateway/tests/load_test.rs | Comprehensive load testing suite |
| services/gateway/Cargo.toml | Added dependencies for metrics, OTLP, and event handling |
Force-pushed from d5e3d23 to fd667f6.
```rust
let mc = Multicall3::new(MULTICALL3, provider.clone());

loop {
    // 1. Simulate with allowFailure=true to identify failing ops
```
does this add value? how often do we expect failing ops?
It adds value in that a failing op can now never nullify an entire batch, which was the case previously. I don't have enough context to know how often we should expect this, but generally this seems like the better design.
If we never encounter failing ops, it will have no effect on batching times; and if we do encounter them frequently, it keeps them from inflating batch failure rates.
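The filtering step described above can be sketched as follows. This is illustrative only: `filter_failing_ops` and `simulate_batch` are hypothetical names, with `simulate_batch` standing in for an `aggregate3` dry run with `allowFailure = true` that returns a per-op success flag.

```rust
/// Drop ops whose simulation fails, so a single bad op cannot revert
/// the whole batch. `simulate_batch` is an assumed stand-in for a
/// Multicall3 aggregate3 simulation with allowFailure = true.
fn filter_failing_ops<T: Clone>(
    ops: &[T],
    simulate_batch: impl Fn(&[T]) -> Vec<bool>, // per-op success flags
) -> Vec<T> {
    let results = simulate_batch(ops);
    ops.iter()
        .zip(results)
        .filter_map(|(op, ok)| ok.then(|| op.clone()))
        .collect()
}
```

Only the surviving subset is submitted on-chain, so the worst case for a bad op is its own exclusion, not a reverted batch.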
```rust
block_gas_limit: 100_000_000,
max_base_fee: 400_000_000_000, // 400 gwei
target_base_fee: 1,
backlog_threshold: 2_000,
```
Perhaps I'm lacking context, but I would make these dynamic, retrievable from a chain config in AWS AppConfig.
100% these defaults all need to be carefully considered.
Left some comments. It's a bit complicated to understand everything deeply, but I see the overall logic.
A bigger question that I have is: why not use tx-sitter for all these operations (at least the ones that send a tx onchain) instead of re-creating a similar service? That way we could simplify this code a lot and let tx-sitter (which we already have and maintain) take care of tx handling.
```rust
}

// Re-export for backwards compatibility
pub type BatcherHandle = Commands;
```
why not directly call it BatcherHandle instead of immediately creating a helper type?
```rust
/// Trait for gas policy implementations.
///
/// Allows different gas strategies to be plugged into the batcher.
pub trait GasPolicyTrait: Send + Sync + 'static {
```
is this trait needed? You're implementing it only for GasPolicy, you could probably remove it and directly use the concrete type
I think batch sizing is actually very nuanced. e.g. The policy determining batch sizes may need to change. This just makes changing the policy require minimal code changes.
Fine either way though
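For what it's worth, the swap-ability argument can be illustrated with a toy example. None of the types below are from the PR; they are hypothetical stand-ins showing that with a `GasPolicyTrait`-style trait, changing the sizing strategy is a one-line change at the construction site.

```rust
// Hypothetical trait mirroring the shape discussed above.
trait GasPolicyTrait: Send + Sync + 'static {
    fn gas_budget(&self, queue_depth: usize) -> u64;
}

// A trivial constant-budget policy.
struct FixedPolicy(u64);
impl GasPolicyTrait for FixedPolicy {
    fn gas_budget(&self, _queue_depth: usize) -> u64 {
        self.0
    }
}

// A policy that scales with the pending queue.
struct AdaptivePolicy {
    per_op_gas: u64,
}
impl GasPolicyTrait for AdaptivePolicy {
    fn gas_budget(&self, queue_depth: usize) -> u64 {
        self.per_op_gas * queue_depth as u64
    }
}

// The batcher only sees the trait object, so either policy plugs in.
fn budget_for(policy: Box<dyn GasPolicyTrait>, queue_depth: usize) -> u64 {
    policy.gas_budget(queue_depth)
}
```

With only one implementer the trait is arguably premature, but the cost of keeping it is one indirection.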
```rust
// Fee pressure is amplified by positive trend (fees rising)
// and dampened by negative trend (fees falling)
let trend_factor = 1.0 + 0.5 * chain.base_fee_trend;
```
what's the formula / reason behind this operation?
I should maybe write some better comments explaining the rationale here. But the constraint space we are working in is basically: relieve back pressure as fast as possible while not making the base fee go exponential. There are two things we need to consider: fee pressure (how expensive gas is) and queue pressure (how much pending work is backed up).
When gas is cheap and the queue is large, we target 90% block utilization. When gas is expensive and the queue is short, we target 10% utilization. The base fee trend is taken into consideration as well: if the trend is putting upward pressure on the base fee, our target utilization should scale down accordingly.
Formulas:

```
fee_pressure          = clamp((base_fee - target_base_fee) / (max_base_fee - target_base_fee), 0, 1)
queue_pressure        = clamp(queue_depth / backlog_threshold, 0, 1)
trend_factor          = 1.0 + 0.5 * base_fee_trend
adjusted_fee_pressure = clamp(fee_pressure * trend_factor, 0, 1)
net_pressure          = adjusted_fee_pressure - queue_pressure      (range -1 to 1)
target_utilization    = clamp(0.5 - 0.4 * net_pressure, 0.1, 0.9)
gas_budget            = block_gas_limit * target_utilization
```
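The formulas above translate directly into code. A minimal sketch in Rust, assuming the default config values quoted earlier in this thread (struct and field names here are illustrative, not the PR's actual types):

```rust
// Hypothetical snapshot of chain state fed into the policy.
#[derive(Clone, Copy)]
struct ChainSnapshot {
    base_fee: f64,       // current base fee, in wei
    base_fee_trend: f64, // normalized trend, roughly -1.0..=1.0
    queue_depth: f64,    // pending operations in the backlog
}

// Hypothetical policy configuration mirroring the quoted defaults.
#[derive(Clone, Copy)]
struct PolicyConfig {
    block_gas_limit: f64,
    max_base_fee: f64,
    target_base_fee: f64,
    backlog_threshold: f64,
}

fn gas_budget(cfg: &PolicyConfig, chain: &ChainSnapshot) -> f64 {
    // Fee pressure: 0 at/below target base fee, 1 at/above max base fee.
    let fee_pressure = ((chain.base_fee - cfg.target_base_fee)
        / (cfg.max_base_fee - cfg.target_base_fee))
        .clamp(0.0, 1.0);
    // Queue pressure: 0 when empty, 1 at/over the backlog threshold.
    let queue_pressure = (chain.queue_depth / cfg.backlog_threshold).clamp(0.0, 1.0);
    // Rising fees amplify fee pressure; falling fees dampen it.
    let trend_factor = 1.0 + 0.5 * chain.base_fee_trend;
    let adjusted_fee_pressure = (fee_pressure * trend_factor).clamp(0.0, 1.0);
    // Positive net pressure -> back off; negative -> catch up on the queue.
    let net_pressure = adjusted_fee_pressure - queue_pressure;
    // Target utilization lands in [0.1, 0.9] per the clamps above.
    let target_utilization = (0.5 - 0.4 * net_pressure).clamp(0.1, 0.9);
    cfg.block_gas_limit * target_utilization
}
```

With cheap gas and a full queue this yields 90% of the block gas limit; with gas at the max and an empty queue it yields 10%.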
```rust
    }
}

/// Submit createManyAccounts with retry.
```
I don't see the retry here
I think this is a valid take. I was simply increasing the robustness of the current implementation, but outsourcing transaction mining seems like a valid strategy. Generally my concern is that we are going to make the base fee go exponential without intelligent batch sizing, which this PR attempts to fix. Transaction mining, however, can be outsourced to tx-sitter; you are correct. I would consider this out of scope for this PR though.
This PR is stacked on #210, and addresses #183
Merges the create handle and ops batcher into a single batcher which prioritizes account creations. This is important in times when throughput (potentially) exceeds chain capacity.
Makes the batcher aware of World Chain's gas conditions when determining batch sizes. Also updates batch sizing to be calculated in terms of gas per individual operation instead of just a general max-operations cap.
Automatically retries failed batches at transaction submission until all ops are determined to be invalid, or the valid operations are finally submitted.
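The retry flow described above can be sketched roughly as follows. All names here are hypothetical, and the real implementation presumably distinguishes transient RPC errors from permanent ones; this sketch only shows the prune-and-retry shape: keep pruning ops that simulate as invalid, and retry submission until everything is invalid or the remaining valid ops go through.

```rust
/// Retry submission of a batch. Returns the ops that were submitted,
/// or None if every op was invalid or attempts were exhausted.
/// `is_valid` stands in for per-op simulation; `submit` for the
/// transaction submission, where Err models a transient RPC failure.
fn submit_with_retry<T: Clone>(
    mut ops: Vec<T>,
    is_valid: impl Fn(&T) -> bool,
    submit: impl Fn(&[T]) -> Result<(), ()>,
    max_attempts: usize,
) -> Option<Vec<T>> {
    for _ in 0..max_attempts {
        // Prune ops that simulate as failing before each attempt.
        ops.retain(|op| is_valid(op));
        if ops.is_empty() {
            return None; // everything was invalid; nothing to submit
        }
        if submit(&ops).is_ok() {
            return Some(ops); // valid ops finally submitted
        }
        // Submission failed transiently; loop and try again.
    }
    None
}
```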
Ops here means operations that are not account creations. Minor refactor and cleaner abstractions over request validation and request ID generation.