feat: more production ready gateway batcher design #227

Open
0xOsiris wants to merge 23 commits into main from osiris/gateway-batcher

Conversation


@0xOsiris (Contributor) commented Jan 19, 2026

This PR is stacked on #210 and addresses #183.

  • Merges the create handler and ops batcher into a single batcher that prioritizes account creations (see the sketch after this list). This is important in times when throughput potentially exceeds chain capacity.

  • Makes the batcher aware of World Chain's gas conditions when determining batch sizes. Also updates batch sizing to be calculated in terms of gas per individual operation instead of just a general max operation count.

  • Automatically retries failed batches at transaction submission until all ops are determined to be invalid or the valid operations are finally submitted.

    • A failed batch should never cause a valid operation to fail with an invalid status. Right now I've only addressed this for ops, not account creations.
  • Minor refactors and cleaner abstractions over request validation and request ID generation.
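
A minimal sketch of the merged, creation-prioritizing batcher described above; the `Batcher`/`QueuedOp` names and the two-queue layout are assumptions for illustration, not the PR's actual types:

```rust
use std::collections::VecDeque;

struct QueuedOp {
    gas: u64, // estimated gas for this individual operation
}

struct Batcher {
    creations: VecDeque<QueuedOp>,
    ops: VecDeque<QueuedOp>,
}

impl Batcher {
    /// Build the next batch: account creations are drained first, then
    /// general ops, until the per-batch gas budget is exhausted.
    fn next_batch(&mut self, mut gas_budget: u64) -> Vec<QueuedOp> {
        let mut batch = Vec::new();
        for queue in [&mut self.creations, &mut self.ops] {
            while let Some(op) = queue.front() {
                if op.gas > gas_budget {
                    break;
                }
                gas_budget -= op.gas;
                batch.push(queue.pop_front().unwrap());
            }
        }
        batch
    }
}
```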


0xOsiris changed the title from "Osiris/gateway batcher" to "feat: more production ready gateway batcher design" on Jan 19, 2026
0xOsiris marked this pull request as draft on January 19, 2026 21:12

Copilot AI left a comment


Pull request overview

This pull request significantly refactors the World ID Gateway's batching architecture to make it more production-ready. The changes introduce an event-driven system with a global EventMultiplexer, adaptive gas-based batch sizing, operation retry logic, and improved monitoring capabilities.

Changes:

  • Introduces an EventMultiplexer pattern for decoupled event handling with metrics, logging, and status tracking (see the sketch after this list)
  • Implements adaptive batch sizing based on chain conditions (gas policy and chain monitor)
  • Adds retry logic for operations that fail due to transient RPC errors
  • Refactors configuration to use a shared common crate for provider management
  • Adds comprehensive load testing infrastructure
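
For context, an event multiplexer of the shape the overview describes can be sketched with a tokio broadcast channel fanned out to independent subscribers (metrics, logging, status tracking); the names below are illustrative, not the PR's actual API:

```rust
use tokio::sync::broadcast;

#[derive(Clone, Debug)]
enum BatchEvent {
    Queued { id: u64 },
    Submitted { id: u64, tx_hash: [u8; 32] },
    Failed { id: u64, reason: String },
}

struct EventMultiplexer {
    tx: broadcast::Sender<BatchEvent>,
}

impl EventMultiplexer {
    fn new(capacity: usize) -> Self {
        let (tx, _) = broadcast::channel(capacity);
        Self { tx }
    }

    /// Each handler (metrics, logging, status tracking) gets its own
    /// receiver, so handlers stay decoupled from producers.
    fn subscribe(&self) -> broadcast::Receiver<BatchEvent> {
        self.tx.subscribe()
    }

    fn publish(&self, event: BatchEvent) {
        // Ignore the error case where no subscribers are listening yet.
        let _ = self.tx.send(event);
    }
}
```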

Reviewed changes

Copilot reviewed 37 out of 38 changed files in this pull request and generated 5 comments.

Summary per file:

| File | Description |
| --- | --- |
| services/gateway/src/batcher/ | New unified batching system with gas policy, chain monitoring, and event-driven architecture |
| services/gateway/src/routes/middleware.rs | New validation and simulation middleware for operation routes |
| services/gateway/src/types.rs | Refactored app state to use EventMultiplexer and registry instance |
| services/gateway/src/routes.rs | Updated to build event bus and ops batcher with new architecture |
| services/common/ | New shared crate for provider configuration and management |
| services/gateway/tests/load_test.rs | Comprehensive load testing suite |
| services/gateway/Cargo.toml | Added dependencies for metrics, OTLP, and event handling |



Copilot AI commented Jan 19, 2026

@0xOsiris I've opened a new pull request, #228, to work on those changes. Once the pull request is ready, I'll request review from you.

0xOsiris force-pushed the osiris/gateway-batcher branch from d5e3d23 to fd667f6 on January 21, 2026 16:19
0xOsiris marked this pull request as ready for review on January 22, 2026 23:48
```rust
let mc = Multicall3::new(MULTICALL3, provider.clone());

loop {
    // 1. Simulate with allowFailure=true to identify failing ops
```
@aurel-fr commented Jan 23, 2026

Does this add value? How often do we expect failing ops?

@0xOsiris (Author) replied:

It adds value in that a failing op can now never nullify an entire batch, which was the case previously. I don't have enough context to know how often we do or do not expect this, but generally this seems like the better design.

If we never encounter this, then it will have no effect on batching times, and if we do encounter failed ops frequently, it will never affect failure rates.
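
A minimal sketch of the simulate-then-filter loop being discussed, with `simulate_batch`/`submit_batch` standing in for the PR's Multicall3 calls (all names here are placeholders):

```rust
struct Op;

enum SubmitError {
    Transient,
    Fatal,
}

// Placeholder for the Multicall3 simulation with allowFailure = true:
// returns one success flag per op.
async fn simulate_batch(ops: &[Op]) -> Vec<bool> {
    vec![true; ops.len()]
}

// Placeholder for the actual batch submission.
async fn submit_batch(_ops: &[Op]) -> Result<(), SubmitError> {
    Ok(())
}

async fn drain_batch(mut ops: Vec<Op>) {
    while !ops.is_empty() {
        // 1. Simulate every op with allowFailure = true.
        let results = simulate_batch(&ops).await;
        // 2. Keep only the ops that simulated successfully; failing ops are
        //    dropped individually instead of nullifying the whole batch.
        ops = ops
            .into_iter()
            .zip(results)
            .filter_map(|(op, ok)| ok.then_some(op))
            .collect();
        if ops.is_empty() {
            return; // every remaining op was invalid
        }
        // 3. Submit the survivors; transient errors loop back to step 1.
        match submit_batch(&ops).await {
            Ok(()) => return,
            Err(SubmitError::Transient) => continue,
            Err(SubmitError::Fatal) => return,
        }
    }
}
```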

Comment on lines 27 to 30:

```rust
block_gas_limit: 100_000_000,
max_base_fee: 400_000_000_000, // 400 gwei
target_base_fee: 1,
backlog_threshold: 2_000,
```
@aurel-fr commented Jan 23, 2026

Perhaps I'm lacking context, but I would make these dynamic, retrievable from a chain config in AWS AppConfig.

@0xOsiris (Author) replied:

100% these defaults all need to be carefully considered.
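
One way such a dynamic config could be shaped, assuming a JSON document fetched at startup (field names mirror the snippet above; the AppConfig fetch itself is omitted):

```rust
use serde::Deserialize;

/// Chain-specific batcher limits, loaded at runtime instead of hard-coded.
#[derive(Deserialize)]
struct ChainConfig {
    block_gas_limit: u64,
    max_base_fee: u64,
    target_base_fee: u64,
    backlog_threshold: usize,
}

/// Parse a config document fetched from the config store (e.g. AppConfig).
fn load_chain_config(raw: &str) -> Result<ChainConfig, serde_json::Error> {
    serde_json::from_str(raw)
}
```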

@alessandromazza98 (Contributor) left a comment:

Left some comments. It's a bit complicated to understand everything deeply, but I see the overall logic.

A bigger question that I have is: why not use tx-sitter for all these operations (at least the ones related to sending a tx onchain) instead of re-creating a similar service? This way we could simplify this code a lot and let tx-sitter (which we already have and maintain) take care of tx handling.

```rust
}

// Re-export for backwards compatibility
pub type BatcherHandle = Commands;
```

Why not directly call it BatcherHandle instead of immediately creating a helper type?

```rust
/// Trait for gas policy implementations.
///
/// Allows different gas strategies to be plugged into the batcher.
pub trait GasPolicyTrait: Send + Sync + 'static {
```

Is this trait needed? You're implementing it only for GasPolicy; you could probably remove it and directly use the concrete type.

@0xOsiris (Author) replied Jan 23, 2026:

I think batch sizing is actually very nuanced, e.g. the policy determining batch sizes may need to change. This just makes changing the policy require minimal code changes (see the sketch below).

Fine either way though.
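
A minimal sketch of the pluggability argument, assuming a single `gas_budget` method on the trait (the method and the `FixedPolicy`/`Batcher` wiring are hypothetical):

```rust
pub trait GasPolicyTrait: Send + Sync + 'static {
    fn gas_budget(&self) -> u64;
}

/// Trivial alternative strategy: a flat per-batch budget.
struct FixedPolicy(u64);

impl GasPolicyTrait for FixedPolicy {
    fn gas_budget(&self) -> u64 {
        self.0
    }
}

/// The batcher is generic over the policy, so swapping strategies is a
/// change at the construction site only.
struct Batcher<P: GasPolicyTrait> {
    policy: P,
}

impl<P: GasPolicyTrait> Batcher<P> {
    fn next_budget(&self) -> u64 {
        self.policy.gas_budget()
    }
}
```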


```rust
// Fee pressure is amplified by positive trend (fees rising)
// and dampened by negative trend (fees falling)
let trend_factor = 1.0 + 0.5 * chain.base_fee_trend;
```

What's the formula / reasoning behind this operation?

@0xOsiris (Author) replied:

I should maybe write some better comments explaining the rationale here. But the constraint space we're working in is basically: relieve back pressure as fast as possible while not making the base fee go exponential. There are two things we need to consider: fee pressure (how expensive gas is) and queue pressure (how much pending work is backed up).

When gas is cheap and the queue is large, we target 90% block utilization. When gas is expensive and the queue is short, we target 10% utilization. The trend of the base fee is taken into consideration as well: if the trend is putting upward pressure on the base fee, our target utilization scales down accordingly.

Formulas:

```text
fee_pressure          = (base_fee - target_base_fee) / (max_base_fee - target_base_fee)  [clamped 0-1]
queue_pressure        = queue_depth / backlog_threshold                                  [clamped 0-1]
trend_factor          = 1.0 + (0.5 × base_fee_trend)
adjusted_fee_pressure = fee_pressure × trend_factor                                      [clamped 0-1]
net_pressure          = adjusted_fee_pressure - queue_pressure                           [range -1 to 1]
target_utilization    = 0.5 - (0.4 × net_pressure)                                       [clamped 0.1-0.9]
gas_budget            = block_gas_limit × target_utilization
```
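
Putting the formulas above into code, a sketch could look like this (struct and field names are assumptions, not the PR's types):

```rust
struct GasPolicy {
    block_gas_limit: u64,
    max_base_fee: u64,
    target_base_fee: u64,
    backlog_threshold: usize,
}

struct ChainState {
    base_fee: u64,
    base_fee_trend: f64, // normalized slope, assumed in [-1, 1]
    queue_depth: usize,
}

impl GasPolicy {
    fn gas_budget(&self, chain: &ChainState) -> u64 {
        // fee_pressure: how close the base fee is to the configured maximum.
        let fee_pressure = (chain.base_fee.saturating_sub(self.target_base_fee) as f64
            / (self.max_base_fee - self.target_base_fee) as f64)
            .clamp(0.0, 1.0);
        // queue_pressure: how much pending work is backed up.
        let queue_pressure =
            (chain.queue_depth as f64 / self.backlog_threshold as f64).clamp(0.0, 1.0);
        // Rising fees amplify fee pressure; falling fees dampen it.
        let trend_factor = 1.0 + 0.5 * chain.base_fee_trend;
        let adjusted_fee_pressure = (fee_pressure * trend_factor).clamp(0.0, 1.0);
        // net_pressure > 0 -> shrink batches; < 0 -> grow them.
        let net_pressure = adjusted_fee_pressure - queue_pressure;
        let target_utilization = (0.5 - 0.4 * net_pressure).clamp(0.1, 0.9);
        (self.block_gas_limit as f64 * target_utilization) as u64
    }
}
```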

```rust
}
}

/// Submit createManyAccounts with retry.
```

I don't see the retry here
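
For illustration, a generic submit-with-retry wrapper along the lines the doc comment promises could look like the following; the helper name, signature, and backoff values are assumptions, not the PR's code:

```rust
use std::time::Duration;

/// Retry a fallible async submission with exponential backoff,
/// surfacing the error from the final attempt.
async fn submit_with_retry<F, Fut, T, E>(mut attempt: F, max_retries: u32) -> Result<T, E>
where
    F: FnMut() -> Fut,
    Fut: std::future::Future<Output = Result<T, E>>,
{
    let mut backoff = Duration::from_millis(250);
    for _ in 0..max_retries {
        match attempt().await {
            Ok(v) => return Ok(v),
            Err(_) => {
                // Back off before the next attempt.
                tokio::time::sleep(backoff).await;
                backoff *= 2;
            }
        }
    }
    // Final attempt: its error propagates to the caller.
    attempt().await
}
```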

@0xOsiris (Author) replied:
> Left some comments. It's a bit complicated to understand everything deeply, but I see the overall logic.
>
> A bigger question that I have is: why not use tx-sitter for all these operations (at least the ones related to sending a tx onchain) instead of re-creating a similar service? This way we could simplify this code a lot and let tx-sitter (which we already have and maintain) take care of tx handling.

I think this is a valid take. I was simply increasing the robustness of the current implementation, but outsourcing transaction mining seems like a valid strategy. Generally my concern is that without intelligent batch sizing we are going to make the base fee go exponential, which this PR attempts to fix. Transaction mining, however, can be outsourced to tx-sitter; you are correct. I would consider that out of scope for this PR, though.
