[6/N][AMI BuildKit Cache] Add network tuning for high-throughput operations #267
base: main
Conversation
Optimizes network and Docker daemon settings in custom AMIs for high-throughput ECR operations. These settings help maximize network utilization during cache import/export and registry push/pull operations.

We configure BBR congestion control, which handles sustained ECR transfers better than the default congestion control. TCP buffers are increased to 16 MB to accommodate the higher throughput, and tcp_slow_start_after_idle is disabled since CI builds make frequent connections to ECR. The Docker daemon is configured with max-concurrent-downloads/uploads=16 to parallelize registry operations.

During testing, cache export time dropped from 162.7s to 118.0s with these network optimizations combined with 1000 MB/s EBS throughput. We also tested 2000 MB/s EBS throughput but saw diminishing returns: the bottleneck shifted to the ECR upload rate (~60-100 MB/s) and zstd compression (CPU-bound), so 1000 MB/s is the sweet spot.

Signed-off-by: Amr Mahdi <amrmahdi@meta.com>
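For readers skimming the diff, here is a minimal sketch of the kind of sysctl file described above. The setting names are standard Linux sysctls, but the exact values and file contents are assumptions based on this description; packer/cpu/scripts/configure-network.sh in this PR is authoritative.

```bash
# Sketch only: assumed contents of /etc/sysctl.d/99-vllm-network.conf based on the PR description
cat <<'EOF' | sudo tee /etc/sysctl.d/99-vllm-network.conf
# BBR congestion control for sustained ECR transfers; fq qdisc is typically paired with BBR
net.core.default_qdisc = fq
net.ipv4.tcp_congestion_control = bbr
# 16 MB socket buffers for high-throughput transfers
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
# CI builds reconnect to ECR frequently; keep the congestion window warm between transfers
net.ipv4.tcp_slow_start_after_idle = 0
EOF
sudo sysctl --system
```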
Pull request overview
This PR adds network optimization configurations to custom AMIs for high-throughput ECR operations during Docker builds. The changes configure BBR congestion control, increase TCP buffer sizes to 16MB, disable tcp_slow_start_after_idle, and set Docker daemon's concurrent downloads/uploads to 16. Testing showed cache export time improvements from 162.7s to 118.0s with these optimizations.
Key changes:
- Added network tuning script with sysctl configurations and Docker daemon settings
- Integrated the network configuration script into the Packer AMI build process
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| packer/cpu/scripts/configure-network.sh | New script that configures network sysctl settings (BBR, TCP buffers) and Docker daemon concurrency settings for high-throughput operations |
| packer/cpu/buildkite-cpu-ami.pkr.hcl | Added provisioner to execute the network configuration script early in the AMI build process |
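To sanity-check an instance launched from the resulting AMI, a minimal verification sketch follows; these commands are generic and assumed, not part of the PR itself.

```bash
# Confirm the sysctl tuning is active on the booted instance
sysctl net.ipv4.tcp_congestion_control net.ipv4.tcp_slow_start_after_idle net.core.rmem_max
# Confirm the merged Docker daemon configuration
sudo jq . /etc/docker/daemon.json
```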
    # Merge with existing config
    sudo jq '. + {"max-concurrent-downloads": 16, "max-concurrent-uploads": 16}' /etc/docker/daemon.json | sudo tee /etc/docker/daemon.json.tmp
    sudo mv /etc/docker/daemon.json.tmp /etc/docker/daemon.json
Copilot AI · Jan 10, 2026
The jq command reads /etc/docker/daemon.json and writes its output to a temporary file via tee, but if jq fails (e.g., because the original file contains invalid JSON), the pipeline still creates the temporary file and the unconditional mv can then replace daemon.json with empty or partial output. Consider adding error handling so the update is atomic and the temporary file is cleaned up on failure.
Suggested change (replacing the lines quoted above):

    # Merge with existing config using a temporary file for atomic update
    tmpfile="$(mktemp /tmp/daemon.json.XXXXXX)"
    trap 'rm -f "$tmpfile"' EXIT
    if sudo jq '. + {"max-concurrent-downloads": 16, "max-concurrent-uploads": 16}' /etc/docker/daemon.json >"$tmpfile"; then
        sudo mv "$tmpfile" /etc/docker/daemon.json
    fi
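A further hedged note on this suggestion (not raised by Copilot): mktemp creates the temporary file owned by the invoking user, and mv preserves that ownership, so /etc/docker/daemon.json could end up owned by a non-root user. A variant that resets owner and mode explicitly might look like:

```bash
# Hypothetical variant: install copies the file and sets owner/group/mode explicitly,
# instead of mv preserving whatever ownership mktemp created.
tmpfile="$(mktemp /tmp/daemon.json.XXXXXX)"
trap 'rm -f "$tmpfile"' EXIT
if sudo jq '. + {"max-concurrent-downloads": 16, "max-concurrent-uploads": 16}' /etc/docker/daemon.json >"$tmpfile"; then
    sudo install -m 0644 -o root -g root "$tmpfile" /etc/docker/daemon.json
fi
```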
    cat <<'EOF' | sudo tee /etc/sysctl.d/99-vllm-network.conf
    # Network tuning for high-throughput Docker builds
    # Reference: https://docs.aws.amazon.com/datatransferterminal/latest/userguide/tech-requirements.html
Copilot AI · Jan 10, 2026
The comment mentions 'Reference: https://docs.aws.amazon.com/datatransferterminal/latest/userguide/tech-requirements.html' but this URL appears to be for AWS Data Transfer Terminal, which may not be the most relevant reference for general Docker/ECR network tuning. Consider updating the reference to a more appropriate AWS documentation page for network optimization or removing it if not directly relevant.
Referenced line:

    # Reference: https://docs.aws.amazon.com/datatransferterminal/latest/userguide/tech-requirements.html
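One related practical note on the BBR setting written by this snippet (an assumption about the AMI's kernel, not something this PR verifies): BBR needs kernel support, typically via the tcp_bbr module on 4.9+ kernels, and is usually paired with the fq qdisc. A quick check during the Packer build might look like:

```bash
# Confirm BBR is available on the build host (sketch; not part of this PR)
sudo modprobe tcp_bbr || true
sysctl net.ipv4.tcp_available_congestion_control
```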