
Conversation

@amrmahdi
Collaborator

Optimizes network and Docker daemon settings in custom AMIs for high-throughput ECR operations. These settings help maximize network utilization during cache import/export and registry push/pull operations.

We configure BBR congestion control, which sustains higher throughput on long-lived ECR transfers than the default (CUBIC). TCP buffers are increased to 16MB to accommodate the higher throughput, and tcp_slow_start_after_idle is disabled so that connections reused after an idle period do not fall back to slow start, which matters because CI builds make frequent requests to ECR. The Docker daemon is configured with max-concurrent-downloads/uploads=16 to parallelize registry operations.
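
For reference, a minimal sketch of the resulting configuration, using only the knobs named above; the exact buffer triplets are assumptions beyond the 16MB figure described here:

# /etc/sysctl.d/99-vllm-network.conf (sketch)
net.ipv4.tcp_congestion_control = bbr       # BBR instead of the default CUBIC
net.core.rmem_max = 16777216                # 16MB receive buffer ceiling
net.core.wmem_max = 16777216                # 16MB send buffer ceiling
net.ipv4.tcp_rmem = 4096 87380 16777216     # min/default/max receive buffer (min/default assumed)
net.ipv4.tcp_wmem = 4096 65536 16777216     # min/default/max send buffer (min/default assumed)
net.ipv4.tcp_slow_start_after_idle = 0      # keep the congestion window after idle

# Keys merged into /etc/docker/daemon.json by the script
# {"max-concurrent-downloads": 16, "max-concurrent-uploads": 16}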

During testing, we found that cache export time dropped from 162.7s to 118.0s with these network optimizations combined with 1000 MB/s of EBS throughput. We also tested 2000 MB/s EBS throughput but saw diminishing returns: the bottleneck shifted to the ECR upload rate (~60-100 MB/s) and zstd compression (CPU-bound), so 1000 MB/s is the sweet spot.
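
On a booted instance, the effect of both pieces can be sanity-checked with something like the following (a sketch; jq is assumed to be available, as the script already relies on it):

sysctl net.ipv4.tcp_congestion_control net.core.rmem_max net.core.wmem_max net.ipv4.tcp_slow_start_after_idle
sudo jq '."max-concurrent-downloads", ."max-concurrent-uploads"' /etc/docker/daemon.json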

Copilot AI left a comment

Pull request overview

This PR adds network optimization configurations to custom AMIs for high-throughput ECR operations during Docker builds. The changes configure BBR congestion control, increase TCP buffer sizes to 16MB, disable tcp_slow_start_after_idle, and set Docker daemon's concurrent downloads/uploads to 16. Testing showed cache export time improvements from 162.7s to 118.0s with these optimizations.

Key changes:

  • Added network tuning script with sysctl configurations and Docker daemon settings
  • Integrated the network configuration script into the Packer AMI build process

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.

File | Description
packer/cpu/scripts/configure-network.sh | New script that configures network sysctl settings (BBR, TCP buffers) and Docker daemon concurrency settings for high-throughput operations
packer/cpu/buildkite-cpu-ami.pkr.hcl | Added provisioner to execute the network configuration script early in the AMI build process

Comment on lines +38 to +40
# Merge with existing config
sudo jq '. + {"max-concurrent-downloads": 16, "max-concurrent-uploads": 16}' /etc/docker/daemon.json | sudo tee /etc/docker/daemon.json.tmp
sudo mv /etc/docker/daemon.json.tmp /etc/docker/daemon.json

Copilot AI Jan 10, 2026

The jq command reads from /etc/docker/daemon.json and writes output to a temporary file, but if jq fails for any reason (e.g., invalid JSON in the original file), the temporary file might be left in place or the original file could be left in an inconsistent state. Consider adding error handling to ensure atomic updates and cleanup.

Suggested change
-# Merge with existing config
-sudo jq '. + {"max-concurrent-downloads": 16, "max-concurrent-uploads": 16}' /etc/docker/daemon.json | sudo tee /etc/docker/daemon.json.tmp
-sudo mv /etc/docker/daemon.json.tmp /etc/docker/daemon.json
+# Merge with existing config using a temporary file for atomic update
+tmpfile="$(mktemp /tmp/daemon.json.XXXXXX)"
+trap 'rm -f "$tmpfile"' EXIT
+if sudo jq '. + {"max-concurrent-downloads": 16, "max-concurrent-uploads": 16}' /etc/docker/daemon.json >"$tmpfile"; then
+    sudo mv "$tmpfile" /etc/docker/daemon.json
+fi

cat <<'EOF' | sudo tee /etc/sysctl.d/99-vllm-network.conf
# Network tuning for high-throughput Docker builds
# Reference: https://docs.aws.amazon.com/datatransferterminal/latest/userguide/tech-requirements.html

Copilot AI Jan 10, 2026

The comment mentions 'Reference: https://docs.aws.amazon.com/datatransferterminal/latest/userguide/tech-requirements.html' but this URL appears to be for AWS Data Transfer Terminal, which may not be the most relevant reference for general Docker/ECR network tuning. Consider updating the reference to a more appropriate AWS documentation page for network optimization or removing it if not directly relevant.

Suggested change
-# Reference: https://docs.aws.amazon.com/datatransferterminal/latest/userguide/tech-requirements.html
