Add warm pool with new ASG lifecycle hooks #966

@keithduncan keithduncan commented Nov 25, 2021

Builds on the lifecycle hooks added in #964 and incorporates the warm pool configuration from #838.

Makes the boothook lifecycle hook conditional on warm pool being enabled. Without warm pool, instances once again always start the agent on instance boot. With warm pool enabled, we delay agent start until the boothook fires when the instance moves to InService.

This also changes how we parse the InstanceType parameter and adds an InstanceTypes count parameter. Warm pool is incompatible with MixedInstanceTypes, so we need to be able to switch the Auto Scaling group resource between a static LaunchTemplate reference and a MixedInstanceTypes specification. Warm pool only supports a single, on-demand instance type; using either multiple instance types or spot instances precludes the use of an ASG warm pool. The count parameter is copied from the AWS VPC Quick Start, which uses a similar pattern.
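
The switching logic described above can be sketched in Python. This is only an illustration of the decision, not the template's actual implementation (which lives in CloudFormation conditions and rules). The `ScalingOptions` fields, `on_demand_percentage`, and the function names are hypothetical; `MixedInstancesPolicy` is the CloudFormation property name for a mixed instance specification on an Auto Scaling group.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ScalingOptions:
    # All field names are hypothetical stand-ins for stack parameters.
    instance_type: str         # comma-separated list, e.g. "t3.large,m5.large"
    on_demand_percentage: int  # 100 means no spot instances
    enable_warm_pool: bool


def parse_instance_types(value: str) -> List[str]:
    """Split a comma-separated parameter into trimmed, non-empty type names."""
    return [t.strip() for t in value.split(",") if t.strip()]


def launch_strategy(opts: ScalingOptions) -> str:
    """Pick the ASG launch specification, rejecting incompatible options."""
    types = parse_instance_types(opts.instance_type)
    if not types:
        raise ValueError("at least one instance type is required")

    # Multiple instance types or any spot capacity needs a mixed specification.
    needs_mixed = len(types) > 1 or opts.on_demand_percentage < 100

    if opts.enable_warm_pool and needs_mixed:
        raise ValueError(
            "warm pool requires a single, on-demand instance type"
        )
    return "MixedInstancesPolicy" if needs_mixed else "LaunchTemplate"
```

For example, `launch_strategy(ScalingOptions("t3.large", 100, True))` selects the plain `LaunchTemplate` reference, while enabling warm pool together with two instance types raises an error, mirroring what the template Rule is meant to block.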


TODO

  • Fix the template Rule added to prevent incompatible options across spot, multiple instance types, and warm pool
  • Find a wait command for Windows instances that the BootHookWarmedAutomation can use to wait for ec2config, or decide that Windows instances don’t get Warm Pool support and block it using a CloudFormation template Rule 😢
  • Confirm how the NVMe InstanceStorage setting interoperates with 'stopped' EC2 instances. Since the warm pool instances are stopped and re-started, they will come up on new hardware without initialised NVMe drives. If this is irreconcilable, these might be incompatible options which should be disabled using a template Rules entry.
  • Will likely need a tunable parameter for the BootHook HeartbeatTimeout property, which is currently fixed at 5 minutes for Linux and 10 minutes for Windows. At present the same launch hook handles Warm Pool launch events, direct to ASG launch events, and Warm Pool to ASG launch events, so they all have the same timeout applied. We likely want to apply a distinct timeout to each type of lifecycle transition: launched into the warm pool, the timeout should be long enough for the UserData script to run to completion; launched directly into the ASG, long enough for the UserData script to run and the agent to start; transitioned from the Warm Pool into the ASG, just long enough to start the agent.
  • Warm Pool booted instances will cache the buildkite-agent token from the SSM parameter store. If you roll the token, you should also perform an Instance Refresh to ensure that none of the warm pool instances are sitting there with a stale token. Ideally the agent would fetch the token live and the unencrypted token wouldn’t be written to disk, but that is how it works today. There may be some systemd environment hook where we could fetch the token live at process start time, so that a new token is fetched whenever the process exits and restarts.
  • Possibly change the warm pool configuration from a min size to a max size. By default the warm pool will keep the ASG max size number of instances around and stopped, which could be excessive for stacks with a large MaxSize. Capping the max size rather than providing a floor for the min size seems more useful and strikes a balance between latency-sensitive customers and cost-sensitive customers.
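
To make the min size versus max size trade-off in the last item concrete, here is a rough Python model of how the warm pool target size behaves. It is a simplified sketch based on my reading of the AWS documentation (default size is the group's max size minus its desired capacity, optionally capped by `MaxGroupPreparedCapacity`, with the warm pool `MinSize` acting as a floor). The function and argument names are hypothetical and not part of this stack.

```python
from typing import Optional


def warm_pool_size(
    asg_max_size: int,
    desired_capacity: int,
    pool_min_size: int = 0,
    max_group_prepared_capacity: Optional[int] = None,
) -> int:
    """Approximate the number of instances kept stopped in the warm pool."""
    # Without a cap, the pool is sized against the ASG's max size.
    ceiling = (
        asg_max_size
        if max_group_prepared_capacity is None
        else max_group_prepared_capacity
    )
    # The pool never goes below its configured floor, or below zero.
    return max(pool_min_size, ceiling - desired_capacity, 0)


# A stack with MaxSize 100 and 10 running instances:
uncapped = warm_pool_size(100, 10)                                # 90 stopped instances
capped = warm_pool_size(100, 10, max_group_prepared_capacity=20)  # 10 stopped instances
```

Under this model a floor alone does nothing to shrink the uncapped pool, which is why a cap looks like the more useful knob for large MaxSize stacks.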

@keithduncan keithduncan force-pushed the keithduncan/add-asg-lifecycle-warm-pool branch from 42c4ed6 to d898596 Compare November 26, 2021 03:36
@keithduncan keithduncan force-pushed the keithduncan/add-asg-lifecycle-warm-pool branch from 1f26405 to 2d63aa1 Compare November 26, 2021 03:54

keithduncan commented Nov 30, 2021

[Screenshot: 2021-11-30 at 15:48:01]

This is looking pretty good with the exception of the above noted todos.

The instances take about 40s to transition from stopped to InService and running. This isn’t a massive improvement, but if you have a long-running BootstrapScriptUrl script then it could be.

@joeljeske

Is there any progress on this? This feature would be extremely valuable.

We have a very long-running startup script (~30 mins) and currently do not have a good strategy for mitigating this boot-up time, except for setting a large ScaleOutFactor and a large ScaleInIdlePeriod, which is also relatively expensive.
