Runner Pool #3629

yhakbar · 2024-12-05T19:07:37Z

Summary

Change the underlying concurrency model of Terragrunt so that a pool of runners is used instead of run groups.

Add Units to the pool of runners when they are ready.

When all Units have completed their runs, end the run.

Motivation

Terragrunt currently runs Units in a concurrency model where Units are grouped based on their dependencies, and the Units in a group run in parallel once no group they depend on is still pending.

e.g.

$ terragrunt run-all plan
14:09:52.456 INFO   The stack at . will be processed in the following order for command plan:
Group 1
- Module ./unit-a
- Module ./unit-b

Group 2
- Module ./unit-depends-on-unit-a
- Module ./unit-depends-on-unit-b

This is a simple concurrency model, and is easy to display in logs.

A single failing Unit can cause its entire group, and every dependent group, to fail, so one failure can cascade into widespread failure for a Stack.

In addition, time is wasted during a run, because a group only executes once every group it depends on has finished. A group dependent on another group starts only when the slowest Unit in that dependency group completes. In the example below, ./unit-depends-on-fast-unit cannot start until ./slow-unit finishes, even though it does not depend on it.

$ terragrunt run-all plan
14:09:52.456 INFO   The stack at . will be processed in the following order for command plan:
Group 1
- Module ./slow-unit
- Module ./fast-unit

Group 2
- Module ./unit-depends-on-slow-unit
- Module ./unit-depends-on-fast-unit
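
To make the wasted time concrete, here is a minimal Go sketch that compares finish times for the four Units above under both models. The durations (8s, 2s, and 3s) are invented for illustration, and the names Unit, groupFinish, and poolFinish are hypothetical. It assumes unlimited concurrency, an acyclic graph, and that groups are formed by dependency depth, which may not match Terragrunt's actual grouping exactly.

```go
package main

import "fmt"

// Unit holds a hypothetical run duration in seconds and its dependencies.
type Unit struct {
	Seconds int
	Deps    []string
}

// poolFinish returns when each Unit finishes if it starts as soon as its own
// dependencies are done (unlimited concurrency, acyclic graph assumed).
func poolFinish(units map[string]Unit) map[string]int {
	finish := map[string]int{}
	var visit func(name string) int
	visit = func(name string) int {
		if t, ok := finish[name]; ok {
			return t
		}
		start := 0
		for _, dep := range units[name].Deps {
			start = max(start, visit(dep)) // builtin max needs Go 1.21+
		}
		finish[name] = start + units[name].Seconds
		return finish[name]
	}
	for name := range units {
		visit(name)
	}
	return finish
}

// groupFinish returns when each Unit finishes if Units are batched into groups
// by dependency depth and a group starts only after the previous group ends.
func groupFinish(units map[string]Unit) map[string]int {
	depth := map[string]int{}
	var visit func(name string) int
	visit = func(name string) int {
		if d, ok := depth[name]; ok {
			return d
		}
		d := 0
		for _, dep := range units[name].Deps {
			d = max(d, visit(dep)+1)
		}
		depth[name] = d
		return d
	}
	longest := map[int]int{} // duration of the slowest Unit in each group
	groups := 0
	for name, u := range units {
		d := visit(name)
		longest[d] = max(longest[d], u.Seconds)
		groups = max(groups, d+1)
	}
	start := make([]int, groups+1) // start[g] is the time group g begins
	for g := 0; g < groups; g++ {
		start[g+1] = start[g] + longest[g]
	}
	finish := map[string]int{}
	for name, u := range units {
		finish[name] = start[depth[name]] + u.Seconds
	}
	return finish
}

func main() {
	units := map[string]Unit{
		"./slow-unit":                 {Seconds: 8},
		"./fast-unit":                 {Seconds: 2},
		"./unit-depends-on-slow-unit": {Seconds: 3, Deps: []string{"./slow-unit"}},
		"./unit-depends-on-fast-unit": {Seconds: 3, Deps: []string{"./fast-unit"}},
	}
	fmt.Println("groups:", groupFinish(units))
	fmt.Println("pools: ", poolFinish(units))
}
```

With these assumed durations, ./unit-depends-on-fast-unit finishes at 11s under groups but at 5s when it only waits on its own dependency, while the total run still ends at 11s in both cases.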

Proposal

When Terragrunt starts a run, create a runner pool and a Unit queue, then add unblocked Units to the pool from the queue and run them until the queue is empty.

This allows runs to complete more efficiently, and keeps an individual failing Unit from failing the entire run.

Some users may prefer the current behavior, where Terragrunt fails early if an individual Unit fails. To support that use case, a --terragrunt-fail-fast flag will be introduced.

Technical Details

Algorithm

The algorithm for working with the runner pool will be as follows:

Discover Units and add them to a queue with metadata regarding their dependencies.

Store that metadata as a slice of Units the discovered Unit is blocked by.

In addition store a slice of dependencies for each Unit.

Set the status of each Unit to ready if its blocked by list is empty, and to blocked otherwise.

Sort the queue by:
1. Units with ready status first.
2. Units with more dependencies before Units with fewer dependencies.
Create a runner pool with a size equal to the minimum of:
1. The total number of Units
2. The configured maximum concurrency.
Do the following in a loop:
1. Add ready Units to the pool based on precedence (more dependencies go first) and set their status to pending until either:
  1. There is no space in the pool.
  2. There are no more ready Units in the queue.
2. Concurrently run all pending Units. This changes their status to running. When a run completes, it does the following:
  1. If the run was successful (exit code of 0):
    1. Change the status to succeeded.
    2. Remove the Unit from any blocked by in the queue.
      1. If any Unit has an empty blocked by as a consequence, set the status to ready.
    3. Remove the Unit from the pool.
  2. If the run failed (exit code ≠ 0):
    1. Change the status to failed.
    2. For any Units in the queue that were blocked by the Unit:
      1. Set their status to ancestor failed.
      2. Remove them from the queue.
      3. Recursively repeat for any Unit blocked by a removed Unit.
    3. If a user sets the --terragrunt-fail-fast flag, do the following for all remaining Units:
      1. Set their status to fail fast.
      2. Evict them from the queue.
    4. Remove the Unit from the pool.
3. Wait until one of the following is true:
  1. The queue is empty: break the loop.
  2. The pool has space and one or more Units in the queue are ready: return to step 1.
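
The steps above can be sketched in Go roughly as follows. This is an illustrative sketch under stated assumptions, not Terragrunt's implementation: Unit, RunPool, failDescendants, and the run callback (a stand-in that returns an exit code) are all hypothetical names. A Unit stays in the queue until it reaches a final status, and the pending and running statuses collapse into one step because a Unit starts as soon as it is admitted to the pool.

```go
package main

import (
	"fmt"
	"sort"
)

// Status values mirror the statuses named in the steps above.
type Status string
const (
	Ready, Blocked, Pending, Running Status = "ready", "blocked", "pending", "running"
	Succeeded, Failed                Status = "succeeded", "failed"
	AncestorFailed, FailFast         Status = "ancestor failed", "fail fast"
)

// Unit is a hypothetical queue entry; every field name is an assumption.
type Unit struct {
	Path      string
	Deps      []string        // all dependencies, used for sort precedence
	BlockedBy map[string]bool // dependencies that have not succeeded yet
	Status    Status
	ExitCode  int // set by the runner before it reports back
}

// failDescendants recursively evicts Units blocked by failed as ancestor failed.
func failDescendants(queue map[string]*Unit, final map[string]Status, failed string) {
	for path, u := range queue {
		if u.BlockedBy[failed] {
			final[path] = AncestorFailed
			delete(queue, path)
			failDescendants(queue, final, path)
		}
	}
}

// RunPool takes Unit path -> dependency paths and returns each final status.
func RunPool(deps map[string][]string, maxConcurrency int, failFast bool, run func(string) int) map[string]Status {
	queue := map[string]*Unit{}
	for path, d := range deps {
		u := &Unit{Path: path, Deps: d, BlockedBy: map[string]bool{}, Status: Ready}
		for _, dep := range d {
			u.BlockedBy[dep], u.Status = true, Blocked
		}
		queue[path] = u
	}
	size := min(len(queue), maxConcurrency) // builtin min needs Go 1.21+
	final, running := map[string]Status{}, 0
	results := make(chan *Unit, len(queue)) // buffered so runners never block
	for len(queue) > 0 {
		var ready []*Unit
		for _, u := range queue {
			if u.Status == Ready {
				ready = append(ready, u)
			}
		}
		// More dependencies first; ties broken by path to stay deterministic.
		sort.Slice(ready, func(i, j int) bool {
			a, b := ready[i], ready[j]
			return len(a.Deps) > len(b.Deps) || (len(a.Deps) == len(b.Deps) && a.Path < b.Path)
		})
		for i := 0; i < len(ready) && running < size; i++ {
			ready[i].Status = Running // pending and running collapse into one step here
			go func(u *Unit) { u.ExitCode = run(u.Path); results <- u }(ready[i])
			running++
		}
		if running == 0 {
			break // nothing can run: a cycle or an unknown dependency
		}
		done := <-results // wait for any runner to finish
		running--
		delete(queue, done.Path) // frees a slot in the pool
		if done.ExitCode == 0 {
			final[done.Path] = Succeeded
			for _, u := range queue {
				delete(u.BlockedBy, done.Path)
				if u.Status == Blocked && len(u.BlockedBy) == 0 {
					u.Status = Ready
				}
			}
			continue
		}
		final[done.Path] = Failed
		failDescendants(queue, final, done.Path)
		for path, u := range queue {
			if failFast && u.Status != Running {
				final[path] = FailFast
				delete(queue, path)
			}
		}
	}
	return final // Units left in the queue after an early break are omitted
}

func main() {
	deps := map[string][]string{"./unit-a": nil, "./unit-b": nil, "./unit-depends-on-unit-b": {"./unit-b"}}
	exit := map[string]int{"./unit-b": 1} // pretend ./unit-b exits non-zero
	fmt.Println(RunPool(deps, 2, false, func(p string) int { return exit[p] }))
}
```

In this example ./unit-a succeeds, ./unit-b fails, and ./unit-depends-on-unit-b is marked ancestor failed without ever running. One design choice worth noting: when fail fast triggers, Units that are already running are left to finish and report their own result rather than being cancelled, which is a question the steps above leave open.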

Diagrams

Simple diagram of how units run in the current groups approach vs. in runner pools:

Worst case scenario of how this would impact performance:

In Groups

In Pools

The worst case scenario for the change to Runner Pools is that the total runtime for everything in the run queue is the same between the two approaches. You can see this in the blue 8s and green 6s Units, which combine to slow down the total execution in both approaches.

Even in this scenario, however, the purple Units that depend on red complete their runs sooner. This is one of the main advantages of this approach: more concurrency is used on average, making fuller use of the available hardware.

Compare this to a best case scenario where the 8s blue unit has dropped down to a 4s runtime:

In Groups

In Pools

As you can see, because the slower green unit is no longer blocked by the entirety of group 1, the total run completes faster, and the purple units finish at the same timestamp.

Press Release

Introducing Terragrunt Runner Pools!

Starting with release x.y.z, Terragrunt now ships with an additional experimental concurrency model referred to as Runner Pools.

This new concurrency model allows users to perform large run-all invocations without individual failures impacting the success of the overall run, and allows runs to finish faster, on average.

To enable runner pools, set the following environment variable to opt in:

TERRAGRUNT_EXPERIMENTAL_RUNNER_POOL=1

Drawbacks

This is a more complicated model than is currently used by Terragrunt, and may be more difficult to display in logs.

e.g.

$ terragrunt run-all plan
14:09:52.456 INFO   The stack at . has been added to the runner queue in the following order:
| ./unit-a                               |
| ./unit-b                               |
| ./unit-depends-on-unit-a               |
| --> Depends on: [ ./unit-a ]           |
| ./unit-depends-on-unit-b               |
| --> Depends on: [ ./unit-b ]           |
| ./unit-depends-on-unit-a-and-b         |
| --> Depends on: [ ./unit-a, ./unit-b ] |

There are also more opportunities to accidentally deadlock Terragrunt, as checks have to be done at multiple stages to proceed with a run.

A big change like this should also be opt-in initially, as any significant issue with the new concurrency model will probably make Terragrunt unusable for users. We'll also want to give users a chance to validate that the new model does improve performance in real production use-cases before forcing everyone to switch over. Supporting both models during that period can be a significant maintenance burden, and might make it hard to keep development velocity up.

Alternatives

Don't do it

The current model works, and works at fairly large scale. It is simpler to reason about and display. Recent additions like the errors block also make it easier to ignore failures, so users with flaky Units can simply configure Terragrunt to ignore those errors.

Don't release Runner Pools as an experiment

Supporting two concurrency models simultaneously adds quite a bit of complexity, and increases the risk of introducing bugs in either one. The value of allowing users to use runner pools as an experiment is that they can test them out in the wild, but it might be better to just undergo extensive testing preemptively, then make this the only mechanism for running Terragrunt.

Migration Strategy

Start with this new concurrency model being an experimental opt-in feature. Users may prefer the old concurrency model, and they should be given time to try out the new model before it is the default.

This will also give time for the new concurrency model to be tested on production infrastructure for early adopters before all users are forced to adopt it when it becomes the default.

Unresolved Questions

What are the hidden risks to changing the concurrency model in this way?
This is a more complex concurrency model, and may require additional design to convey information about the way in which Terragrunt is going to run units in the terminal. How should Terragrunt help users understand what is happening, visually?
How long should the experiment run, and what determines success?

References

No response

Proof of Concept Pull Request

No response

Support Level

I have Terragrunt Enterprise Support
I am a paying Gruntwork customer

Customer Name

No response


wakeful · 2024-12-10T15:15:06Z

From my experience in projects where we built many small units, this would be a total game changer and potentially lead to a huge speed increase!
I’m not sure how much telemetry you’re currently collecting from customers, but I’d love to measure runs across some of my projects. This could be a great way to "measure the success" of the experiment. I’m just not sure if you’re actually collecting data points when the end user doesn’t set DISABLE_TELEMETRY.

yhakbar · 2024-12-11T02:35:19Z

Terragrunt does not emit any telemetry other than the telemetry users can decide to emit and collect themselves via the OpenTelemetry integration.

We will look for ways to prove this out with benchmarks, internal infrastructures and feedback from the community.

yhakbar added the rfc (Request For Comments) and pending-decision (Pending decision from maintainers) labels on Dec 5, 2024
