Change the underlying concurrency model of Terragrunt so that a pool of runners are leveraged instead of run groups.
Add Units to the pool of runners when they are ready.
When all Units have completed their runs, end the run.
Motivation
Terragrunt currently runs Units in a concurrency model where Units are grouped based on their dependencies, and groups are run in parallel when they do not have any pending group they depend upon.
e.g.
$ terragrunt run-all plan
14:09:52.456 INFO The stack at . will be processed in the following order for command plan:
Group 1
- Module ./unit-a
- Module ./unit-b
Group 2
- Module ./unit-depends-on-unit-a
- Module ./unit-depends-on-unit-b
This is a simple concurrency model, and is easy to display in logs.
Individual Units failing during runs can cause entire groups, and the groups that depend on them, to fail, meaning that a single failing Unit can cause widespread failure across a Stack.
In addition, there is wasted time in a run, as a group only starts executing once every group it depends on has completed. A group dependent on another group will therefore only start running when the slowest Unit in the dependency group completes.
$ terragrunt run-all plan
14:09:52.456 INFO The stack at . will be processed in the following order for command plan:
Group 1
- Module ./slow-unit
- Module ./fast-unit
Group 2
- Module ./unit-depends-on-slow-unit
- Module ./unit-depends-on-fast-unit
Proposal
When Terragrunt starts a run, create a runner pool and a Unit queue, then add unblocked Units to the pool from the queue and run them until the queue is empty.
This will make it so that runs can complete more efficiently and so that individual failing Units do not cause the entire run to fail.
Some users may prefer the current behavior, where Terragrunt fails early if an individual Unit fails. To support that use-case, a --terragrunt-fail-fast flag will be introduced.
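The proposed queue can be sketched with a few core types. This is a minimal illustration, not Terragrunt's actual implementation; all names here are hypothetical:

```go
package main

import "fmt"

// Status tracks where a Unit is in its lifecycle.
type Status int

const (
	StatusBlocked Status = iota
	StatusReady
	StatusRunning
	StatusSucceeded
	StatusFailed
)

// Unit carries the dependency metadata the queue needs.
type Unit struct {
	Name         string
	Status       Status
	BlockedBy    []string // unfinished Units this Unit waits on
	Dependencies []string // full dependency list, used for ordering
}

// NewUnit marks a Unit ready when it has nothing to wait on,
// and blocked otherwise.
func NewUnit(name string, deps []string) *Unit {
	u := &Unit{
		Name:         name,
		BlockedBy:    append([]string(nil), deps...),
		Dependencies: deps,
	}
	if len(u.BlockedBy) == 0 {
		u.Status = StatusReady
	} else {
		u.Status = StatusBlocked
	}
	return u
}

func main() {
	a := NewUnit("unit-a", nil)
	b := NewUnit("unit-depends-on-unit-a", []string{"unit-a"})
	fmt.Println(a.Status == StatusReady, b.Status == StatusBlocked) // true true
}
```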
Technical Details
Algorithm
The algorithm for working with the runner pool will be as follows:
Discover Units and add them to a queue with metadata regarding their dependencies.
Store that metadata as a slice of Units the discovered Unit is blocked by.
In addition, store a slice of dependencies for each Unit.
Set the status of a Unit to ready if its blocked by list is empty, and to blocked otherwise.
Sort the queue by:
Units with ready status first.
Units with more dependencies before Units with fewer dependencies.
Create a runner pool sized to the minimum of:
The total number of Units.
The configured maximum concurrency.
Do the following in a loop:
Add ready Units to the pool based on precedence (more dependencies go first) and set their status to pending until either:
There is no space in the pool.
There are no more ready Units in the queue.
Concurrently run all pending Units. This changes their status to running. When a run completes, it does the following:
If the run was successful (exit code of 0):
Change the status to succeeded.
Remove the Unit from any blocked by in the queue.
If any Unit has an empty blocked by as a consequence, set the status to ready.
Remove the Unit from the pool.
If the run failed (exit code ≠ 0):
Change the status to failed.
For any Units in the queue that were blocked by the Unit:
Set their status to ancestor failed.
Remove them from the queue.
Recursively repeat for any Unit blocked by a removed Unit.
If a user sets the --terragrunt-fail-fast flag, do the following for all remaining Units:
Set their status to fail fast.
Evict them from the queue.
Remove the Unit from the pool.
Poll the queue for one of the following:
The queue is empty, break the loop.
The pool has space and one or more Units in the queue are ready.
Diagrams
Simple diagram of how units run in the current groups approach vs. in runner pools:
Worst case scenario of how this would impact performance:
In Groups
In Pools
The worst case scenario for the change to Runner Pools is that the total runtime for everything in the run queue is the same between the two approaches. You can see this in how the blue 8s and green 6s units combine to bound the total execution time in both.
Even in this scenario, note that the purple units that depend on red still complete their runs faster. This is one of the main advantages of this approach: more concurrency is used on average, making better use of available hardware.
Compare this to a best case scenario where the 8s blue unit has dropped down to a 4s runtime:
In Groups
In Pools
As you can see, because the slower green unit is no longer blocked by the entirety of group 1, the total run completes faster, and the purple units finish at the same timestamp.
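The scheduling difference can be made concrete with a small worked example. The durations and unit names below are hypothetical, not taken from the diagrams; the code computes finish times under both models, assuming spare pool capacity:

```go
package main

import "fmt"

// task holds a hypothetical duration (seconds) and dependency list.
type task struct {
	dur  int
	deps []string
}

var example = map[string]task{
	"slow-unit":            {dur: 8},
	"fast-unit":            {dur: 2},
	"depends-on-slow-unit": {dur: 3, deps: []string{"slow-unit"}},
	"depends-on-fast-unit": {dur: 6, deps: []string{"fast-unit"}},
}

// groupFinish models today's behavior: an entire dependency level must
// finish before the next level starts.
func groupFinish(tasks map[string]task) map[string]int {
	level := map[string]int{}
	var depth func(string) int
	depth = func(n string) int {
		if l, ok := level[n]; ok {
			return l
		}
		l := 0
		for _, d := range tasks[n].deps {
			if dl := depth(d) + 1; dl > l {
				l = dl
			}
		}
		level[n] = l
		return l
	}
	maxLevel := 0
	for n := range tasks {
		if d := depth(n); d > maxLevel {
			maxLevel = d
		}
	}
	start := 0
	finish := map[string]int{}
	for l := 0; l <= maxLevel; l++ {
		levelEnd := start
		for n, t := range tasks {
			if level[n] == l {
				finish[n] = start + t.dur
				if finish[n] > levelEnd {
					levelEnd = finish[n]
				}
			}
		}
		start = levelEnd // next group waits for the slowest unit here
	}
	return finish
}

// poolFinish models the proposal: a unit starts as soon as its own
// dependencies finish.
func poolFinish(tasks map[string]task) map[string]int {
	finish := map[string]int{}
	var f func(string) int
	f = func(n string) int {
		if v, ok := finish[n]; ok {
			return v
		}
		start := 0
		for _, d := range tasks[n].deps {
			if fd := f(d); fd > start {
				start = fd
			}
		}
		finish[n] = start + tasks[n].dur
		return finish[n]
	}
	for n := range tasks {
		f(n)
	}
	return finish
}

func main() {
	fmt.Println(groupFinish(example)["depends-on-fast-unit"]) // 14: waits for slow-unit's whole group
	fmt.Println(poolFinish(example)["depends-on-fast-unit"])  // 8: starts as soon as fast-unit is done
}
```

In this example the dependent of the fast unit finishes 6 seconds earlier under the pool model, while the slow chain is unchanged, mirroring the diagrams above.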
Press Release
Introducing Terragrunt Runner Pools!
Starting with release x.y.z, Terragrunt now ships with an additional experimental concurrency model referred to as Runner Pools.
This new concurrency model allows users to perform large run-all invocations without individual failures impacting the success of the overall run, and allows runs to finish faster, on average.
To enable runner pools, set the following environment variable to opt in:
TERRAGRUNT_EXPERIMENTAL_RUNNER_POOL=1
Drawbacks
This is a more complicated model than is currently used by Terragrunt, and may be more difficult to display in logs.
e.g.
$ terragrunt run-all plan
14:09:52.456 INFO The stack at . has been added to the runner queue in the following order:
| ./unit-a |
| ./unit-b |
| ./unit-depends-on-unit-a |
| --> Depends on: [ ./unit-a ] |
| ./unit-depends-on-unit-b |
| --> Depends on: [ ./unit-b ] |
| ./unit-depends-on-unit-a-and-b |
| --> Depends on: [ ./unit-a, ./unit-b ] |
There are also more opportunities to accidentally deadlock Terragrunt, as checks have to be done at multiple stages to proceed with a run.
A big change like this should also be opt-in initially, as any significant issue with the new concurrency model will probably make Terragrunt unusable for users. We'll also want to give users a chance to validate that the new model does improve performance in real production use-cases before forcing everyone to switch over. This can be a significant maintenance burden, and might make it hard to keep development velocity up.
Alternatives
Don't do it
The current model works, and works at fairly large scale. It is simpler to reason about and display. Recent additions like the error block also make it easier to ignore failures, so users with flaky units can simply ignore their errors.
Don't release Runner Pools as an experiment
Simultaneously supporting two concurrency models adds quite a bit of complexity and risks introducing bugs into either. The value of releasing runner pools as an experiment is that users can test them in the wild, but it might be better to undergo extensive testing preemptively, then make this the only mechanism for running Terragrunt.
Migration Strategy
Start with this new concurrency model being an experimental opt-in feature. Users may prefer the old concurrency model, and they should be given time to try out the new model before it is the default.
This will also give time for the new concurrency model to be tested on production infrastructure for early adopters before all users are forced to adopt it when it becomes the default.
Unresolved Questions
What are the hidden risks to changing the concurrency model in this way?
This is a more complex concurrency model, and may require additional design to convey information about the way in which Terragrunt is going to run units in the terminal. How should Terragrunt help users understand what is happening, visually?
How long should the experiment run, and what determines success?
References
No response
Proof of Concept Pull Request
No response
Support Level
I have Terragrunt Enterprise Support
I am a paying Gruntwork customer
Customer Name
No response
From my experience in projects where we built many small units, this would be a total game changer and potentially lead to a huge speed increase!
I’m not sure how much telemetry you’re currently collecting from customers, but I’d love to measure runs across some of my projects. This could be a great way to "measure the success" of the experiment. I’m just not sure if you’re actually collecting data points when the end user doesn’t set DISABLE_TELEMETRY.