Commit e630f78

docs: Propose build batching for Bazel. (#9425)
1 parent 06d20b8 commit e630f78

# Concurrent Bazel Builds

* Author(s): Seth Nelson
* Design Shepherd:
* Date: 2024/05/21
* Status: [Reviewed/Cancelled/Under implementation/Complete]

## Background

Bazel supports high levels of concurrency in its action graph model; however, Skaffold is not currently able to
take advantage of that concurrency to build multiple targets because it is modeled around a separate builder
invocation per artifact. This does not play nicely with Bazel's workspace lock model: invoking multiple builds
in parallel as separate processes forces them to serialize. This effectively renders Skaffold's `--build-concurrency`
flag null and void (see https://github.com/GoogleContainerTools/skaffold/issues/6047), and results in longer overall
build times than if multiple targets' action execution could interleave.

This document proposes changes to Skaffold that will enable multiple targets to be built in a single Bazel invocation.


## Design

### Where to break the "one artifact per build" abstraction

As part of this design, I considered three places to introduce the concept of multi-artifact builds:

1. For all builds - in the `builder_mux`, group builds before scheduling.
2. For the local builder pipeline - allow any underlying `artifactBuilder` to implement a `BuildMultiple` method,
   and wrap that in a `BatchingBuilder` which can queue up multiple incoming builds before sending groups into `BuildMultiple`.
3. For Bazel only, where batching happens as an implementation detail within `bazel/build.go`.

Ultimately, this design proposes #3.

* While #1 results (in my opinion) in the cleanest abstraction model (after all, `builder_mux` is the demux point for
  turning a list of artifacts into a set of builds to be scheduled), it was rejected due to its invasiveness and
  complexity. Implementing this at the `builder_mux` level results in a
  ton of method forking (into multi-artifact variants); it also introduces significant scheduling complexity (to avoid
  cycles introduced by batching build targets). This complexity did not feel warranted for a feature that is likely
  to only be used by Bazel.
* #2 was rejected because it would incorrectly (continue to) couple the compilation and load/push phases of Bazel's `artifactBuilder` implementation.
  We only want to group up the actual `bazel build` invocation, not the subsequent `docker.Push`/`b.loadImage`, which
  is also part of the `Build` implementation. Batching at the `Build` implementation level would (a) reduce the speed gains, because we would serialize the loads/pushes,
  and (b) add implementation complexity, because we would need to handle `BatchingBuilder`s with different `artifactBuilder`
  constructor parameters (like `loadImages`).


### Design Overview

This design proposes implementing build batching for Bazel builds by altering the Bazel Builder to build multiple
TARs, and then multiplexing contemporaneous builds (i.e. waiting for a period of time for build targets to
"collect" and then building them all together).

* This changes no existing externally-visible methods/interfaces.
* This does not _require_ config changes, though we could consider (a) initially implementing this functionality hidden behind a configuration flag, for rollout safety, and (b) making the batch window configurable.
* This design naturally respects the `--build-concurrency` flag, which limits how many builds will be scheduled contemporaneously.

### Design Elements

#### Bazel Builder can build multiple TARs

This is an in-place change; we replace `buildTar` with `buildTars`, which builds up a list of `buildTargets` and builds them in a single Bazel invocation.
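
As a rough illustration of collapsing several targets into one invocation, the sketch below assembles a single `bazel build` argument list for a whole batch. `bazelBuildArgs` and the target labels are hypothetical names chosen for this example; they are not taken from Skaffold or from the proof-of-concept. The only assumption about Bazel is that `bazel build` accepts several target labels in one command.

```go
package main

import (
	"fmt"
	"strings"
)

// bazelBuildArgs is a hypothetical helper (not Skaffold's API). It assembles
// the arguments for one `bazel build` invocation covering every target in a
// batch, instead of one invocation per target.
func bazelBuildArgs(buildArgs, targets []string) []string {
	args := make([]string, 0, 1+len(buildArgs)+len(targets))
	args = append(args, "build")
	args = append(args, buildArgs...) // args shared by the whole batch
	args = append(args, targets...)   // all batched targets, built together
	return args
}

func main() {
	args := bazelBuildArgs(
		[]string{"--config=ci"},
		[]string{"//app:image.tar", "//worker:image.tar"},
	)
	fmt.Println("bazel " + strings.Join(args, " "))
}
```

Because every target in the list shares the same build args, only builds with matching args can safely be combined this way, which is what the batch key described in this document is for.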

#### computeBatchKey function

A `computeBatchKey` function returns the same value for two different builds if they can be grouped together. Elements of
the Bazel batch key include:
* The sorted list of build args
* The (currently, single) bazel platform flag
* The artifact's workspace
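
The key elements listed above could be combined along the following lines. This is a minimal sketch, not the proof-of-concept's code: the `batchKeyInput` type and its field names are assumptions made for the example, and the real function would read these values from the Skaffold artifact and platform matcher.

```go
package main

import (
	"fmt"
	"sort"
)

// batchKeyInput is a hypothetical stand-in for the artifact fields that
// decide whether two builds may share one `bazel build` invocation.
type batchKeyInput struct {
	BuildArgs []string
	Platform  string // the single Bazel platform flag; may be empty
	Workspace string
}

// computeBatchKey returns the same string for two inputs whose sorted build
// args, platform, and workspace all match.
func computeBatchKey(in batchKeyInput) string {
	// Copy before sorting so the caller's slice is not reordered.
	args := append([]string(nil), in.BuildArgs...)
	sort.Strings(args)
	// %q quotes every value, so a separator inside a value cannot make two
	// different inputs produce the same key.
	return fmt.Sprintf("%q|%q|%q", args, in.Platform, in.Workspace)
}

func main() {
	a := computeBatchKey(batchKeyInput{BuildArgs: []string{"--copt=-O2", "--config=ci"}, Workspace: "services"})
	b := computeBatchKey(batchKeyInput{BuildArgs: []string{"--config=ci", "--copt=-O2"}, Workspace: "services"})
	c := computeBatchKey(batchKeyInput{BuildArgs: []string{"--config=ci"}, Workspace: "services"})
	fmt.Println("same args, different order batch together:", a == b)
	fmt.Println("different args batch together:", a == c)
}
```

Sorting the args makes the key independent of the order in which they were declared; whether argument order is ever significant for a given Bazel flag set is a question this sketch does not address.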

#### batchingBuilder

A `batchingBuilder` wraps an underlying multi-artifact build function (the `buildTars` function), and maintains a map of
non-started `MultiArtifactBuild` objects (one per batch key).

```go
type batchingBuilder struct {
	builder   multiJarBuilder                // the underlying multi-artifact build function
	batchSize time.Duration                  // how long to wait for builds to collect
	builds    map[string]*MultiArtifactBuild // non-started builds, keyed by batch key
	mu        sync.Mutex
}
```

```go
type MultiArtifactBuild struct {
	ctx       context.Context
	artifacts []*latest.Artifact // artifacts queued into this build
	platforms platform.Matcher
	workspace string
	mu        sync.Mutex
	doneCond  chan bool
	running   bool
	out       io.Writer // OutputWriter of the build that created this object
	digest    map[*latest.BazelArtifact]string
	err       error
}
```

When the `batchingBuilder` is asked to build an artifact, it first checks to see if there is an existing build for the batch key.
* If there is, it simply adds the artifact to the build's `artifacts` list.
* If there is not, it creates a new `MultiArtifactBuild`, and kicks off a goroutine that will wait `batchSize` before
  invoking its `multiJarBuilder`.
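
The join-or-create behavior described above can be sketched in isolation. The code below is a simplified illustration, not the proof-of-concept: `batcher`, `batch`, `newBatcher`, and `demo` are hypothetical names, targets are plain strings rather than Skaffold artifacts, and it omits contexts, per-artifact digests, and output writers. It assumes every caller blocks until the shared build finishes and then receives that build's error.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// batch collects the targets that arrive within one batch window.
type batch struct {
	targets []string
	done    chan struct{} // closed when the shared build finishes
	err     error
}

// batcher is a simplified, hypothetical version of batchingBuilder.
type batcher struct {
	mu      sync.Mutex
	window  time.Duration
	pending map[string]*batch // non-started batches, keyed by batch key
	build   func(targets []string) error
}

func newBatcher(window time.Duration, build func([]string) error) *batcher {
	return &batcher{window: window, pending: map[string]*batch{}, build: build}
}

// Build joins the pending batch for key, or opens a new one and schedules it.
// It blocks until the shared build finishes and returns that build's error.
func (b *batcher) Build(key, target string) error {
	b.mu.Lock()
	cur, ok := b.pending[key]
	if !ok {
		cur = &batch{done: make(chan struct{})}
		b.pending[key] = cur
		go b.run(key, cur)
	}
	cur.targets = append(cur.targets, target)
	b.mu.Unlock()

	<-cur.done
	return cur.err
}

// run waits for the batch window, then builds everything that was collected.
func (b *batcher) run(key string, cur *batch) {
	time.Sleep(b.window)
	b.mu.Lock()
	delete(b.pending, key) // later arrivals start a fresh batch
	targets := cur.targets
	b.mu.Unlock()

	cur.err = b.build(targets)
	close(cur.done)
}

// demo submits three targets under one key and reports how many times the
// shared build function ran and how many targets it received in total.
func demo(windowMs int) (calls, total int) {
	var mu sync.Mutex
	b := newBatcher(time.Duration(windowMs)*time.Millisecond, func(targets []string) error {
		mu.Lock()
		defer mu.Unlock()
		calls++
		total += len(targets)
		return nil
	})
	var wg sync.WaitGroup
	for _, t := range []string{"//a:image.tar", "//b:image.tar", "//c:image.tar"} {
		wg.Add(1)
		go func(target string) {
			defer wg.Done()
			_ = b.Build("same-key", target)
		}(t)
	}
	wg.Wait()
	return calls, total
}

func main() {
	calls, total := demo(200)
	fmt.Printf("build invocations: %d, targets built: %d\n", calls, total)
}
```

Removing the batch from the map before the build starts means a request that arrives mid-build opens a new batch instead of being appended to one that is already running. Since every caller in a batch receives the same error, one failing target marks the whole batch as failed.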

#### How this impacts logging and metrics

Given our choice of where to break the "one artifact per build" abstraction, this design presents some rough edges
around logging and metrics that are worth acknowledging:
* If any build in a batch fails, they are all reflected as failed to the Events API / console output.
  * This is fundamentally unavoidable in any implementation that involves a single bazel invocation building multiple
    targets.
* Only one build's `OutputWriter` is actually used (that of the first build to create the `MultiArtifactBuild` object).
  * This presents some rough edges in my POC that I could use some help ironing out - depending on scheduling order, sometimes
    it fails to stream output to the console until the build is complete.

### Open Questions

**\<Should we include either of the configuration knobs listed in the Design Overview?>**

Resolution: __Not Yet Resolved__


## Implementation plan

I believe this change is well-contained enough to be implemented in a single PR; please let me know what you think.
You can find a functional proof-of-concept here: https://github.com/GoogleContainerTools/skaffold/compare/main...sethnelson-tecton:skaffold:sethn/parallel-builds

___


## Integration test plan

New test cases (in addition to status quo test cases of unbatched builds):

1. Batched builds that can all batch together (2 artifacts, no dependencies, build-concurrency>=2)
1. Batched builds that should not batch together due to concurrency limits (4 artifacts, build-concurrency>=2)
1. Batched builds that should not batch together due to scheduling (2 artifacts, one depends on the other, build-concurrency>=2)
1. Batched builds with different build args
