Triage identifies clusters of similar test failures across all jobs.
Use it here: https://go.k8s.io/triage
Triage consists of two parts: a summarizer, which clusters together similar test failure messages, and a web page which can be used to browse the results. The web page is a static HTML page which grabs the results in JSON format, parses them, and displays them.
Triage summarization is generally run via update_summaries.sh, which downloads the input files in the correct format and passes them automatically to triage. (File formats are listed below.)
However, summarization can be run directly with the following flags:
- builds: a path to a JSON file containing build information
- previous (optional): a path to a previous output which can be used to maintain consistent cluster IDs
- owners (optional): a path to a file that maps SIGs to the labels they own (see Methodology); no longer used as labels are read straight from test names
- output (optional): the path to where the output should be written to; defaults to ./failure_data.json
- output_slices (optional): a pattern to be used when outputting slices, if desired (see Methodology); e.g. slices/failure_data_PREFIX.json, where PREFIX will be replaced with some identifier
- num_workers (optional): the number of worker goroutines to spawn for parallelized functions; defaults to 2*runtime.NumCPU()-1. (Since CPU detection is unreliable in Kubernetes, we set it manually according to the number of CPUs in test-infra-periodics.yaml.)
- memoize (optional): whether to memoize certain function results to JSON (and use previously memoized results if they exist); defaults to false
- ...tests: after all named flags are passed in, a space-delimited series of paths to files containing test information should be passed in as well
Triage uses klog for logging, so klog flags can be passed in as well.
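As an illustration of the flags listed above, here is a minimal sketch of how they could be declared with Go's standard flag package. It is not the summarizer's actual source: the options struct, the parseOptions function, and the file names in main are hypothetical, and treating builds as required is an assumption based on it not being marked optional. Flag names and defaults are taken from the list above.

```go
package main

import (
	"errors"
	"flag"
	"fmt"
	"runtime"
)

// options mirrors the flags described above. The struct itself is hypothetical.
type options struct {
	builds       string
	previous     string
	owners       string
	output       string
	outputSlices string
	numWorkers   int
	memoize      bool
	tests        []string // positional arguments after the named flags
}

// parseOptions parses args (without the program name) into options.
func parseOptions(args []string) (options, error) {
	var o options
	fs := flag.NewFlagSet("triage", flag.ContinueOnError)
	fs.StringVar(&o.builds, "builds", "", "path to a JSON file containing build information")
	fs.StringVar(&o.previous, "previous", "", "path to a previous output")
	fs.StringVar(&o.owners, "owners", "", "path to a file mapping SIGs to labels")
	fs.StringVar(&o.output, "output", "./failure_data.json", "where to write the output")
	fs.StringVar(&o.outputSlices, "output_slices", "", "pattern for slice files, containing PREFIX")
	fs.IntVar(&o.numWorkers, "num_workers", 2*runtime.NumCPU()-1, "number of worker goroutines")
	fs.BoolVar(&o.memoize, "memoize", false, "memoize certain function results to JSON")
	if err := fs.Parse(args); err != nil {
		return o, err
	}
	// Assumption: builds is mandatory because the list does not mark it optional.
	if o.builds == "" {
		return o, errors.New("the builds flag is required")
	}
	// Everything left after the named flags is treated as a test file path.
	o.tests = fs.Args()
	return o, nil
}

func main() {
	// Hypothetical file names, for illustration only.
	o, err := parseOptions([]string{"-builds", "builds.json", "-memoize", "tests-1.json", "tests-2.json"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(o.builds, o.output, o.memoize, o.tests)
}
```

Because memoize is a boolean flag, it does not consume the argument that follows it, so the test file paths can start immediately after it.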
The web page can be accessed at https://go.k8s.io/triage with the following options:
- Date: defaults to "today"; note that all usages of "today" on the page refer to the currently set date
- Show clusters for SIG: filter results by the SIG assigned to the majority of the tests; allows multi-select
- Include results from: toggle between CI tests, PR tests, or both
- Sort by: basic sorting
- Include filter / Exclude filter: advanced regex filtering by field
Note that the clusters at the top of the web page are static, and must be added/removed manually. Simply adding a button to the HTML is enough.
Package berghelroach contains a modified Levenshtein distance formula. Its only export is a Dist() function.
Package summarize depends on package berghelroach and does the actual heavy lifting.
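For readers unfamiliar with Levenshtein distance, the sketch below computes the classic, unmodified edit distance with a two-row dynamic-programming table. It is a plain textbook version, not the modified formula in package berghelroach, and it does not reproduce the signature of Dist(), which is not described here. The names levenshtein and min3 are hypothetical.

```go
package main

import "fmt"

// min3 returns the smallest of three ints.
func min3(a, b, c int) int {
	if b < a {
		a = b
	}
	if c < a {
		a = c
	}
	return a
}

// levenshtein returns the number of single-rune insertions, deletions,
// and substitutions needed to turn a into b.
func levenshtein(a, b string) int {
	ra, rb := []rune(a), []rune(b)
	prev := make([]int, len(rb)+1) // distances for the previous row
	curr := make([]int, len(rb)+1) // distances for the current row
	for j := range prev {
		prev[j] = j // turning "" into rb[:j] takes j insertions
	}
	for i := 1; i <= len(ra); i++ {
		curr[0] = i // turning ra[:i] into "" takes i deletions
		for j := 1; j <= len(rb); j++ {
			cost := 1
			if ra[i-1] == rb[j-1] {
				cost = 0
			}
			curr[j] = min3(prev[j]+1, curr[j-1]+1, prev[j-1]+cost)
		}
		prev, curr = curr, prev
	}
	return prev[len(rb)]
}

func main() {
	fmt.Println(levenshtein("kitten", "sitting")) // prints 3
}
```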
The entire process is orchestrated by update_summaries.sh, as follows:
- Download all builds for the last 14 days from BigQuery.
- Download all failed tests for the last 14 days from BigQuery.
- Run triage:
  - Load the downloaded files, and convert them into a format that Go can handle better (i.e. by parsing numbers).
  - Group the builds by their build paths, and the test failures by their test names.
  - Load previous results (if any) to aid in computation.
  - Create a local clustering of the test failures grouped by test name above. This splits each group of test failures into local clusters, i.e. groups of failures with similar failure texts. The mapping at this point is Test Name => Local Cluster Text => Group of Test Failures.
  - Create a global clustering of the local clusters from the previous step, optionally using the previous results. This takes each local cluster and attempts to find clusters from other tests with similar cluster texts. If one is found, they are merged into a global cluster, with each test's failures remaining separate within the global cluster. The mapping at this point is Global Cluster Text => Test Name => Group of Test Failures.
  - Transform the global clustering into a format that compresses better, and which is consumable by the web page.
  - If a mapping of owners to owner prefixes (such as sig-testing => [sig-testing]) was provided as a flag, load it.
  - Annotate each cluster with an owner, by parsing the test name or using the provided mapping from the previous step. This can be used to filter the clusters by SIG on the web page.
  - Write the results to a JSON file.
  - If the output_slices flag is set, create individual files ("slices") for each owner. Also, split the results into 256 slices based on the cluster IDs. Write the slices to JSON files.
- Upload the results into Google Cloud Storage so they can be browsed via the web page.
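The two clustering stages described above can be sketched as follows. This is a simplified illustration of the two mappings, not the summarizer's implementation: the failure type and the function names are hypothetical, and the similar function is a stand-in that treats two texts as similar when they match after removing digits. The real summarizer presumably relies on an edit-distance measure instead.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// failure is a hypothetical, reduced view of one test failure.
type failure struct {
	Test string
	Text string
}

// normalize drops digits so messages that differ only in numbers compare equal.
// Assumption: this is a stand-in for a real similarity measure.
func normalize(s string) string {
	return strings.Map(func(r rune) rune {
		if unicode.IsDigit(r) {
			return -1
		}
		return r
	}, s)
}

func similar(a, b string) bool { return normalize(a) == normalize(b) }

// clusterLocal builds Test Name => Local Cluster Text => Group of Test Failures.
func clusterLocal(failures []failure) map[string]map[string][]failure {
	local := map[string]map[string][]failure{}
	for _, f := range failures {
		clusters := local[f.Test]
		if clusters == nil {
			clusters = map[string][]failure{}
			local[f.Test] = clusters
		}
		key := f.Text // start a new cluster unless a similar one exists
		for text := range clusters {
			if similar(text, f.Text) {
				key = text
				break
			}
		}
		clusters[key] = append(clusters[key], f)
	}
	return local
}

// clusterGlobal builds Global Cluster Text => Test Name => Group of Test Failures.
func clusterGlobal(local map[string]map[string][]failure) map[string]map[string][]failure {
	global := map[string]map[string][]failure{}
	for test, clusters := range local {
		for text, group := range clusters {
			key := text
			for g := range global {
				if similar(g, text) {
					key = g // merge into the existing global cluster
					break
				}
			}
			if global[key] == nil {
				global[key] = map[string][]failure{}
			}
			// Each test's failures stay separate inside the global cluster.
			global[key][test] = append(global[key][test], group...)
		}
	}
	return global
}

func main() {
	failures := []failure{
		{Test: "TestA", Text: "timeout after 30s"},
		{Test: "TestA", Text: "timeout after 45s"},
		{Test: "TestB", Text: "timeout after 10s"},
		{Test: "TestB", Text: "connection refused"},
	}
	global := clusterGlobal(clusterLocal(failures))
	fmt.Println("global clusters:", len(global))
}
```

Which of several similar texts ends up as the key of a global cluster depends on map iteration order in this sketch; only the grouping itself is stable.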
Below are the file structures for the ingested and outputted files. ... denotes a repetition of the previous element. "x Flag" denotes the file format of a file passed to flag x of the summarizer.
Main Output
{
  "clustered": [
    {
      "key": string,
      "id": string,
      "text": string,
      "spans": [
        int,
        ...
      ],
      "tests": [
        {
          "name": string,
          "jobs": [
            {
              "name": string,
              "builds": [
                int,
                ...
              ]
            },
            ...
          ]
        },
        ...
      ],
      "owner": string
    },
    ...
  ],
  "builds": {
    "jobs": {
      string: ([int, ...] OR {int as string: int, ...}) // See the description of the jobCollection type
    },
    "cols": {
      "started": [int, ...],
      "tests_failed": [int, ...],
      "elapsed": [int, ...],
      "tests_run": [int, ...],
      "result": [string, ...],
      "executor": [string, ...],
      "pr": [string, ...]
    },
    "job_paths": {
      string: string,
      ...
    }
  }
}
builds Flag
[
  {
    "path": string,
    "started": int as string,
    "elapsed": int as string,
    "tests_run": int as string,
    "tests_failed": int as string,
    "result": string,
    "executor": string,
    "job": string,
    "number": int as string,
    "pr": string,
    "key": string
  },
  ...
]
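Since the numeric fields in this file arrive as strings, loading it involves parsing numbers. One way to do that in Go is the ",string" option of encoding/json struct tags, sketched below. The build type, the parseBuilds function, and the sample values are hypothetical, and the sketch assumes every numeric field is present and well-formed.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// build is a hypothetical type for one entry of the builds file.
// The ",string" option decodes a JSON string such as "90" into an integer.
type build struct {
	Path        string `json:"path"`
	Started     int64  `json:"started,string"`
	Elapsed     int64  `json:"elapsed,string"`
	TestsRun    int    `json:"tests_run,string"`
	TestsFailed int    `json:"tests_failed,string"`
	Result      string `json:"result"`
	Executor    string `json:"executor"`
	Job         string `json:"job"`
	Number      int64  `json:"number,string"`
	PR          string `json:"pr"`
	Key         string `json:"key"`
}

// parseBuilds decodes the JSON array of builds.
func parseBuilds(data []byte) ([]build, error) {
	var builds []build
	if err := json.Unmarshal(data, &builds); err != nil {
		return nil, err
	}
	return builds, nil
}

func main() {
	// Made-up sample entry in the shape described above.
	sample := []byte(`[{"path":"some/path/123","started":"1600000000","elapsed":"90",
		"tests_run":"10","tests_failed":"2","result":"FAILURE","executor":"x",
		"job":"some-job","number":"123","pr":"","key":""}]`)
	builds, err := parseBuilds(sample)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(builds[0].Job, builds[0].Number, builds[0].TestsFailed)
}
```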
tests Flag
This is a newline-delimited list of JSON objects. Note the lack of commas between objects.
{
  "started": int as string,
  "build": string,
  "name": string,
  "failure_text": string
}
...
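Because this file is a stream of objects rather than a single JSON array, it cannot be decoded with one json.Unmarshal call. A minimal sketch of reading it with json.Decoder, and then grouping the failures by test name, follows. The failure type and the function names are hypothetical, and the sample lines are made up.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"strings"
)

// failure is a hypothetical type for one line of the tests file.
type failure struct {
	Started     int64  `json:"started,string"`
	Build       string `json:"build"`
	Name        string `json:"name"`
	FailureText string `json:"failure_text"`
}

// parseFailures decodes newline-delimited JSON objects until the input ends.
func parseFailures(input string) ([]failure, error) {
	dec := json.NewDecoder(strings.NewReader(input))
	var out []failure
	for {
		var f failure
		err := dec.Decode(&f)
		if err == io.EOF {
			return out, nil // clean end of stream
		}
		if err != nil {
			return nil, err
		}
		out = append(out, f)
	}
}

// groupByName groups failures by their test name.
func groupByName(failures []failure) map[string][]failure {
	groups := map[string][]failure{}
	for _, f := range failures {
		groups[f.Name] = append(groups[f.Name], f)
	}
	return groups
}

func main() {
	input := `{"started":"1600000000","build":"some/path/1","name":"TestA","failure_text":"timeout"}
{"started":"1600000100","build":"some/path/2","name":"TestA","failure_text":"timeout"}
{"started":"1600000200","build":"some/path/2","name":"TestB","failure_text":"refused"}
`
	failures, err := parseFailures(input)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(len(failures), len(groupByName(failures)))
}
```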
previous Flag
See Main Output.
owners Flag
{
  string: [
    string,
    ...
  ],
  ...
}
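A mapping of this shape could be used to annotate a test with an owner roughly as sketched below. This is an illustration only: parseOwners and ownerFor are hypothetical names, matching with strings.HasPrefix is an assumption based on the term "owner prefixes", and returning an empty string when nothing matches is a choice made for this sketch.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
	"strings"
)

// parseOwners decodes a file mapping each owner to its list of prefixes.
func parseOwners(data []byte) (map[string][]string, error) {
	owners := map[string][]string{}
	if err := json.Unmarshal(data, &owners); err != nil {
		return nil, err
	}
	return owners, nil
}

// ownerFor returns the first owner (in alphabetical order) with a prefix
// that matches the start of testName, or "" if none matches.
func ownerFor(testName string, owners map[string][]string) string {
	names := make([]string, 0, len(owners))
	for name := range owners {
		names = append(names, name)
	}
	sort.Strings(names) // fixed order so the result is deterministic
	for _, name := range names {
		for _, prefix := range owners[name] {
			if strings.HasPrefix(testName, prefix) {
				return name
			}
		}
	}
	return ""
}

func main() {
	// Made-up mapping in the shape described above.
	owners, err := parseOwners([]byte(`{"sig-testing":["[sig-testing]"],"sig-node":["[sig-node]"]}`))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(ownerFor("[sig-node] Pods should start", owners))
}
```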
output_slices Flag
See Main Output. This is only a subset of the main output.
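As a rough illustration of slicing, the sketch below fills in the PREFIX placeholder of an output_slices pattern and buckets cluster IDs by their first two characters. The bucketing rule is an assumption: if cluster IDs are hexadecimal strings, two leading hex characters give 16 * 16 = 256 buckets, which matches the slice count mentioned earlier, but the summarizer's actual rule is not described here. The function names and sample IDs are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// slicePath substitutes an identifier for PREFIX in the output_slices pattern.
func slicePath(pattern, id string) string {
	return strings.ReplaceAll(pattern, "PREFIX", id)
}

// sliceByIDPrefix groups cluster IDs by their first two characters.
// Assumption: IDs are hex strings, so this yields at most 256 groups.
func sliceByIDPrefix(ids []string) map[string][]string {
	slices := map[string][]string{}
	for _, id := range ids {
		if len(id) < 2 {
			continue // too short to have a two-character prefix
		}
		prefix := id[:2]
		slices[prefix] = append(slices[prefix], id)
	}
	return slices
}

func main() {
	pattern := "slices/failure_data_PREFIX.json"
	ids := []string{"ab12ef", "ab99aa", "0f3c21"} // made-up cluster IDs
	for prefix, group := range sliceByIDPrefix(ids) {
		fmt.Println(slicePath(pattern, prefix), len(group))
	}
}
```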
See: package.json + ./hack/build/ensure-node_modules.sh
Triage runs as static HTML hosted in GCS that is updated as part of a Prow Periodic.
To update the triage image, run make push from ./triage, which will trigger a cloudbuild using //images/builder. This will result in a fresh triage image within the cloud image registry of the k8s-testimages project. (See Container Registry -> Images)
To update the Triage frontend in Production or Staging manually, run make push-static or make push-staging respectively. Otherwise it is updated on postsubmit via post-test-infra-upload-triage.
To access staging see Triage Staging.