Cache first failure building an overlay base DB to avoid repeated failures by henrymercer · Pull Request #3487 · github/codeql-action

henrymercer · 2026-02-17T15:57:37Z

When overlay analysis (improved incremental analysis) fails on a runner — typically due to insufficient disk space — this PR records that failure in the Actions cache so that subsequent runs will skip overlay analysis automatically until something changes (e.g. a larger runner is provisioned or a new CodeQL version is released).

See the backlinked internal issue for more information.

I recommend reviewing the first commit separately from the rest as this moves the overlay utilities into their own directory.

Risk assessment

For internal use only. Please select the risk level of this change:

Low risk: Changes are fully under feature flags, or have been fully tested and validated in pre-production environments and are highly observable, or are documentation or test only.

Which use cases does this change impact?

Workflow types:

Advanced setup - Impacts users who have custom CodeQL workflows.
Managed - Impacts users with dynamic workflows (Default Setup, CCR, ...).

Products:

Code Scanning - The changes impact analyses when analysis-kinds: code-scanning.

Environments:

Dotcom - Impacts CodeQL workflows on github.com and/or GitHub Enterprise Cloud with Data Residency.

How did/will you validate this change?

Test repository - This change will be tested on a test repository before merging.
Unit tests - I am depending on unit test coverage (i.e. tests in .test.ts files).

If something goes wrong after this change is released, what are the mitigation and rollback strategies?

Feature flags - All new or changed code paths can be fully disabled with corresponding feature flags.

How will you know if something goes wrong after this change is released?

Telemetry - I rely on existing telemetry or have made changes to the telemetry.
- Dashboards - I will watch relevant dashboards for issues after the release. Consider whether this requires this change to be released at a particular time rather than as part of a regular release.
- Alerts - New or existing monitors will trip if something goes wrong with this change.

Are there any special considerations for merging or releasing this change?

No special considerations - This change can be merged at any time.

Merge / deployment checklist

Confirm this change is backwards compatible with existing workflows.
Consider adding a changelog entry for this change.
Confirm the readme and docs have been updated if necessary.

Copilot

Pull request overview

This pull request implements a caching mechanism to avoid repeated failures when overlay analysis (improved incremental analysis) fails on a runner due to insufficient resources, typically disk space. The PR introduces a status tracking system that records failures in the Actions cache, allowing subsequent runs to skip overlay analysis automatically until conditions change (e.g., runner upgrade or new CodeQL version).

Changes:

Added a new src/overlay/status.ts module to track and persist overlay analysis failures via Actions cache
Modified src/init-action-post-helper.ts to record failures when overlay base database builds are unsuccessful
Updated src/config-utils.ts to check cached status and skip overlay analysis when previous failures are detected
Added two new feature flags: OverlayAnalysisStatusCheck and OverlayAnalysisStatusSave
Modified bundleDb function signature to accept an includeDiagnostics parameter
Reorganized overlay-related imports from overlay-database-utils to overlay module structure

Reviewed changes

Copilot reviewed 29 out of 31 changed files in this pull request and generated no comments.

Show a summary per file

File	Description
src/overlay/status.ts	New module implementing status persistence using Actions cache with timeout handling
src/overlay/status.test.ts	Comprehensive unit tests for the new status tracking functionality
src/init-action-post-helper.ts	Integration to save failure status after unsuccessful overlay-base builds
src/init-action-post-helper.test.ts	Tests verifying status saving under different conditions (failure, success, disabled)
src/config-utils.ts	Integration to check cached status and skip overlay analysis when indicated
src/config-utils.test.ts	Tests for skipping overlay analysis based on cached status
src/feature-flags.ts	Added two new feature flags for status check and save operations
src/util.ts	Modified `bundleDb` signature to accept `includeDiagnostics` parameter
src/database-upload.ts	Updated `bundleDb` call to pass `includeDiagnostics: false`
src/debug-artifacts.ts	Updated `bundleDb` call to pass `includeDiagnostics: true`
src/doc-url.ts	Added new documentation URL for deleting Actions cache entries
src/testing-utils.ts	Updated import path from `overlay-database-utils` to `overlay`
src/status-report.ts	Updated import path from `overlay-database-utils` to `overlay`
src/overlay/index.ts	Updated import paths to use relative paths from parent directory
src/overlay/index.test.ts	Updated imports to match new module structure
src/init-action.ts	Updated import path from `overlay-database-utils` to `overlay`
src/analyze.ts	Updated import path from `overlay-database-utils` to `overlay`
src/analyze-action.ts	Updated import path from `overlay-database-utils` to `overlay`
src/codeql.ts	Updated `databaseBundle` signature to include `includeDiagnostics` parameter
lib/*	Generated JavaScript files reflecting all TypeScript changes

mbg

Thanks for putting this together! There's a lot going on here, including some things I like a lot. I have left a bunch of detailed comments, and there are also some high-level points:

- From previous experience, we know that we need to be careful about the construction of cache keys. We should consider more thoroughly what we need to include in the keys here so that we don't shoot ourselves in the foot. In particular, I'd like us to better identify the runner and the analysis (thinking about Advanced Setup), and to guard against changes we need to make to the implementation.
- Caches can interact poorly with feature flags if the feature flags affect what might be in a cache. We currently have a number of FFs which affect the overlay analysis behaviour and analysis in general. We might want to include these in the cache key so that we avoid, for example, rolling out a feature flag that breaks all overlay base database builds, rolling it back, and then being stuck for two weeks with caches that indicate failure.
- The decision whether a base database build failed is currently local to a single workflow run. Consider a scenario where we successfully built an overlay base database with CodeQL version X and uploaded it. Now we are running again for a new commit, but building the overlay base database fails with the same CodeQL version, perhaps due to an intermittent failure. We upload the status file, which blocks all future overlay base database builds for this CodeQL version and stops all PR runs from even trying to download the existing base database, which is still in the cache. Perhaps it would be worth checking whether there is an existing base database for the same CodeQL version in the cache?
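To make the first two points concrete, here is a minimal sketch of a key that folds in a schema version, an analysis identifier, and the relevant feature flags. Every name below (`OverlayStatusKeyInputs`, `makeOverlayStatusCacheKey`, `analysisId`, the flag names) is hypothetical and not taken from the PR; only the `codeql-overlay-status-` prefix appears in the diff.

```typescript
// Hypothetical inputs to the cache key; none of these names come from the PR.
interface OverlayStatusKeyInputs {
  schemaVersion: number; // bump when the status format or logic changes
  analysisId: string; // distinguishes analyses, e.g. in Advanced Setup
  languages: string[];
  codeQlVersion: string;
  diskSpaceBucket: string; // e.g. "140GB"
  featureFlags: Record<string, boolean>; // flags that affect overlay behaviour
}

function makeOverlayStatusCacheKey(inputs: OverlayStatusKeyInputs): string {
  // Sort copies so the key does not depend on the caller's ordering.
  const languages = [...inputs.languages].sort().join("_");
  const flags = Object.keys(inputs.featureFlags)
    .sort()
    .map((name) => `${name}=${inputs.featureFlags[name] ? 1 : 0}`)
    .join("+");
  return [
    "codeql-overlay-status",
    `v${inputs.schemaVersion}`,
    inputs.analysisId,
    languages,
    inputs.codeQlVersion,
    inputs.diskSpaceBucket,
    flags,
  ].join("-");
}

const exampleKey = makeOverlayStatusCacheKey({
  schemaVersion: 1,
  analysisId: "example-analysis",
  languages: ["python", "go"],
  codeQlVersion: "A.B.C",
  diskSpaceBucket: "140GB",
  featureFlags: { example_overlay_flag: true },
});
console.log(exampleKey);
```

With a key of this shape, rolling back a feature flag produces a different key, so a failure recorded while the flag was enabled would no longer match.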

mbg · 2026-02-17T20:53:37Z

src/overlay/index.ts

I like this reorganisation, especially using index.ts to allow import ... from "./overlay"

mbg · 2026-02-17T20:54:28Z

src/overlay/status.ts

+/*
+ * We perform enablement checks for overlay analysis to avoid using it on runners that are too small
+ * to support it. However these checks cannot avoid every potential issue without being overly
+ * conservative. Therefore, if our enablement checks enable overlay analysis for a runner that is
+ * too small, we want to remember that, so that we will not try to use overlay analysis until
+ * something changes (e.g. a larger runner is provisioned, or a new CodeQL version is released).
+ *
+ * We use the Actions cache as a lightweight way of providing this functionality.
+ */


This sort of top-of-file comment capturing the motivation / design is great. We should do more of this.

mbg · 2026-02-17T21:01:22Z

src/overlay/status.ts

+  //
+  // Limitation: this can still flip from "too small" to "large enough" and back again if the disk
+  // space fluctuates above and below a multiple of 10 GB.
+  const diskSpaceToNearest10Gb = `${10 * Math.floor(diskUsage.numTotalBytes / (10 * 1024 * 1024 * 1024))}GB`;


Design question: this fundamentally assumes that the CodeQL analysis typically runs on comparable runners. That is, the assumption is that the runner specs stay the same unless the total amount of disk space is deliberately increased. Practically speaking, I'd expect that to be the case as well. However, I am not sure whether it is necessarily the case, or whether this is an assumption we have made previously.

My concern is that, if a customer has a runner group for CodeQL containing runners with different specs, we might flip-flop on this. I think you express that in the "Limitation" part of the comment. That wouldn't be a great experience for a customer if it happened. Is there a way we can mitigate this?
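The flip-flop can be illustrated with the rounding expression quoted above. This is a sketch only: the function wrapper and the two runner sizes are made up for the example.

```typescript
// Same arithmetic as the quoted line: round total disk space down to a multiple of 10 GB.
function diskSpaceToNearest10Gb(numTotalBytes: number): string {
  return `${10 * Math.floor(numTotalBytes / (10 * 1024 * 1024 * 1024))}GB`;
}

// Two hypothetical runners in the same group whose disks differ by 1 GiB.
const smallerRunnerBytes = 139.5 * 1024 * 1024 * 1024;
const largerRunnerBytes = 140.5 * 1024 * 1024 * 1024;

console.log(diskSpaceToNearest10Gb(smallerRunnerBytes)); // "130GB"
console.log(diskSpaceToNearest10Gb(largerRunnerBytes)); // "140GB"
```

Because the two runners land in different buckets, a failure recorded on one would not be found from the other, and consecutive runs could alternate between skipping and attempting overlay analysis.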

mbg · 2026-02-17T21:04:23Z

src/overlay/status.test.ts

+
+function makeDiskUsage(totalGiB: number): DiskUsage {
+  return {
+    numTotalBytes: totalGiB * 1024 * 1024 * 1024,


Minor: could we define 1024 * 1024 * 1024 as a constant somewhere since it's used in this test and also in the code being tested, or take advantage of some more strongly typed library for units?
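One possible shape for this, assuming a shared constant that both the test helper and the code under test import. The name `BYTES_PER_GIB`, the reduced `DiskUsage` type, and the `diskSpaceBucket` wrapper are placeholders for illustration.

```typescript
// Hypothetical shared constant; the name is not taken from the PR.
const BYTES_PER_GIB = 1024 * 1024 * 1024;

// Minimal stand-in for the DiskUsage type, limited to the field quoted above.
interface DiskUsage {
  numTotalBytes: number;
}

// Test helper, expressed with the shared constant.
function makeDiskUsage(totalGiB: number): DiskUsage {
  return { numTotalBytes: totalGiB * BYTES_PER_GIB };
}

// Bucketing logic under test, expressed with the same constant.
function diskSpaceBucket(diskUsage: DiskUsage): string {
  return `${10 * Math.floor(diskUsage.numTotalBytes / (10 * BYTES_PER_GIB))}GB`;
}

console.log(diskSpaceBucket(makeDiskUsage(25))); // "20GB"
```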

mbg · 2026-02-17T22:38:25Z

src/overlay/status.ts

+  try {
+    const foundKey = await waitForResultWithTimeLimit(
+      MAX_CACHE_OPERATION_MS,
+      actionsCache.restoreCache([statusFile], cacheKey),


The paths to restore are an implicit component of the cache key. In this case, if statusFile is different between store and restore, then the cache won't get restored here. Since the path depends on getTemporaryDirectory(), we are dependent on that returning the same path every time.

I see that our dependency caching implementation also relies on getTemporaryDirectory() returning the same path for Java/C#. We should probably check that this doesn't cause any issues and ideally move to something that's more reliably stable.
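A toy model of the concern, assuming (as described above) that the paths passed to the cache form part of an entry's identity. `ToyCache` is an illustration only and does not reproduce the `@actions/cache` implementation; the paths and key are invented.

```typescript
// Simplified model: an entry is identified by its key AND the paths it was saved with.
class ToyCache {
  private readonly entries: Record<string, string> = {};

  private static entryId(paths: string[], key: string): string {
    return `${key}|${paths.join("|")}`;
  }

  save(paths: string[], key: string, contents: string): void {
    this.entries[ToyCache.entryId(paths, key)] = contents;
  }

  restore(paths: string[], key: string): string | undefined {
    return this.entries[ToyCache.entryId(paths, key)];
  }
}

const toyCache = new ToyCache();
const statusKey = "codeql-overlay-status-example";

// Saved while the temporary directory was /tmp/run-1 (hypothetical path).
toyCache.save(["/tmp/run-1/overlay-status.json"], statusKey, "failed");

// Same key and same path: the entry is found.
const restoredSamePath = toyCache.restore(["/tmp/run-1/overlay-status.json"], statusKey);
// Same key but a different temporary directory: the entry is not found.
const restoredOtherPath = toyCache.restore(["/tmp/run-2/overlay-status.json"], statusKey);

console.log(restoredSamePath, restoredOtherPath);
```

If the status file lived at a fixed, run-independent location, the lookup would no longer depend on what `getTemporaryDirectory()` returns.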

mbg · 2026-02-17T23:51:57Z

src/config-utils.ts

+            `Improved incremental analysis was skipped because it failed previously on this runner. ` +
+            "Improved incremental analysis may require a significant amount of disk space on some repositories. " +
+            "If you want to enable improved incremental analysis, increase the disk space available " +
+            "to the runner. If that doesn't help, contact GitHub Support for further assistance.\n\n" +
+            "Improved incremental analysis will be automatically retried when the next version of CodeQL is released. " +
+            `You can also manually trigger a retry by [removing](${DocUrl.DELETE_ACTIONS_CACHE_ENTRIES}) \`codeql-overlay-status-*\` entries from the Actions cache.`,


A few thoughts here:

Consider including the CodeQL CLI version somewhere?

"this runner" might mislead someone into thinking that we have identified the exact runner (based on name, IP, ...).

Suggested change

-            `Improved incremental analysis was skipped because it failed previously on this runner. ` +
-            "Improved incremental analysis may require a significant amount of disk space on some repositories. " +
-            "If you want to enable improved incremental analysis, increase the disk space available " +
-            "to the runner. If that doesn't help, contact GitHub Support for further assistance.\n\n" +
-            "Improved incremental analysis will be automatically retried when the next version of CodeQL is released. " +
-            `You can also manually trigger a retry by [removing](${DocUrl.DELETE_ACTIONS_CACHE_ENTRIES}) \`codeql-overlay-status-*\` entries from the Actions cache.`,
+            `Improved incremental analysis was skipped because it previously failed for this repository with CodeQL version A.B.C on a runner with the same hardware resources. ` +
+            "Improved incremental analysis may require a significant amount of disk space for some repositories. " +
+            "If you want to enable improved incremental analysis, increase the disk space available " +
+            "to the runner. If that doesn't help, contact GitHub Support for further assistance.\n\n" +
+            "Improved incremental analysis will be automatically retried when the next version of CodeQL is released. " +
+            `You can also manually trigger a retry by [removing](${DocUrl.DELETE_ACTIONS_CACHE_ENTRIES}) \`codeql-overlay-status-*\` entries from the Actions cache.`,

mbg · 2026-02-17T23:55:11Z

src/init-action-post-helper.ts

+  if (
+    config.overlayDatabaseMode === OverlayDatabaseMode.OverlayBase &&
+    process.env[EnvVar.ANALYZE_DID_COMPLETE_SUCCESSFULLY] !== "true" &&
+    (await features.getValue(Feature.OverlayAnalysisStatusSave))
+  ) {


Minor: The rest of this function only does something if these conditions are true. Consider changing this so that the function returns early if the conditions are false, so that the entire rest of the function isn't in the if body.
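A sketch of the early-return shape, using stand-in types so that it runs on its own. The `GuardInputs` interface, the `recordOverlayStatusIfNeeded` name, and the `save` callback are hypothetical; the three conditions mirror the quoted `if` and are evaluated in the same order, so the feature flag is only queried when the first two checks pass.

```typescript
// Stand-in for the real enum; only the OverlayBase member is taken from the diff.
enum OverlayDatabaseMode {
  None = "none",
  Overlay = "overlay",
  OverlayBase = "overlay-base",
}

// Hypothetical bundle of the values the quoted condition reads.
interface GuardInputs {
  overlayDatabaseMode: OverlayDatabaseMode;
  analyzeDidCompleteSuccessfully: boolean;
  isStatusSaveEnabled: () => Promise<boolean>;
}

// Returns true if the status was saved, false if a guard returned early.
async function recordOverlayStatusIfNeeded(
  inputs: GuardInputs,
  save: () => Promise<void>,
): Promise<boolean> {
  if (inputs.overlayDatabaseMode !== OverlayDatabaseMode.OverlayBase) {
    return false;
  }
  if (inputs.analyzeDidCompleteSuccessfully) {
    return false;
  }
  if (!(await inputs.isStatusSaveEnabled())) {
    return false;
  }

  // The main body is no longer nested inside the condition.
  await save();
  return true;
}
```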

mbg · 2026-02-17T23:57:17Z

src/init-action-post-helper.ts

+    const diskUsage = await checkDiskUsage(logger);
+    if (diskUsage === undefined) {
+      logger.warning(
+        "Failed to determine disk usage, so unable to save overlay status to the Actions cache.",


Minor: avoid using "so" as a connecting word instead of an inverted "because":

Suggested change

-        "Failed to determine disk usage, so unable to save overlay status to the Actions cache.",
+        "Unable to save overlay status to the Actions cache, because the available disk space could not be determined.",

mbg · 2026-02-17T23:58:27Z

src/init-action-post-helper.ts

+      logger.error(
+        "This job attempted to run with improved incremental analysis but it did not complete successfully. " +
+          "This may have been due to disk space constraints: using improved incremental analysis can " +
+          "require a significant amount of disk space for some repositories. " +


It would probably be useful to log this part regardless of whether the status is successfully uploaded.

mbg · 2026-02-17T23:58:43Z

src/init-action-post-helper.ts

+          "This may have been due to disk space constraints: using improved incremental analysis can " +
+          "require a significant amount of disk space for some repositories. " +
+          "This failure has been recorded in the Actions cache, so " +
+          "rerunning this job will run a new CodeQL analysis without improved incremental analysis. " +


Rerunning the job is not an option for default setup.

henrymercer added 20 commits February 17, 2026 15:54

Create separate directory for overlay source code

d1bdc0e

Compute cache key for overlay language status

d28d996

Add save and restore methods

69c2819

Generalise status to multiple languages

e275d63

Skip overlay analysis based on cached status

ebad062

Save overlay status to Actions cache

96961e0

Introduce feature flags for saving and checking status

827bba6

Be more explicit about attempt to build overlay DB

6c405c2

Sort doc URLs

0c47ae1

Add status page diagnostic when overlay skipped

7b7a951

Only store overlay status if analysis failed

ef58c00

Improve diagnostic message wording

cc0dce0

Tweak diagnostic message

d24014a

Improve error messages

3dd1275

More error message improvements

554b931

Include diagnostics in bundle

5c583bb

Avoid mutating languages array in overlay status functions

05d4e25

Use [...languages].sort() instead of languages.sort() to avoid mutating the caller's array as a side effect.
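A small illustration of the difference this commit relies on; the language list is arbitrary.

```typescript
const languages = ["python", "javascript", "go"];

// Sorting a copy leaves the caller's array untouched.
const sortedCopy = [...languages].sort();
const orderAfterCopy = languages.join(","); // "python,javascript,go"

// Array.prototype.sort sorts in place and returns the same array object.
const sortedInPlace = languages.sort();
const orderAfterInPlace = languages.join(","); // "go,javascript,python"

console.log(sortedCopy.join(","), orderAfterCopy, orderAfterInPlace, sortedInPlace === languages);
```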

Add tests for shouldSkipOverlayAnalysis

657f337

Extract status file path helper

fa56ea8

Improve log message

898ae16

github-actions bot added the size/XL May be very hard to review label Feb 17, 2026

henrymercer marked this pull request as ready for review February 17, 2026 18:11

henrymercer requested a review from a team as a code owner February 17, 2026 18:11

Copilot AI review requested due to automatic review settings February 17, 2026 18:11

Copilot started reviewing on behalf of henrymercer February 17, 2026 18:12

Copilot AI reviewed Feb 17, 2026

mbg reviewed Feb 18, 2026

henrymercer wants to merge 20 commits into main from henrymercer/overlay-status
