Cache first failure building an overlay base DB to avoid repeated failures #3487
henrymercer wants to merge 20 commits into main
Use [...languages].sort() instead of languages.sort() to avoid mutating the caller's array as a side effect.
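A minimal sketch of the difference, using made-up language values rather than anything from the PR: `Array.prototype.sort` sorts in place and returns the same array, so copying first is what protects the caller.

```typescript
// Hypothetical input; in the PR the array is supplied by the caller.
const languages: string[] = ["python", "cpp", "javascript"];

// Copy first, then sort: the original array keeps its order.
const sortedCopy = [...languages].sort();

// Sorting in place reorders the array that was passed in and returns that same array.
const callerArray: string[] = ["python", "cpp", "javascript"];
const sortedInPlace = callerArray.sort();
```

After this runs, `languages` is unchanged, while `callerArray` and `sortedInPlace` refer to the same, now reordered, array.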
Pull request overview
This pull request implements a caching mechanism to avoid repeated failures when overlay analysis (improved incremental analysis) fails on a runner due to insufficient resources, typically disk space. The PR introduces a status tracking system that records failures in the Actions cache, allowing subsequent runs to skip overlay analysis automatically until conditions change (e.g., runner upgrade or new CodeQL version).
Changes:
- Added a new src/overlay/status.ts module to track and persist overlay analysis failures via Actions cache
- Modified src/init-action-post-helper.ts to record failures when overlay base database builds are unsuccessful
- Updated src/config-utils.ts to check cached status and skip overlay analysis when previous failures are detected
- Added two new feature flags: OverlayAnalysisStatusCheck and OverlayAnalysisStatusSave
- Modified bundleDb function signature to accept an includeDiagnostics parameter
- Reorganized overlay-related imports from overlay-database-utils to overlay module structure
Reviewed changes
Copilot reviewed 29 out of 31 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| src/overlay/status.ts | New module implementing status persistence using Actions cache with timeout handling |
| src/overlay/status.test.ts | Comprehensive unit tests for the new status tracking functionality |
| src/init-action-post-helper.ts | Integration to save failure status after unsuccessful overlay-base builds |
| src/init-action-post-helper.test.ts | Tests verifying status saving under different conditions (failure, success, disabled) |
| src/config-utils.ts | Integration to check cached status and skip overlay analysis when indicated |
| src/config-utils.test.ts | Tests for skipping overlay analysis based on cached status |
| src/feature-flags.ts | Added two new feature flags for status check and save operations |
| src/util.ts | Modified bundleDb signature to accept includeDiagnostics parameter |
| src/database-upload.ts | Updated bundleDb call to pass includeDiagnostics: false |
| src/debug-artifacts.ts | Updated bundleDb call to pass includeDiagnostics: true |
| src/doc-url.ts | Added new documentation URL for deleting Actions cache entries |
| src/testing-utils.ts | Updated import path from overlay-database-utils to overlay |
| src/status-report.ts | Updated import path from overlay-database-utils to overlay |
| src/overlay/index.ts | Updated import paths to use relative paths from parent directory |
| src/overlay/index.test.ts | Updated imports to match new module structure |
| src/init-action.ts | Updated import path from overlay-database-utils to overlay |
| src/analyze.ts | Updated import path from overlay-database-utils to overlay |
| src/analyze-action.ts | Updated import path from overlay-database-utils to overlay |
| src/codeql.ts | Updated databaseBundle signature to include includeDiagnostics parameter |
| lib/* | Generated JavaScript files reflecting all TypeScript changes |
mbg left a comment
Thanks for putting this together! There's a lot going on here, including some things I like a lot. I have left a bunch of detailed comments, and there are also some high-level points:
- From previous experience, we know that we need to be careful about the construction of cache keys. We should consider more thoroughly what we need to include in the keys here to not shoot ourselves in the foot. In particular, I'd like us to better identify the runner and the analysis (thinking about Advanced Setup), and to guard against changes we need to make to the implementation.
- Caches can interact poorly with feature flags, if the feature flags affect what might be in a cache. We currently have a number of FFs which affect the overlay analysis behaviour and analysis in general. We might want to include these in the cache key so that e.g. we don't roll out a feature flag that breaks all overlay base database builds, then roll back the feature flag, but are stuck with caches that indicate failure for two weeks.
- The decision whether a base database build failed is currently local to a single workflow run. Consider a scenario where we successfully built an overlay base database with CodeQL version X and uploaded it. Now we are running again for a new commit, but building the overlay base database fails with the same CodeQL version -- perhaps due to an intermittent failure. We upload the status file and block all future base overlay db builds for this CodeQL version and all PR runs from even trying to download the existing base db, which is still in the cache. Perhaps it would be worth checking whether there is an existing base db for the same CodeQL version in the cache?
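As a rough illustration of the first two points, a cache key could fold in a schema version, an analysis identifier, and the relevant feature flags. This is only a sketch: apart from the `codeql-overlay-status-` prefix mentioned in the PR, every name and value below (`OverlayStatusKeyInputs`, `makeOverlayStatusCacheKey`, the flag names, the version string) is hypothetical and not the PR's implementation.

```typescript
// Hypothetical inputs to the key; none of these names come from the PR.
interface OverlayStatusKeyInputs {
  schemaVersion: number; // bump when the status file format or its meaning changes
  codeQlVersion: string;
  languages: string[];
  diskSpaceBucket: string; // e.g. "10GB"
  analysisId: string; // e.g. workflow plus job name, to tell Advanced Setup analyses apart
  featureFlags: Record<string, boolean>; // flags that influence overlay behaviour
}

function makeOverlayStatusCacheKey(inputs: OverlayStatusKeyInputs): string {
  // Sort flag names and languages so that ordering differences do not change the key.
  const flagSummary = Object.keys(inputs.featureFlags)
    .sort()
    .map((name) => `${name}=${inputs.featureFlags[name]}`)
    .join(",");
  return [
    "codeql-overlay-status",
    `v${inputs.schemaVersion}`,
    [...inputs.languages].sort().join("_"),
    inputs.codeQlVersion,
    inputs.diskSpaceBucket,
    inputs.analysisId,
    flagSummary,
  ].join("-");
}

const exampleKey = makeOverlayStatusCacheKey({
  schemaVersion: 1,
  codeQlVersion: "2.20.0", // made-up version
  languages: ["python", "cpp"],
  diskSpaceBucket: "10GB",
  analysisId: "codeql.yml/analyze", // made-up identifier
  featureFlags: { flag_b: false, flag_a: true }, // made-up flag names
});
```

With a key like this, rolling a feature flag back changes the key, so a failure recorded while the flag was on would no longer match.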
I like this reorganisation, especially using index.ts to allow import ... from "./overlay"
| /* | ||
| * We perform enablement checks for overlay analysis to avoid using it on runners that are too small | ||
| * to support it. However these checks cannot avoid every potential issue without being overly | ||
| * conservative. Therefore, if our enablement checks enable overlay analysis for a runner that is | ||
| * too small, we want to remember that, so that we will not try to use overlay analysis until | ||
| * something changes (e.g. a larger runner is provisioned, or a new CodeQL version is released). | ||
| * | ||
| * We use the Actions cache as a lightweight way of providing this functionality. | ||
| */ |
This sort of top-of-file comment capturing the motivation / design is great. We should do more of this.
| // | ||
| // Limitation: this can still flip from "too small" to "large enough" and back again if the disk | ||
| // space fluctuates above and below a multiple of 10 GB. | ||
| const diskSpaceToNearest10Gb = `${10 * Math.floor(diskUsage.numTotalBytes / (10 * 1024 * 1024 * 1024))}GB`; |
Design question: this fundamentally assumes that the CodeQL analysis typically runs on comparable runners. I.e. the assumption is that unless the amount of total disk space is increased deliberately, the runner specs are the same. Practically speaking, I'd expect that to be the case as well. However, I am not sure whether it is necessarily the case or this is an assumption we have made previously.
My concern is that, if a customer has a runner group for CodeQL containing runners with different specs, we might flip-flop on this -- I think you express that in the "Limitation" part of the comment. That wouldn't be a great experience for a customer if it happened. Is there a way we can mitigate this?
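To make the flip-flop concrete, here is a small sketch built around the quoted expression. The function name `diskSpaceBucket` and the runner sizes are illustrative assumptions.

```typescript
const BYTES_PER_GIB = 1024 * 1024 * 1024;

// Same arithmetic as the quoted line: round total disk space down to a multiple of 10 GB.
function diskSpaceBucket(numTotalBytes: number): string {
  return `${10 * Math.floor(numTotalBytes / (10 * BYTES_PER_GIB))}GB`;
}

// Two hypothetical runners in the same runner group with slightly different disks.
const smallerRunner = diskSpaceBucket(19 * BYTES_PER_GIB); // "10GB"
const largerRunner = diskSpaceBucket(21 * BYTES_PER_GIB); // "20GB"
```

Because the two runners land in different buckets, a failure recorded on one would not be seen by a job that happens to run on the other.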
| function makeDiskUsage(totalGiB: number): DiskUsage { | ||
| return { | ||
| numTotalBytes: totalGiB * 1024 * 1024 * 1024, |
Minor: could we define 1024 * 1024 * 1024 as a constant somewhere since it's used in this test and also in the code being tested, or take advantage of some more strongly typed library for units?
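One possible shape for this, as a sketch only: the constant and helper names below are hypothetical, and the `DiskUsage` interface is a minimal stand-in with just the field that appears in the quoted test.

```typescript
// Hypothetical shared constant and helper; the names are illustrative.
const BYTES_PER_GIB = 1024 * 1024 * 1024;

function gibToBytes(gib: number): number {
  return gib * BYTES_PER_GIB;
}

// Minimal stand-in for the DiskUsage type used in the quoted test.
interface DiskUsage {
  numTotalBytes: number;
}

// The test helper can then avoid repeating the literal multiplication.
function makeDiskUsage(totalGiB: number): DiskUsage {
  return { numTotalBytes: gibToBytes(totalGiB) };
}
```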
| try { | ||
| const foundKey = await waitForResultWithTimeLimit( | ||
| MAX_CACHE_OPERATION_MS, | ||
| actionsCache.restoreCache([statusFile], cacheKey), |
The paths to restore are an implicit component of the cache key. In this case, if statusFile is different between store and restore, then the cache won't get restored here. Since the path depends on getTemporaryDirectory(), we are dependent on that returning the same path every time.
I see that our dependency caching implementation also relies on getTemporaryDirectory() returning the same path for Java/C#. We should probably check that this doesn't cause any issues and ideally move to something that's more reliably stable.
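The point about paths can be pictured with a deliberately simplified model: a cache entry matches only if both the key and a version derived from the paths match. This is not the Actions cache library's code (which, as far as I know, hashes the paths together with other components); the types, function names, paths, and key below are all made up for illustration.

```typescript
// Simplified model: an entry is identified by its key plus a version derived from the paths.
interface CacheEntryId {
  key: string;
  version: string;
}

function entryId(key: string, paths: string[]): CacheEntryId {
  // A plain join is enough here to show that the paths take part in matching.
  return { key, version: paths.join("|") };
}

function wouldRestore(saved: CacheEntryId, requested: CacheEntryId): boolean {
  return saved.key === requested.key && saved.version === requested.version;
}

const key = "codeql-overlay-status-example"; // made-up key
const saved = entryId(key, ["/home/runner/work/_temp/overlay-status.json"]);
const sameTempDir = entryId(key, ["/home/runner/work/_temp/overlay-status.json"]);
const otherTempDir = entryId(key, ["/tmp/elsewhere/overlay-status.json"]);

const restoredWithSamePath = wouldRestore(saved, sameTempDir);
const restoredWithOtherPath = wouldRestore(saved, otherTempDir);
```

In this model the second lookup misses even though the key is identical, which is the failure mode to watch for if the temporary directory ever differs between save and restore.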
| `Improved incremental analysis was skipped because it failed previously on this runner. ` + | ||
| "Improved incremental analysis may require a significant amount of disk space on some repositories. " + | ||
| "If you want to enable improved incremental analysis, increase the disk space available " + | ||
| "to the runner. If that doesn't help, contact GitHub Support for further assistance.\n\n" + | ||
| "Improved incremental analysis will be automatically retried when the next version of CodeQL is released. " + | ||
| `You can also manually trigger a retry by [removing](${DocUrl.DELETE_ACTIONS_CACHE_ENTRIES}) \`codeql-overlay-status-*\` entries from the Actions cache.`, |
Few thoughts here:
- Consider including the CodeQL CLI version somewhere?
- "this runner" might be misleading someone into thinking that we have defined the exact runner (based on name, IP, ...).
| `Improved incremental analysis was skipped because it previously failed for this repository with CodeQL version A.B.C on a runner with the same hardware resources. ` + | |
| "Improved incremental analysis may require a significant amount of disk space for some repositories. " + | |
| "If you want to enable improved incremental analysis, increase the disk space available " + | |
| "to the runner. If that doesn't help, contact GitHub Support for further assistance.\n\n" + | |
| "Improved incremental analysis will be automatically retried when the next version of CodeQL is released. " + | |
| `You can also manually trigger a retry by [removing](${DocUrl.DELETE_ACTIONS_CACHE_ENTRIES}) \`codeql-overlay-status-*\` entries from the Actions cache.`, |
| if ( | ||
| config.overlayDatabaseMode === OverlayDatabaseMode.OverlayBase && | ||
| process.env[EnvVar.ANALYZE_DID_COMPLETE_SUCCESSFULLY] !== "true" && | ||
| (await features.getValue(Feature.OverlayAnalysisStatusSave)) | ||
| ) { |
Minor: The rest of this function only does something if these conditions are true. Consider changing this so that the function returns early if the conditions are false, so that the entire rest of the function isn't in the if body.
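A sketch of the early-return shape, using hypothetical stand-ins for the config, environment variable, and feature-flag lookup. The real feature-flag check in the quoted code is awaited; it is modelled as a synchronous callback here to keep the example small.

```typescript
// Hypothetical stand-ins; these are not the PR's actual types.
type OverlayDatabaseMode = "overlay-base" | "overlay" | "none";

interface StatusSaveInputs {
  overlayDatabaseMode: OverlayDatabaseMode;
  analyzeDidCompleteSuccessfully: boolean;
  isStatusSaveEnabled: () => boolean; // asynchronous in the real code
}

// Returns true if the failure status was recorded.
function recordOverlayFailure(inputs: StatusSaveInputs, saveStatus: () => void): boolean {
  // Each guard returns early, so the main logic is not nested inside an if body.
  if (inputs.overlayDatabaseMode !== "overlay-base") {
    return false;
  }
  if (inputs.analyzeDidCompleteSuccessfully) {
    return false;
  }
  // Checked last, mirroring the short-circuit order of the original condition.
  if (!inputs.isStatusSaveEnabled()) {
    return false;
  }
  saveStatus();
  return true;
}
```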
| const diskUsage = await checkDiskUsage(logger); | ||
| if (diskUsage === undefined) { | ||
| logger.warning( | ||
| "Failed to determine disk usage, so unable to save overlay status to the Actions cache.", |
Minor: avoid using "so" as a connecting word instead of an inverted "because":
| "Failed to determine disk usage, so unable to save overlay status to the Actions cache.", | |
| "Unable to save overlay status to the Actions cache, because the available disk space could not be determined.", |
| logger.error( | ||
| "This job attempted to run with improved incremental analysis but it did not complete successfully. " + | ||
| "This may have been due to disk space constraints: using improved incremental analysis can " + | ||
| "require a significant amount of disk space for some repositories. " + |
It would probably be useful to log this part regardless of whether the state is successfully uploaded.
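One way to do that, sketched under assumptions: the explanation is always logged, and only the follow-up sentence depends on whether the status was saved. The function name is hypothetical, and the wording for the unsaved case is invented for this example rather than taken from the PR.

```typescript
// Hypothetical helper; only the explanation and the "recorded" sentence mirror the PR's message.
function buildOverlayFailureMessage(statusSaved: boolean): string {
  const explanation =
    "This job attempted to run with improved incremental analysis but it did not complete successfully. " +
    "This may have been due to disk space constraints: using improved incremental analysis can " +
    "require a significant amount of disk space for some repositories.";
  const followUp = statusSaved
    ? "This failure has been recorded in the Actions cache, so " +
      "rerunning this job will run a new CodeQL analysis without improved incremental analysis."
    : // Invented wording for the case where saving the status failed.
      "This failure could not be recorded in the Actions cache, so " +
      "rerunning this job may attempt improved incremental analysis again.";
  return `${explanation} ${followUp}`;
}
```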
| "This may have been due to disk space constraints: using improved incremental analysis can " + | ||
| "require a significant amount of disk space for some repositories. " + | ||
| "This failure has been recorded in the Actions cache, so " + | ||
| "rerunning this job will run a new CodeQL analysis without improved incremental analysis. " + |
When overlay analysis (improved incremental analysis) fails on a runner — typically due to insufficient disk space — this PR records that failure in the Actions cache so that subsequent runs will skip overlay analysis automatically until something changes (e.g. a larger runner is provisioned or a new CodeQL version is released).
See the backlinked internal issue for more information.
I recommend reviewing the first commit separately from the rest as this moves the overlay utilities into their own directory.
Risk assessment
For internal use only. Please select the risk level of this change:
Which use cases does this change impact?
Workflow types:
- dynamic workflows (Default Setup, CCR, ...).
Products:
- analysis-kinds: code-scanning.
Environments:
- github.com and/or GitHub Enterprise Cloud with Data Residency.
How did/will you validate this change?
- .test.ts files).
If something goes wrong after this change is released, what are the mitigation and rollback strategies?
How will you know if something goes wrong after this change is released?
Are there any special considerations for merging or releasing this change?
Merge / deployment checklist