Limit Konflux build attempts to 1 #2032
base: main
Conversation
/lgtm

/hold
Force-pushed from da22b82 to 5350cf1
New changes are detected. LGTM label has been removed.
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: The full list of commands accepted by this bot can be found here.

Details: Needs approval from an approver in each of these files. Approvers can indicate their approval by writing
Force-pushed from 5350cf1 to b0d239e
Walkthrough

Added tenacity retry decorators to Konflux DB access and to two async methods in the Konflux image builder; removed the builder's internal retry loop, added a retry-backed attestation/signature validation step, and made DB writes conditional on dry-run, with outcome-driven error handling.
Sequence Diagram

```mermaid
sequenceDiagram
    actor Builder
    participant ImageBuilder as KonfluxImageBuilder
    participant PipelineRun
    participant Attestation as Validation
    participant KonfluxDB
    Builder->>ImageBuilder: start_build()
    rect rgb(200,220,255)
        Note over ImageBuilder,PipelineRun: Single start_build invocation (no retry loop)
        ImageBuilder->>PipelineRun: create/start PipelineRun
        PipelineRun-->>ImageBuilder: PipelineRun info
    end
    rect rgb(220,240,220)
        Note over ImageBuilder,Attestation: Validation step (retry-decorated)
        ImageBuilder->>Attestation: _validate_build_attestation_and_signature(image_pullspec, distgit_key)
        Attestation->>Attestation: verify SLSA attestation & source signature (retries)
        alt validation succeeds
            Attestation-->>ImageBuilder: validation OK
        else validation fails
            Attestation-->>ImageBuilder: validation FAILED -> outcome = FAILURE
        end
    end
    rect rgb(240,240,200)
        Note over ImageBuilder,KonfluxDB: Conditional DB update (retry-decorated)
        alt dry_run == false
            ImageBuilder->>KonfluxDB: update_konflux_db(metadata, ..., outcome) (retries)
            KonfluxDB-->>ImageBuilder: updated
        else dry_run == true
            ImageBuilder->>ImageBuilder: skip DB write
        end
    end
    alt outcome == SUCCESS
        ImageBuilder-->>Builder: return success + metadata
    else outcome != SUCCESS
        ImageBuilder-->>Builder: raise KonfluxImageBuildError
    end
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
doozer/doozerlib/backend/konflux_image_builder.py (1)
653-719: `update_konflux_db` retry scope and `build_priority` casting require fixes

Two behavioral issues confirmed in the implementation:

1. `build_priority` may be `None` and cause `int(build_priority)` to fail

   When `self._config.build_priority` defaults to `None`, the logic at lines 173–179 assigns `None` directly to `build_priority` (the else branch executes when not `"auto"`). This `None` value flows through to `update_konflux_db` and reaches line 718, where `int(build_priority)` raises a `TypeError`. The retry decorator then wastes 15 seconds retrying this deterministic crash.

   Fix: Add validation in `build()` to treat `None` as `"auto"`, or raise a clear configuration error early if `build_priority` is not set to an acceptable value.

2. `@retry` retries all exceptions, including persistent deterministic errors

   The decorator has no `retry_on` parameter, so it retries any exception—including `KeyError` from missing Dockerfile labels (lines 674–679) and `ValueError` from schema validation in `add_build()`. These failures are deterministic and won't be fixed by retries; retrying only adds delay and can cause multiple inserts if failure occurs after a partial side-effect.

   Fix: Restrict retries to transient failures only (e.g., BigQuery transport errors), not configuration/validation errors. Check `add_build()` error types and configure `@retry` to only handle known transient exceptions.
🧹 Nitpick comments (2)
artcommon/artcommonlib/konflux/konflux_db.py (1)
967-1024: Narrow retry conditions for `get_build_records_by_nvrs` to avoid retrying non-transient errors

Decorating `get_build_records_by_nvrs` with tenacity is a good fit for transient BigQuery / networking issues, and the aggregation of per-NVR errors into a single `IOError` is clear.

Right now, the default `@retry` behavior will treat all exceptions as retriable, including:

- The local `ValueError` raised when `where` contains `nvr` or `outcome`.
- Any `IOError` coming from strict "record not found" conditions in `get_latest_build` (if this method is used with `strict=True`).

That means callers can see unnecessary 10s+ delays before surfacing deterministic errors.

Consider tightening the retry condition, for example by:

- Using `retry=retry_if_exception_type(<transient-exception-types>)` (e.g. IO/transport/BigQuery client errors), and/or
- Separating "not found" from genuine connectivity failures (e.g. with a custom exception type) so only the latter are retried.

This keeps the new resilience benefits while avoiding retries for parameter validation and expected "not found" cases.
doozer/doozerlib/backend/konflux_image_builder.py (1)
295-337: Scope the retry behavior in `_validate_build_attestation_and_signature` to truly transient failures

Adding `_validate_build_attestation_and_signature` and decorating it with tenacity is a good way to encapsulate SLSA + signature validation and handle transient issues from `fetch_slsa_attestation` / `get_konflux_data`.

A couple of refinements:

1. Retrying deterministic validation errors

   - The function raises `ValueError` for an empty attestation and for a missing `IMAGE_REF` in the tasks.
   - With the default `@retry(reraise=True, ...)`, these `ValueError`s are also treated as retriable and will be re-run 2 extra times with 5-second waits, even though they are deterministic once the attestation payload is fetched.

   To keep retries focused on genuinely transient problems (CLI/network/registry hiccups), consider restricting the retry condition to the expected transient exception types (e.g., `ChildProcessError` from the underlying commands, or specific I/O/transport errors).

2. Exception logging at callsite

   - In `build()`, you currently do `except Exception as e:` and log via `logger.error(...)`. This swallows the traceback and triggers the Ruff hint.
   - Switching to `logger.exception(...)` there would both address the lint warning and make it easier to debug unexpected exceptions from this helper.

These changes keep the resilience benefits but avoid unnecessary delays when validation fails due to data/contract issues rather than transport flakiness.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting
📒 Files selected for processing (2)
- `artcommon/artcommonlib/konflux/konflux_db.py` (2 hunks)
- `doozer/doozerlib/backend/konflux_image_builder.py` (3 hunks)
🧰 Additional context used
🪛 Ruff (0.14.4)
doozer/doozerlib/backend/konflux_image_builder.py
227-227: Do not catch blind exception: Exception
(BLE001)
228-230: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: tests
🔇 Additional comments (3)
artcommon/artcommonlib/konflux/konflux_db.py (1)
20-20: Tenacity import wiring looks fine

Importing `retry`, `stop_after_attempt`, and `wait_fixed` here lines up with the new retry decorator on `get_build_records_by_nvrs` and keeps retry semantics co-located with the DB client.

doozer/doozerlib/backend/konflux_image_builder.py (2)

35-35: Tenacity import aligns with new retryable operations

The `tenacity` import here is consistent with the new retry decorators on `_validate_build_attestation_and_signature`, `update_konflux_db`, and the internal SBOM helper, which matches the PR goal of moving retries closer to the potentially flaky I/O.

570-605: Tenacity on `_get_sbom_with_retry` is a good fit for flaky cosign calls

The inner `_get_sbom_with_retry` helper wrapped with `@retry(stop=stop_after_attempt(3), wait=wait_fixed(5))` is an appropriate place to add retries: it's tightly scoped around the cosign CLI call and SBOM validation and has no additional side effects.

No changes needed here; this is a clean use of tenacity for transient failures.
```python
pipelinerun_info = await self._start_build(
    metadata=metadata,
    build_repo=build_repo,
    building_arches=building_arches,
    output_image=output_image,
    additional_tags=additional_tags,
    nvr=nvr,
    build_priority=build_priority,
    dest_dir=dest_dir,
)
pipelinerun_name = pipelinerun_info.name
record["task_id"] = pipelinerun_name
record["task_url"] = self._konflux_client.resource_url(pipelinerun_info.to_dict())
await self.update_konflux_db(
    metadata,
    build_repo,
    pipelinerun_info,
    KonfluxBuildOutcome.PENDING,
    building_arches,
    build_priority,
)
```
🧩 Analysis chain
Dry-run still writes a PENDING record and error handling around outcome could be tightened
The refactored single-attempt build sequence (start → wait → compute outcome → validate attestation/signature → write final status) is much clearer and matches the intent to avoid multiple PipelineRun creations.
A couple of behavioral issues to consider:
1. Dry-run mode still writes to the Konflux DB (PENDING record)

   - You always call `await self.update_konflux_db(..., outcome=KonfluxBuildOutcome.PENDING, ...)` right after starting the build (Line 194), regardless of `self._config.dry_run`.
   - Later, in Lines 233–238, you skip the final DB write when `dry_run` is true, but the initial PENDING write still happens.
   - The log message "Dry run: Would have inserted build record in Konflux DB" suggests the intent is to avoid DB writes entirely in dry-run mode.

   If dry-run is meant to be side-effect free w.r.t. BigQuery, you likely want either:

   - To guard the initial PENDING write with the same `if self._config.dry_run` condition, or
   - To make `update_konflux_db` itself a no-op (early-return) when `self._config.dry_run` is true.

   Otherwise, a dry-run will leave a PENDING record in the DB, which may pollute metrics and tooling that depend on build status.

2. Outcome / error handling and logging on validation failure

   - When `_validate_build_attestation_and_signature` fails, you set `outcome = KonfluxBuildOutcome.FAILURE` and proceed to the final DB update, which is the right behavior from a releaseability standpoint.
   - However, for failures here you only log `logger.error(...)` without a traceback. Given these errors are often subtle (attestation shape, JSON content, signatures), a full stack trace will make debugging much easier.

   Consider switching that logger call to `logger.exception(...)` so the stack trace is captured, or at least using `LOGGER.exception` with the same message.

Overall flow looks good; fixing the dry-run DB write and improving logging here will better align behavior with expectations and make failures easier to diagnose.

Also applies to: 233-249
I'll investigate the code to verify the concerns raised in the review comment.
---
🏁 Script executed:

```shell
#!/bin/bash
# Extract the function containing lines 181-201 to see full context
cat doozer/doozerlib/backend/konflux_image_builder.py | head -300 | tail -130
```

Length of output: 6143

🏁 Script executed:

```shell
#!/bin/bash
# Show lines 233-249 explicitly
sed -n '233,249p' doozer/doozerlib/backend/konflux_image_builder.py
```

Length of output: 764

🏁 Script executed:

```shell
#!/bin/bash
# Find update_konflux_db method implementation
ast-grep --pattern $'function update_konflux_db($_, $_) {
  $$$
}'
```

Length of output: 0

🏁 Script executed:

```shell
#!/bin/bash
# Search for _validate_build_attestation_and_signature calls and error handling
rg -n '_validate_build_attestation_and_signature' -A 10 -B 2
```

Length of output: 2771

🏁 Script executed:

```shell
#!/bin/bash
# Search for dry_run checks in this file
rg -n 'dry_run' doozer/doozerlib/backend/konflux_image_builder.py -B 2 -A 2
```

Length of output: 834
---

🏁 Script executed:

```shell
#!/bin/bash
# Search for update_konflux_db method definition
rg -n 'def update_konflux_db' -A 15
```

Length of output: 5252

🏁 Script executed:

```shell
#!/bin/bash
# Also check if it's an async method
rg -n 'async def update_konflux_db' -A 15
```

Length of output: 2559

🏁 Script executed:

```shell
#!/bin/bash
# Check the full flow from line 194 to 237 - extract exact line numbers
sed -n '190,240p' doozer/doozerlib/backend/konflux_image_builder.py | cat -n
```

Length of output: 2891
Guard initial PENDING record write with dry-run check, and improve error logging for attestation validation failures
Two confirmed issues:
1. Dry-run writes inconsistent PENDING records to Konflux DB (line 194)

   The unguarded `await self.update_konflux_db(..., KonfluxBuildOutcome.PENDING, ...)` at line 194 always executes, even when `self._config.dry_run` is true. The final DB write (lines 233–238) correctly skips on dry-run, but the initial PENDING write doesn't, leaving stale records. Wrap line 194 with `if not self._config.dry_run:` or move the check inside `update_konflux_db`.

2. Missing stack trace for attestation/signature validation errors (lines 226–230)

   When `_validate_build_attestation_and_signature` fails, the exception is caught and logged using `logger.error(...)` with only the stringified error message. Switch to `logger.exception(...)` to capture the full stack trace, which is essential for debugging signature and attestation shape issues.
🤖 Prompt for AI Agents
In doozer/doozerlib/backend/konflux_image_builder.py around lines 181 to 201,
the initial PENDING DB write at line ~194 is executed even during dry-run and
the attestation/signature validation error logging (around lines 226–230) loses
stack traces; guard the initial update_konflux_db call with a dry-run check
(e.g., if not self._config.dry_run: await self.update_konflux_db(...,
KonfluxBuildOutcome.PENDING, ...)) or move the dry-run check inside
update_konflux_db so no PENDING record is written during dry runs, and replace
logger.error(str(err)) in the exception handler for
_validate_build_attestation_and_signature with logger.exception(...) (or
logger.error(..., exc_info=True)) so the full stack trace is recorded.
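To see concretely what the `logger.error` vs. `logger.exception` distinction buys, here is a minimal, self-contained illustration (standalone demo logger, not the repo's logging setup):

```python
# Minimal illustration: logger.error drops the traceback, while
# logger.exception (or logger.error(..., exc_info=True)) records it.
import io
import logging

stream = io.StringIO()
logging.basicConfig(stream=stream, level=logging.ERROR, force=True)
logger = logging.getLogger("demo")

try:
    raise ValueError("bad attestation payload")
except Exception as e:
    logger.error("validation failed: %s", e)  # message only, no traceback

try:
    raise ValueError("bad attestation payload")
except Exception:
    logger.exception("validation failed")  # message plus full traceback

out = stream.getvalue()
print("Traceback (most recent call last):" in out)  # -> True
```

Inside an `except` block, `logger.exception` automatically attaches the active exception's traceback at ERROR level, which is exactly what the Ruff TRY400 hint is asking for.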
Force-pushed from b0d239e to 9a6d295
Actionable comments posted: 2
♻️ Duplicate comments (1)
doozer/doozerlib/backend/konflux_image_builder.py (1)
181-201: Guard the initial PENDING DB write with dry-run check to prevent stale records.

The initial call to `update_konflux_db` with `KonfluxBuildOutcome.PENDING` (line 194) executes unconditionally, even when `self._config.dry_run` is true. However, the final DB write at lines 233-238 correctly skips updates during dry-run. This inconsistency leaves stale PENDING records in the Konflux DB during dry-run mode, polluting metrics and health checks.

Wrap the initial DB write with a dry-run check:

```diff
 pipelinerun_name = pipelinerun_info.name
 record["task_id"] = pipelinerun_name
 record["task_url"] = self._konflux_client.resource_url(pipelinerun_info.to_dict())
-await self.update_konflux_db(
-    metadata,
-    build_repo,
-    pipelinerun_info,
-    KonfluxBuildOutcome.PENDING,
-    building_arches,
-    build_priority,
-)
+if not self._config.dry_run:
+    await self.update_konflux_db(
+        metadata,
+        build_repo,
+        pipelinerun_info,
+        KonfluxBuildOutcome.PENDING,
+        building_arches,
+        build_priority,
+    )
+else:
+    logger.info("Dry run: Skipping initial PENDING DB write")
```

Alternatively, move the dry-run check inside `update_konflux_db` itself to centralize the logic and ensure all DB writes respect the flag.

Based on learnings
🧹 Nitpick comments (1)
doozer/doozerlib/backend/konflux_image_builder.py (1)
295-296: Retry decorator should use exponential backoff with jitter per best practices.

The current implementation uses `wait_fixed(5)` for all retries. Per Tenacity best practices and AWS guidance, fixed rapid retries can trigger retry storms and thundering-herd problems. Industry standards recommend exponential backoff with jitter (e.g., `wait_random_exponential(multiplier=1, max=30)`) to avoid server overload, especially for external registry operations.

Consider refactoring to:

```python
@retry(reraise=True, wait=wait_random_exponential(multiplier=1, max=30), stop=stop_after_attempt(3) | stop_after_delay(60))
async def _validate_build_attestation_and_signature(self, image_pullspec: str, distgit_key: str):
```

This pattern bounds both attempts and total latency while respecting exponential backoff conventions for registry communication.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting
📒 Files selected for processing (2)
- `artcommon/artcommonlib/konflux/konflux_db.py` (2 hunks)
- `doozer/doozerlib/backend/konflux_image_builder.py` (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- artcommon/artcommonlib/konflux/konflux_db.py
🧰 Additional context used
🪛 Ruff (0.14.4)
doozer/doozerlib/backend/konflux_image_builder.py
227-227: Do not catch blind exception: Exception
(BLE001)
228-230: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: tests
🔇 Additional comments (1)
doozer/doozerlib/backend/konflux_image_builder.py (1)
233-249: LGTM: Correct dry-run handling and outcome processing.

The final DB write correctly guards against dry-run mode, and the outcome handling properly sets build status and raises errors when needed. This pattern should be consistently applied to the initial PENDING write as well (see comment on lines 181-201).
```python
# Even if the build succeeded, if the SLSA attestation cannot be retrieved, it is unreleasable.
if outcome is KonfluxBuildOutcome.SUCCESS:
    results = pipelinerun_info.to_dict().get('status', {}).get('results', [])
    image_pullspec = next((r['value'] for r in results if r['name'] == 'IMAGE_URL'), None)
    image_digest = next((r['value'] for r in results if r['name'] == 'IMAGE_DIGEST'), None)

    image_tag = image_pullspec.split(':')[-1]
    record["image_tag"] = image_tag
    record["image_pullspec"] = f"{image_pullspec.split(':')[0]}@{image_digest}"

    # Validate SLSA attestation and source image signature
    try:
        await self._validate_build_attestation_and_signature(image_pullspec, metadata.distgit_key)
    except Exception as e:
        logger.error(
            f"Failed to get SLA attestation / source signature from konflux for image {image_pullspec}, marking build as {KonfluxBuildOutcome.FAILURE}. Error: {e}"
        )
        outcome = KonfluxBuildOutcome.FAILURE

if self._config.dry_run:
    logger.info("Dry run: Would have inserted build record in Konflux DB")
else:
    await self.update_konflux_db(
        metadata, build_repo, pipelinerun_info, outcome, building_arches, build_priority
    )
```
Add defensive null checks and improve exception logging.
Two issues in the success path:
-
Missing null checks before
.split()calls (lines 216-221)
Bothimage_pullspecandimage_digestare extracted usingnext()with defaultNone, but lines 219 and 221 immediately call.split()without validation. If either value isNone, this will raiseAttributeErrorwith an unclear error message. Theupdate_konflux_dbmethod (lines 732-736) already implements the correct defensive check:if not (image_pullspec and image_digest): raise ValueError(...)
Mirror this validation here before any
.split()operations. -
Exception handling loses stack traces (lines 225-230)
The broadExceptioncatch (flagged by Ruff BLE001) usinglogger.error(...)(flagged by Ruff TRY400) discards the stack trace, making attestation and signature validation failures difficult to debug. Switch tologger.exception(...)to preserve the full traceback.
Apply these fixes:
```diff
 if outcome is KonfluxBuildOutcome.SUCCESS:
     results = pipelinerun_info.to_dict().get('status', {}).get('results', [])
     image_pullspec = next((r['value'] for r in results if r['name'] == 'IMAGE_URL'), None)
     image_digest = next((r['value'] for r in results if r['name'] == 'IMAGE_DIGEST'), None)
+    if not (image_pullspec and image_digest):
+        raise ValueError(
+            f"[{metadata.distgit_key}] Could not find expected results in konflux "
+            f"pipelinerun {pipelinerun_name}"
+        )
+
     record["image_pullspec"] = f"{image_pullspec.split(':')[0]}@{image_digest}"
     image_tag = image_pullspec.split(':')[-1]
     record["image_tag"] = image_tag
     # Validate SLSA attestation and source image signature
     try:
         await self._validate_build_attestation_and_signature(image_pullspec, metadata.distgit_key)
-    except Exception as e:
-        logger.error(
-            f"Failed to get SLA attestation / source signature from konflux for image {image_pullspec}, marking build as {KonfluxBuildOutcome.FAILURE}. Error: {e}"
-        )
+    except Exception:
+        logger.exception(
+            f"Failed to get SLA attestation / source signature from konflux for image {image_pullspec}, marking build as {KonfluxBuildOutcome.FAILURE}."
+        )
         outcome = KonfluxBuildOutcome.FAILURE
```

(Note: once `as e` is dropped, the message must not reference `{e}`; `logger.exception` records the exception and traceback itself.)

Based on learnings
🧰 Tools
🪛 Ruff (0.14.4)
227-227: Do not catch blind exception: Exception
(BLE001)
228-230: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
🤖 Prompt for AI Agents
In doozer/doozerlib/backend/konflux_image_builder.py around lines 213 to 232,
add defensive null checks for image_pullspec and image_digest before calling
.split(): if either is falsy, raise a ValueError with a clear message (mirror
the validation in update_konflux_db) so you don't call .split() on None; after
successful validation, compute image_pullspec, image_digest, image_tag and set
record fields as before. Replace the logger.error(...) in the except block with
logger.exception(...) to preserve the full traceback, then set outcome =
KonfluxBuildOutcome.FAILURE as currently done.
```python
@retry(reraise=True, wait=wait_fixed(5), stop=stop_after_attempt(3))
async def update_konflux_db(
```
🧩 Analysis chain
LGTM: Retry decorator handles transient DB connection failures.
The retry decorator appropriately targets database write operations that may fail due to transient network or connection issues. The method writes to both Konflux DB (line 760) and BigQuery (line 774), and retrying on failure prevents build record loss.
Verify that the underlying DB operations (add_build and insert_rows_json) are idempotent or have proper duplicate detection to prevent multiple records on retry:
🏁 Script executed:

```shell
#!/bin/bash
# Check if konflux_db.add_build has duplicate handling
rg -n -B5 -A15 'def add_build' --type py
```

Length of output: 5627

---

🏁 Script executed:

```shell
#!/bin/bash
# Find bq_client initialization and how insert is called
rg -n -B10 -A5 'self.bq_client\s*=' --type py
```

Length of output: 1511

🏁 Script executed:

```shell
#!/bin/bash
# Check for table schema definition and unique constraints
rg -n -B5 -A10 'SchemaField\|PRIMARY\|UNIQUE\|constraint' --type py | head -100
```

Length of output: 0

🏁 Script executed:

```shell
#!/bin/bash
# Find insert_rows_json usage in update_konflux_db
rg -n -B5 -A10 'insert_rows_json' --type py
```

Length of output: 3234
---

🏁 Script executed:

```shell
#!/bin/bash
# Check table schema creation - look for unique or primary key constraints
rg -n 'CREATE TABLE\|build_id\|record_id' artcommon/artcommonlib/konflux/konflux_db.py -A20
```

Length of output: 0

🏁 Script executed:

```shell
#!/bin/bash
# Check the KonfluxRecord/KonfluxBuildRecord structure for identifiers
rg -n -A30 'class KonfluxBuildRecord\|class KonfluxRecord' artcommon/artcommonlib/konflux/
```

Length of output: 0

🏁 Script executed:

```shell
#!/bin/bash
# Check line 760 context - what exactly calls add_build and whether it checks for existing records
sed -n '755,765p' doozer/doozerlib/backend/konflux_image_builder.py
```

Length of output: 615
---

🏁 Script executed:

```shell
#!/bin/bash
# Search for KonfluxBuildRecord definition more broadly
find . -name '*.py' -type f | xargs grep -l 'class KonfluxBuildRecord' 2>/dev/null
```

Length of output: 89

🏁 Script executed:

```shell
#!/bin/bash
# Search for KonfluxRecord definition
find . -name '*.py' -type f | xargs grep -l 'class KonfluxRecord' 2>/dev/null
```

Length of output: 56

🏁 Script executed:

```shell
#!/bin/bash
# Check artcommon/artcommonlib/konflux directory structure
fd . artcommon/artcommonlib/konflux --type f -name '*.py'
```

Length of output: 230

🏁 Script executed:

```shell
#!/bin/bash
# Check if there's a migration file or schema file that defines the table
fd 'konflux.*\.py\|schema' . --type f | grep -v __pycache__
```

Length of output: 0
---

🏁 Script executed:

```shell
#!/bin/bash
# Examine KonfluxBuildRecord class definition
cat artcommon/artcommonlib/konflux/konflux_build_record.py
```

Length of output: 15785

🏁 Script executed:

```shell
#!/bin/bash
# Check BigQueryClient implementation for insert method
cat artcommon/artcommonlib/bigquery.py | head -150
```

Length of output: 4840

---

🏁 Script executed:

```shell
#!/bin/bash
# Check for any PRIMARY KEY or UNIQUE constraints on record_id in table creation
rg -n -B10 -A20 'record_id\|PRIMARY\|UNIQUE' artcommon/artcommonlib/konflux/konflux_db.py | grep -A20 "def get_schema\|SchemaField"
```

Length of output: 0

🏁 Script executed:

```shell
#!/bin/bash
# Verify whether record_id field is marked as unique in schema generation
rg -n -C5 'generate_build_schema\|get_schema' artcommon/artcommonlib/konflux/
```

Length of output: 0
---

🏁 Script executed:

```shell
#!/bin/bash
# Look at the full add_build implementation
sed -n '481,509p' artcommon/artcommonlib/konflux/konflux_db.py
```

Length of output: 913

🏁 Script executed:

```shell
#!/bin/bash
# Search for any duplicate checking or idempotency logic in update_konflux_db
sed -n '653,800p' doozer/doozerlib/backend/konflux_image_builder.py
```

Length of output: 6672

🏁 Script executed:

```shell
#!/bin/bash
# Check if there's any query to check for existing records before insert
rg -n 'SELECT.*record_id\|WHERE.*record_id\|EXISTS' doozer/doozerlib/backend/konflux_image_builder.py
```

Length of output: 0

🏁 Script executed:

```shell
#!/bin/bash
# Check all schema field definitions in konflux_db.py to see if record_id has UNIQUE mode
rg -n 'SchemaField\|mode=' artcommon/artcommonlib/konflux/konflux_db.py
```

Length of output: 0
Critical issue: Retry decorator will create duplicate database records.
The `@retry` decorator on `update_konflux_db` (line 653) is unsafe because:

- `add_build()` uses standard SQL `INSERT INTO`, which does not deduplicate on retry (bigquery.py line 75 acknowledges this limitation)
- Each retry creates a new `record_id` (randomly generated UUID) even for the same build
- No duplicate detection or idempotency checks exist before inserting
- Each retry attempt will insert a new row with identical `build_id` but different `record_id`

Fixes required:

- Either use the BigQuery streaming API with `insertId` for idempotency (as suggested in bigquery.py line 75), OR
- Add pre-insert duplicate detection (query for existing build_id/record_id before insert), OR
- Remove the retry decorator and handle failures differently, OR
- Make `record_id` deterministic instead of random so retries are naturally idempotent

The same issue applies to the `insert_rows_json()` call at line 774 for the taskrun table.
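The "deterministic `record_id`" option can be sketched with the standard library's `uuid.uuid5`: derive the id from stable build attributes so a retried insert targets the same key. The namespace string and field names below (`build_id`, `table`) are illustrative, not this repo's actual schema:

```python
# Sketch: derive record_id deterministically so retried inserts are idempotent.
import uuid

# Fixed application-level namespace; "konflux-db" is a made-up example name.
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "konflux-db")


def deterministic_record_id(build_id: str, table: str) -> str:
    # Same inputs always hash to the same UUID, so a retry re-inserts under
    # the identical key; random uuid4() would mint a new id on every attempt.
    return str(uuid.uuid5(NAMESPACE, f"{table}:{build_id}"))


a = deterministic_record_id("ose-4-foo-container-v4.0.0-1", "builds")
b = deterministic_record_id("ose-4-foo-container-v4.0.0-1", "builds")
c = deterministic_record_id("ose-4-foo-container-v4.0.0-1", "taskruns")
print(a == b)  # -> True: a retry produces the identical key
print(a == c)  # -> False: a different table yields a different id
```

The same derived value could also serve as the BigQuery streaming `insertId`, letting the backend deduplicate retried rows instead of the client.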
🧰 Tools
🪛 Pylint (4.0.2)
[refactor] 654-654: Too many arguments (7/5)
(R0913)
[refactor] 654-654: Too many positional arguments (7/5)
(R0917)
[refactor] 654-654: Too many local variables (35/15)
(R0914)
[refactor] 654-654: Too many statements (56/50)
(R0915)
🤖 Prompt for AI Agents
In doozer/doozerlib/backend/konflux_image_builder.py around lines 653-654 (and
the insert_rows_json call at ~774), the current @retry on update_konflux_db will
cause duplicate DB rows because add_build() uses non-idempotent INSERTs and
record_id is randomly generated; fix by removing the retry decorator and making
the insert idempotent: generate a deterministic record_id (e.g., UUIDv5 or hash
derived from build_id and table name) so retries insert the same primary key,
and for the taskrun insert_rows_json use BigQuery streaming insertId or perform
a pre-insert existence check (SELECT by build_id) to avoid duplicates; ensure
error handling surfaces failures without automatic blind retries.
Force-pushed from 9a6d295 to 9b2165d
@locriandev: The following test failed, say

Full PR test history. Your PR dashboard.

Details: Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
Currently, we retry an image build up to a maximum of 3 times in a row if it fails. This has proven to be ineffective: we mostly get 3 build failures in a row that only pollute our image health checks. Removing the iteration, while adding retries to the functions that could fail due to networking/communication issues, should fix this.
Test build: https://art-jenkins.apps.prod-stable-spoke1-dc-iad2.itup.redhat.com/job/aos-cd-builds/job/build%252Focp4-konflux/17335/console