Limit Konflux build attempts to 1 #2032

locriandev · 2025-09-23T14:42:54Z

Currently, we try an image builds to a maximum of 3 times in a row if it fails. This has proven to be ineffective, and we mostly get 3 build failures in a row that only pollute our images health checks. Removing the iteration, while adding retries to functions that could fail to networking/communication issues should fix this.

Test build: https://art-jenkins.apps.prod-stable-spoke1-dc-iad2.itup.redhat.com/job/aos-cd-builds/job/build%252Focp4-konflux/17335/console

vfreex · 2025-09-24T08:00:46Z

/lgtm

locriandev · 2025-09-24T09:45:21Z

/hold

openshift-ci · 2025-09-24T09:50:45Z

New changes are detected. LGTM label has been removed.

openshift-ci · 2025-09-24T09:50:49Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from vfreex. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

coderabbitai · 2025-11-14T10:13:20Z

Walkthrough

Added tenacity retry decorators to Konflux DB access and to two async methods in the Konflux image builder; removed the builder's internal retry loop, added a retry-backed attestation/signature validation step, and made DB writes conditional on dry-run, with outcome-driven error handling.

Changes

Cohort / File(s)	Summary
Database retry configuration `artcommon/artcommonlib/konflux/konflux_db.py`	Imported tenacity utilities and decorated `get_build_records_by_nvrs` with `@retry(reraise=True, wait=wait_fixed(5), stop=stop_after_attempt(3))`.
Image builder refactor & validation `doozer/doozerlib/backend/konflux_image_builder.py`	Removed the explicit retry loop around build attempts and now performs a single `start_build` call. Added async `_validate_build_attestation_and_signature` and `update_konflux_db`, both decorated with the same tenacity retry policy. Moved image_pullspec/digest/tag population to only on SUCCESS, added attestation/signature validation that can set outcome to FAILURE, consolidated DB update path honoring `dry_run`, and raise `KonfluxImageBuildError` for non-SUCCESS outcomes. Adjusted PipelineRun logging/timing.

Sequence Diagram

sequenceDiagram
    actor Builder
    participant ImageBuilder as KonfluxImageBuilder
    participant PipelineRun
    participant Attestation as Validation
    participant KonfluxDB

    Builder->>ImageBuilder: start_build()

    rect rgb(200,220,255)
    Note over ImageBuilder,PipelineRun: Single start_build invocation (no retry loop)
    ImageBuilder->>PipelineRun: create/start PipelineRun
    PipelineRun-->>ImageBuilder: PipelineRun info
    end

    rect rgb(220,240,220)
    Note over ImageBuilder,Attestation: Validation step (retry-decorated)
    ImageBuilder->>Attestation: _validate_build_attestation_and_signature(image_pullspec, distgit_key)
    Attestation->>Attestation: verify SLSA attestation & source signature (retries)
    alt validation succeeds
        Attestation-->>ImageBuilder: validation OK
    else validation fails
        Attestation-->>ImageBuilder: validation FAILED -> outcome = FAILURE
    end
    end

    rect rgb(240,240,200)
    Note over ImageBuilder,KonfluxDB: Conditional DB update (retry-decorated)
    alt dry_run == false
        ImageBuilder->>KonfluxDB: update_konflux_db(metadata, ..., outcome) (retries)
        KonfluxDB-->>ImageBuilder: updated
    else dry_run == true
        ImageBuilder->>ImageBuilder: skip DB write
    end
    end

    alt outcome == SUCCESS
        ImageBuilder-->>Builder: return success + metadata
    else outcome != SUCCESS
        ImageBuilder-->>Builder: raise KonfluxImageBuildError
    end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Review focus:
- doozer/doozerlib/backend/konflux_image_builder.py: control-flow changes, new async validation method, retry decorators, dry-run DB path, and metadata population timing.
- artcommon/artcommonlib/konflux/konflux_db.py: correct tenacity usage on get_build_records_by_nvrs.
- Verify async retry semantics, exception handling, and logging positions relative to state updates.

✨ Finishing touches

📝 Generate docstrings

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting

📥 Commits

Reviewing files that changed from the base of the PR and between 9a6d295 and 9b2165d.

📒 Files selected for processing (2)

artcommon/artcommonlib/konflux/konflux_db.py (2 hunks)
doozer/doozerlib/backend/konflux_image_builder.py (3 hunks)

🚧 Files skipped from review as they are similar to previous changes (1)

artcommon/artcommonlib/konflux/konflux_db.py

🧰 Additional context used

🪛 Ruff (0.14.4)

doozer/doozerlib/backend/konflux_image_builder.py

227-227: Do not catch blind exception: Exception

(BLE001)

228-230: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)

GitHub Check: tests

🔇 Additional comments (3)

doozer/doozerlib/backend/konflux_image_builder.py (3)

203-208: LGTM: Clean wait logic with appropriate logging.

The waiting sequence is clear and well-logged.

233-249: LGTM: Appropriate error handling and dry-run check.

The conditional DB update based on dry-run mode and the outcome-driven error handling are correctly implemented in this segment.

295-295: LGTM: Retry decorator appropriate for transient network failures.

The retry decorator on _validate_build_attestation_and_signature is well-suited for handling transient failures when fetching SLSA attestations and signatures from the registry.

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

doozer/doozerlib/backend/konflux_image_builder.py (1)

653-719: update_konflux_db retry scope and build_priority casting require fixes

Two behavioral issues confirmed in the implementation:

build_priority may be None and cause int(build_priority) to fail
When self._config.build_priority defaults to None, the logic at lines 173–179 assigns None directly to build_priority (the else branch executes when not "auto"). This None value flows through to update_konflux_db and reaches line 718 where int(build_priority) raises a TypeError. The retry decorator then wastes 15 seconds retrying this deterministic crash.

Fix: Add validation in build() to treat None as "auto", or raise a clear configuration error early if build_priority is not set to an acceptable value.

@retry retries all exceptions, including persistent deterministic errors
The decorator has no retry_on parameter, so it retries any exception—including KeyError from missing Dockerfile labels (lines 674–679) and ValueError from schema validation in add_build(). These failures are deterministic and won't be fixed by retries; retrying only adds delay and can cause multiple inserts if failure occurs after a partial side-effect.

Fix: Restrict retries to transient failures only (e.g., BigQuery transport errors), not configuration/validation errors. Check add_build() error types and configure @retry to only handle known transient exceptions.

🧹 Nitpick comments (2)

artcommon/artcommonlib/konflux/konflux_db.py (1)

967-1024: Narrow retry conditions for get_build_records_by_nvrs to avoid retrying non-transient errors

Decorating get_build_records_by_nvrs with tenacity is a good fit for transient BigQuery / networking issues, and the aggregation of per-NVR errors into a single IOError is clear.

Right now, the default @retry behavior will treat all exceptions as retriable, including:

The local ValueError raised when where contains nvr or outcome.

Any IOError coming from strict “record not found” conditions in get_latest_build (if this method is used with strict=True).

That means callers can see unnecessary 10s+ delays before surfacing deterministic errors.

Consider tightening the retry condition, for example by:

Using retry=retry_if_exception_type(<transient-exception-types>) (e.g. IO/transport/BigQuery client errors), and/or

Separating “not found” from genuine connectivity failures (e.g. with a custom exception type) so only the latter are retried.

This keeps the new resilience benefits while avoiding retries for parameter validation and expected “not found” cases.

doozer/doozerlib/backend/konflux_image_builder.py (1)

295-337: Scope the retry behavior in _validate_build_attestation_and_signature to truly transient failures

Adding _validate_build_attestation_and_signature and decorating it with tenacity is a good way to encapsulate SLSA + signature validation and handle transient issues from fetch_slsa_attestation / get_konflux_data.

A couple of refinements:

Retrying deterministic validation errors

The function raises ValueError for an empty attestation and for a missing IMAGE_REF in the tasks.

With the default @retry(reraise=True, ...), these ValueErrors are also treated as retriable and will be re-run 2 extra times with 5-second waits, even though they are deterministic once the attestation payload is fetched.

To keep retries focused on genuinely transient problems (CLI/network/registry hiccups), consider restricting the retry condition to the expected transient exception types (e.g., ChildProcessError from the underlying commands, or specific I/O/transport errors).

Exception logging at callsite

In build(), you currently do except Exception as e: and log via logger.error(...). This swallows the traceback and triggers the Ruff hint.

Switching to logger.exception(...) there would both address the lint warning and make it easier to debug unexpected exceptions from this helper.

These changes keep the resilience benefits but avoid unnecessary delays when validation fails due to data/contract issues rather than transport flakiness.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting

📥 Commits

Reviewing files that changed from the base of the PR and between cd5d4cc and b0d239e.

📒 Files selected for processing (2)

artcommon/artcommonlib/konflux/konflux_db.py (2 hunks)
doozer/doozerlib/backend/konflux_image_builder.py (3 hunks)

🧰 Additional context used

🪛 Ruff (0.14.4)

doozer/doozerlib/backend/konflux_image_builder.py

227-227: Do not catch blind exception: Exception

(BLE001)

228-230: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)

GitHub Check: tests

🔇 Additional comments (3)

artcommon/artcommonlib/konflux/konflux_db.py (1)

20-20: Tenacity import wiring looks fine

Importing retry, stop_after_attempt, and wait_fixed here lines up with the new retry decorator on get_build_records_by_nvrs and keeps retry semantics co-located with the DB client.

doozer/doozerlib/backend/konflux_image_builder.py (2)

35-35: Tenacity import aligns with new retryable operations

The tenacity import here is consistent with the new retry decorators on _validate_build_attestation_and_signature, update_konflux_db, and the internal SBOM helper, which matches the PR goal of moving retries closer to the potentially flaky I/O.

570-605: Tenacity on _get_sbom_with_retry is a good fit for flaky cosign calls

The inner _get_sbom_with_retry helper wrapped with @retry(stop=stop_after_attempt(3), wait=wait_fixed(5)) is an appropriate place to add retries: it’s tightly scoped around the cosign CLI call and SBOM validation and has no additional side effects.

No changes needed here; this is a clean use of tenacity for transient failures.

coderabbitai · 2025-11-14T10:23:04Z

doozer/doozerlib/backend/konflux_image_builder.py

+            pipelinerun_info = await self._start_build(
+                metadata=metadata,
+                build_repo=build_repo,
+                building_arches=building_arches,
+                output_image=output_image,
+                additional_tags=additional_tags,
+                nvr=nvr,
+                build_priority=build_priority,
+                dest_dir=dest_dir,
+            )
+            pipelinerun_name = pipelinerun_info.name
+            record["task_id"] = pipelinerun_name
+            record["task_url"] = self._konflux_client.resource_url(pipelinerun_info.to_dict())
+            await self.update_konflux_db(
+                metadata,
+                build_repo,
+                pipelinerun_info,
+                KonfluxBuildOutcome.PENDING,
+                building_arches,
+                build_priority,
+            )


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

Dry-run still writes a PENDING record and error handling around outcome could be tightened

The refactored single-attempt build sequence (start → wait → compute outcome → validate attestation/signature → write final status) is much clearer and matches the intent to avoid multiple PipelineRun creations.

A couple of behavioral issues to consider:

Dry-run mode still writes to the Konflux DB (PENDING record)

You always call await self.update_konflux_db(..., outcome=KonfluxBuildOutcome.PENDING, ...) right after starting the build (Line 194), regardless of self._config.dry_run.

Later, in Lines 233–238, you skip the final DB write when dry_run is true, but the initial PENDING write still happens.

The log message “Dry run: Would have inserted build record in Konflux DB” suggests the intent is to avoid DB writes entirely in dry-run mode.

If dry-run is meant to be side-effect free w.r.t BigQuery, you likely want either:

To guard the initial PENDING write with the same if self._config.dry_run condition, or

To make update_konflux_db itself a no-op (early-return) when self._config.dry_run is true.

Otherwise, a dry-run will leave a PENDING record in the DB, which may pollute metrics and tooling that depend on build status.

Outcome / error handling and logging on validation failure

When _validate_build_attestation_and_signature fails, you set outcome = KonfluxBuildOutcome.FAILURE and proceed to the final DB update, which is the right behavior from a releaseability standpoint.

However, for failures here you only log logger.error(...) without a traceback. Given these errors are often subtle (attestation shape, JSON content, signatures), a full stack trace will make debugging much easier.

Consider switching that logger call to logger.exception(...) so the stack trace is captured, or at least using LOGGER.exception with the same message.

Overall flow looks good; fixing the dry-run DB write and improving logging here will better align behavior with expectations and make failures easier to diagnose.

Also applies to: 233-249

I'll investigate the code to verify the concerns raised in the review comment.

--- 🏁 Script executed: ```shell #!/bin/bash # Extract the function containing lines 181-201 to see full context cat doozer/doozerlib/backend/konflux_image_builder.py | head -300 | tail -130

Length of output: 6143

🏁 Script executed:

#!/bin/bash # Show lines 233-249 explicitly sed -n '233,249p' doozer/doozerlib/backend/konflux_image_builder.py

Length of output: 764

🏁 Script executed:

#!/bin/bash # Find update_konflux_db method implementation ast-grep --pattern $'function update_konflux_db($_, $_) { $$$ }'

Length of output: 0

🏁 Script executed:

#!/bin/bash # Search for _validate_build_attestation_and_signature calls and error handling rg -n '_validate_build_attestation_and_signature' -A 10 -B 2

Length of output: 2771

🏁 Script executed:

#!/bin/bash # Search for dry_run checks in this file rg -n 'dry_run' doozer/doozerlib/backend/konflux_image_builder.py -B 2 -A 2

Length of output: 834

--- 🏁 Script executed: ```shell #!/bin/bash # Search for update_konflux_db method definition rg -n 'def update_konflux_db' -A 15

Length of output: 5252

🏁 Script executed:

#!/bin/bash # Also check if it's an async method rg -n 'async def update_konflux_db' -A 15

Length of output: 2559

🏁 Script executed:

#!/bin/bash # Check the full flow from line 194 to 237 - extract exact line numbers sed -n '190,240p' doozer/doozerlib/backend/konflux_image_builder.py | cat -n

Length of output: 2891

Guard initial PENDING record write with dry-run check, and improve error logging for attestation validation failures

Two confirmed issues:

Dry-run writes inconsistent PENDING records to Konflux DB (line 194)
The unguarded await self.update_konflux_db(..., KonfluxBuildOutcome.PENDING, ...) at line 194 always executes, even when self._config.dry_run is true. The final DB write (lines 233–238) correctly skips on dry-run, but the initial PENDING write doesn't, leaving stale records. Wrap line 194 with if not self._config.dry_run: or move the check inside update_konflux_db.

Missing stack trace for attestation/signature validation errors (lines 226–230)
When _validate_build_attestation_and_signature fails, the exception is caught and logged using logger.error(...) with only the stringified error message. Switch to logger.exception(...) to capture the full stack trace, which is essential for debugging signature and attestation shape issues.

🤖 Prompt for AI Agents

In doozer/doozerlib/backend/konflux_image_builder.py around lines 181 to 201, the initial PENDING DB write at line ~194 is executed even during dry-run and the attestation/signature validation error logging (around lines 226–230) loses stack traces; guard the initial update_konflux_db call with a dry-run check (e.g., if not self._config.dry_run: await self.update_konflux_db(..., KonfluxBuildOutcome.PENDING, ...)) or move the dry-run check inside update_konflux_db so no PENDING record is written during dry runs, and replace logger.error(str(err)) in the exception handler for _validate_build_attestation_and_signature with logger.exception(...) (or logger.error(..., exc_info=True)) so the full stack trace is recorded.

coderabbitai

Actionable comments posted: 2

♻️ Duplicate comments (1)

doozer/doozerlib/backend/konflux_image_builder.py (1)
181-201: Guard the initial PENDING DB write with dry-run check to prevent stale records.

The initial call to update_konflux_db with KonfluxBuildOutcome.PENDING (line 194) executes unconditionally, even when self._config.dry_run is true. However, the final DB write at lines 233-238 correctly skips updates during dry-run. This inconsistency leaves stale PENDING records in the Konflux DB during dry-run mode, polluting metrics and health checks.

Wrap the initial DB write with a dry-run check:
 pipelinerun_name = pipelinerun_info.name
 record["task_id"] = pipelinerun_name
 record["task_url"] = self._konflux_client.resource_url(pipelinerun_info.to_dict())
-await self.update_konflux_db(
-    metadata,
-    build_repo,
-    pipelinerun_info,
-    KonfluxBuildOutcome.PENDING,
-    building_arches,
-    build_priority,
-)
+if not self._config.dry_run:
+    await self.update_konflux_db(
+        metadata,
+        build_repo,
+        pipelinerun_info,
+        KonfluxBuildOutcome.PENDING,
+        building_arches,
+        build_priority,
+    )
+else:
+    logger.info("Dry run: Skipping initial PENDING DB write")
Alternatively, move the dry-run check inside update_konflux_db itself to centralize the logic and ensure all DB writes respect the flag.

Based on learnings

🧹 Nitpick comments (1)

doozer/doozerlib/backend/konflux_image_builder.py (1)
295-296: Retry decorator should use exponential backoff with jitter per best practices.

The current implementation uses wait_fixed(5) for all retries. Per Tenacity best practices and AWS guidance, fixed rapid retries can trigger retry storms and thundering-herd problems. Industry standards recommend exponential backoff with jitter (e.g., wait_random_exponential(multiplier=1, max=30)) to avoid server overload, especially for external registry operations.

Consider refactoring to:
@retry(reraise=True, wait=wait_random_exponential(multiplier=1, max=30), stop=stop_after_attempt(3) | stop_after_delay(60))
async def _validate_build_attestation_and_signature(self, image_pullspec: str, distgit_key: str):
This pattern bounds both attempts and total latency while respecting exponential backoff conventions for registry communication.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting

📥 Commits

Reviewing files that changed from the base of the PR and between b0d239e and 9a6d295.

📒 Files selected for processing (2)

artcommon/artcommonlib/konflux/konflux_db.py (2 hunks)
doozer/doozerlib/backend/konflux_image_builder.py (3 hunks)

🚧 Files skipped from review as they are similar to previous changes (1)

artcommon/artcommonlib/konflux/konflux_db.py

🧰 Additional context used

🪛 Ruff (0.14.4)

doozer/doozerlib/backend/konflux_image_builder.py

227-227: Do not catch blind exception: Exception

(BLE001)

228-230: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)

GitHub Check: tests

🔇 Additional comments (1)

doozer/doozerlib/backend/konflux_image_builder.py (1)

233-249: LGTM: Correct dry-run handling and outcome processing.

The final DB write correctly guards against dry-run mode, and the outcome handling properly sets build status and raises errors when needed. This pattern should be consistently applied to the initial PENDING write as well (see comment on lines 181-201).

coderabbitai · 2025-11-15T10:29:50Z

doozer/doozerlib/backend/konflux_image_builder.py

+            # Even if the build succeeded, if the SLSA attestation cannot be retrieved, it is unreleasable.
+            if outcome is KonfluxBuildOutcome.SUCCESS:
+                results = pipelinerun_info.to_dict().get('status', {}).get('results', [])
+                image_pullspec = next((r['value'] for r in results if r['name'] == 'IMAGE_URL'), None)
+                image_digest = next((r['value'] for r in results if r['name'] == 'IMAGE_DIGEST'), None)

-                    image_tag = image_pullspec.split(':')[-1]
-                    record["image_tag"] = image_tag
+                record["image_pullspec"] = f"{image_pullspec.split(':')[0]}@{image_digest}"

-                    # Validate SLSA attestation and source image signature
-                    try:
-                        await self._validate_build_attestation_and_signature(image_pullspec, metadata.distgit_key)
-                    except Exception as e:
-                        logger.error(
-                            f"Failed to get SLA attestation / source signature from konflux for image {image_pullspec}, marking build as {KonfluxBuildOutcome.FAILURE}. Error: {e}"
-                        )
-                        outcome = KonfluxBuildOutcome.FAILURE
+                image_tag = image_pullspec.split(':')[-1]
+                record["image_tag"] = image_tag

-                if self._config.dry_run:
-                    logger.info("Dry run: Would have inserted build record in Konflux DB")
-                else:
-                    await self.update_konflux_db(
-                        metadata, build_repo, pipelinerun_info, outcome, building_arches, build_priority
+                # Validate SLSA attestation and source image signature
+                try:
+                    await self._validate_build_attestation_and_signature(image_pullspec, metadata.distgit_key)
+                except Exception as e:
+                    logger.error(
+                        f"Failed to get SLA attestation / source signature from konflux for image {image_pullspec}, marking build as {KonfluxBuildOutcome.FAILURE}. Error: {e}"
                    )
+                    outcome = KonfluxBuildOutcome.FAILURE
+


⚠️ Potential issue | 🟠 Major

Add defensive null checks and improve exception logging.

Two issues in the success path:

Missing null checks before .split() calls (lines 216-221)
Both image_pullspec and image_digest are extracted using next() with default None, but lines 219 and 221 immediately call .split() without validation. If either value is None, this will raise AttributeError with an unclear error message. The update_konflux_db method (lines 732-736) already implements the correct defensive check:

if not (image_pullspec and image_digest): raise ValueError(...)

Mirror this validation here before any .split() operations.

Exception handling loses stack traces (lines 225-230)
The broad Exception catch (flagged by Ruff BLE001) using logger.error(...) (flagged by Ruff TRY400) discards the stack trace, making attestation and signature validation failures difficult to debug. Switch to logger.exception(...) to preserve the full traceback.

Apply these fixes:

if outcome is KonfluxBuildOutcome.SUCCESS: results = pipelinerun_info.to_dict().get('status', {}).get('results', []) image_pullspec = next((r['value'] for r in results if r['name'] == 'IMAGE_URL'), None) image_digest = next((r['value'] for r in results if r['name'] == 'IMAGE_DIGEST'), None) + if not (image_pullspec and image_digest): + raise ValueError( + f"[{metadata.distgit_key}] Could not find expected results in konflux " + f"pipelinerun {pipelinerun_name}" + ) + record["image_pullspec"] = f"{image_pullspec.split(':')[0]}@{image_digest}" image_tag = image_pullspec.split(':')[-1] record["image_tag"] = image_tag # Validate SLSA attestation and source image signature try: await self._validate_build_attestation_and_signature(image_pullspec, metadata.distgit_key) - except Exception as e: - logger.error( + except Exception: + logger.exception( f"Failed to get SLA attestation / source signature from konflux for image {image_pullspec}, marking build as {KonfluxBuildOutcome.FAILURE}. Error: {e}" ) outcome = KonfluxBuildOutcome.FAILURE

Based on learnings

🧰 Tools

🪛 Ruff (0.14.4)

227-227: Do not catch blind exception: Exception

(BLE001)

228-230: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

🤖 Prompt for AI Agents

In doozer/doozerlib/backend/konflux_image_builder.py around lines 213 to 232, add defensive null checks for image_pullspec and image_digest before calling .split(): if either is falsy, raise a ValueError with a clear message (mirror the validation in update_konflux_db) so you don't call .split() on None; after successful validation, compute image_pullspec, image_digest, image_tag and set record fields as before. Replace the logger.error(...) in the except block with logger.exception(...) to preserve the full traceback, then set outcome = KonfluxBuildOutcome.FAILURE as currently done.

coderabbitai · 2025-11-15T10:29:50Z

doozer/doozerlib/backend/konflux_image_builder.py

+    @retry(reraise=True, wait=wait_fixed(5), stop=stop_after_attempt(3))
    async def update_konflux_db(


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

LGTM: Retry decorator handles transient DB connection failures.

The retry decorator appropriately targets database write operations that may fail due to transient network or connection issues. The method writes to both Konflux DB (line 760) and BigQuery (line 774), and retrying on failure prevents build record loss.

Verify that the underlying DB operations (add_build and insert_rows_json) are idempotent or have proper duplicate detection to prevent multiple records on retry:

🏁 Script executed:

#!/bin/bash # Check if konflux_db.add_build has duplicate handling rg -n -B5 -A15 'def add_build' --type py

Length of output: 5627

--- 🏁 Script executed: ```shell #!/bin/bash # Find bq_client initialization and how insert is called rg -n -B10 -A5 'self.bq_client\s*=' --type py

Length of output: 1511

🏁 Script executed:

#!/bin/bash # Check for table schema definition and unique constraints rg -n -B5 -A10 'SchemaField\|PRIMARY\|UNIQUE\|constraint' --type py | head -100

Length of output: 0

🏁 Script executed:

#!/bin/bash # Find insert_rows_json usage in update_konflux_db rg -n -B5 -A10 'insert_rows_json' --type py

Length of output: 3234

--- 🏁 Script executed: ```shell #!/bin/bash # Check table schema creation - look for unique or primary key constraints rg -n 'CREATE TABLE\|build_id\|record_id' artcommon/artcommonlib/konflux/konflux_db.py -A20

Length of output: 0

🏁 Script executed:

#!/bin/bash # Check the KonfluxRecord/KonfluxBuildRecord structure for identifiers rg -n -A30 'class KonfluxBuildRecord\|class KonfluxRecord' artcommon/artcommonlib/konflux/

Length of output: 0

🏁 Script executed:

#!/bin/bash # Check line 760 context - what exactly calls add_build and whether it checks for existing records sed -n '755,765p' doozer/doozerlib/backend/konflux_image_builder.py

Length of output: 615

--- 🏁 Script executed: ```shell #!/bin/bash # Search for KonfluxBuildRecord definition more broadly find . -name '*.py' -type f | xargs grep -l 'class KonfluxBuildRecord' 2>/dev/null

Length of output: 89

🏁 Script executed:

#!/bin/bash # Search for KonfluxRecord definition find . -name '*.py' -type f | xargs grep -l 'class KonfluxRecord' 2>/dev/null

Length of output: 56

🏁 Script executed:

#!/bin/bash # Check artcommon/artcommonlib/konflux directory structure fd . artcommon/artcommonlib/konflux --type f -name '*.py'

Length of output: 230

🏁 Script executed:

#!/bin/bash # Check if there's a migration file or schema file that defines the table fd 'konflux.*\.py\|schema' . --type f | grep -v __pycache__

Length of output: 0

--- 🏁 Script executed: ```shell #!/bin/bash # Examine KonfluxBuildRecord class definition cat artcommon/artcommonlib/konflux/konflux_build_record.py

Length of output: 15785

🏁 Script executed:

#!/bin/bash # Check BigQueryClient implementation for insert method cat artcommon/artcommonlib/bigquery.py | head -150

Length of output: 4840

--- 🏁 Script executed: ```shell #!/bin/bash # Check for any PRIMARY KEY or UNIQUE constraints on record_id in table creation rg -n -B10 -A20 'record_id\|PRIMARY\|UNIQUE' artcommon/artcommonlib/konflux/konflux_db.py | grep -A20 "def get_schema\|SchemaField"

Length of output: 0

🏁 Script executed:

#!/bin/bash # Verify whether record_id field is marked as unique in schema generation rg -n -C5 'generate_build_schema\|get_schema' artcommon/artcommonlib/konflux/

Length of output: 0

--- 🏁 Script executed: ```shell #!/bin/bash # Look at the full add_build implementation sed -n '481,509p' artcommon/artcommonlib/konflux/konflux_db.py

Length of output: 913

🏁 Script executed:

#!/bin/bash # Search for any duplicate checking or idempotency logic in update_konflux_db sed -n '653,800p' doozer/doozerlib/backend/konflux_image_builder.py

Length of output: 6672

🏁 Script executed:

#!/bin/bash # Check if there's any query to check for existing records before insert rg -n 'SELECT.*record_id\|WHERE.*record_id\|EXISTS' doozer/doozerlib/backend/konflux_image_builder.py

Length of output: 0

🏁 Script executed:

#!/bin/bash # Check all schema field definitions in konflux_db.py to see if record_id has UNIQUE mode rg -n 'SchemaField\|mode=' artcommon/artcommonlib/konflux/konflux_db.py

Length of output: 0

Critical issue: Retry decorator will create duplicate database records.

The @retry decorator on update_konflux_db (line 653) is unsafe because:

add_build() uses standard SQL INSERT INTO which does not deduplicate on retry (bigquery.py line 75 acknowledges this limitation)

Each retry creates a new record_id (randomly generated UUID) even for the same build

No duplicate detection or idempotency checks exist before inserting

Each retry attempt will insert a new row with identical build_id but different record_id

Fixes required:

Either use BigQuery streaming API with insertId for idempotency (as suggested in bigquery.py line 75), OR

Add pre-insert duplicate detection (query for existing build_id/record_id before insert), OR

Remove the retry decorator and handle failures differently, OR

Make record_id deterministic instead of random so retries are naturally idempotent

The same issue applies to the insert_rows_json() call at line 774 for the taskrun table.

🧰 Tools

🪛 Pylint (4.0.2)

[refactor] 654-654: Too many arguments (7/5)

(R0913)

[refactor] 654-654: Too many positional arguments (7/5)

(R0917)

[refactor] 654-654: Too many local variables (35/15)

(R0914)

[refactor] 654-654: Too many statements (56/50)

(R0915)

🤖 Prompt for AI Agents

In doozer/doozerlib/backend/konflux_image_builder.py around lines 653-654 (and the insert_rows_json call at ~774), the current @retry on update_konflux_db will cause duplicate DB rows because add_build() uses non-idempotent INSERTs and record_id is randomly generated; fix by removing the retry decorator and making the insert idempotent: generate a deterministic record_id (e.g., UUIDv5 or hash derived from build_id and table name) so retries insert the same primary key, and for the taskrun insert_rows_json use BigQuery streaming insertId or perform a pre-insert existence check (SELECT by build_id) to avoid duplicates; ensure error handling surfaces failures without automatic blind retries.

openshift-ci · 2025-11-17T09:00:51Z

@locriandev: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name	Commit	Details	Required	Rerun command
ci/prow/security	`9b2165d`	link	false	`/test security`

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

openshift-ci bot assigned vfreex Sep 24, 2025

openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Sep 24, 2025

openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 24, 2025

locriandev force-pushed the no-build-retries branch from da22b82 to 5350cf1 Compare September 24, 2025 09:50

openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Sep 24, 2025

locriandev force-pushed the no-build-retries branch from 5350cf1 to b0d239e Compare November 14, 2025 10:12

coderabbitai bot reviewed Nov 14, 2025

View reviewed changes

locriandev force-pushed the no-build-retries branch from b0d239e to 9a6d295 Compare November 15, 2025 10:23

coderabbitai bot reviewed Nov 15, 2025

View reviewed changes

Limit Konflux build attempts to 1

9b2165d

locriandev force-pushed the no-build-retries branch from 9a6d295 to 9b2165d Compare November 17, 2025 08:52

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Limit Konflux build attempts to 1 #2032

Limit Konflux build attempts to 1 #2032

Uh oh!

locriandev commented Sep 23, 2025 •

edited

Loading

Uh oh!

vfreex commented Sep 24, 2025

Uh oh!

locriandev commented Sep 24, 2025

Uh oh!

openshift-ci bot commented Sep 24, 2025

Uh oh!

openshift-ci bot commented Sep 24, 2025

Uh oh!

coderabbitai bot commented Nov 14, 2025 •

edited

Loading

Uh oh!

coderabbitai bot left a comment

Uh oh!

coderabbitai bot Nov 14, 2025

Uh oh!

coderabbitai bot left a comment

Uh oh!

coderabbitai bot Nov 15, 2025

Uh oh!

coderabbitai bot Nov 15, 2025

Uh oh!

openshift-ci bot commented Nov 17, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

		@retry(reraise=True, wait=wait_fixed(5), stop=stop_after_attempt(3))
		async def update_konflux_db(

Limit Konflux build attempts to 1 #2032

Are you sure you want to change the base?

Limit Konflux build attempts to 1 #2032

Uh oh!

Conversation

locriandev commented Sep 23, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

vfreex commented Sep 24, 2025

Uh oh!

locriandev commented Sep 24, 2025

Uh oh!

openshift-ci bot commented Sep 24, 2025

Uh oh!

openshift-ci bot commented Sep 24, 2025

Uh oh!

coderabbitai bot commented Nov 14, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram

Estimated code review effort

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Nov 14, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Nov 15, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Nov 15, 2025

Choose a reason for hiding this comment

Uh oh!

openshift-ci bot commented Nov 17, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

locriandev commented Sep 23, 2025 •

edited

Loading

coderabbitai bot commented Nov 14, 2025 •

edited

Loading