Skip to content

Conversation

@locriandev
Copy link
Contributor

@locriandev locriandev commented Sep 23, 2025

Currently, we try an image builds to a maximum of 3 times in a row if it fails. This has proven to be ineffective, and we mostly get 3 build failures in a row that only pollute our images health checks. Removing the iteration, while adding retries to functions that could fail to networking/communication issues should fix this.

Test build: https://art-jenkins.apps.prod-stable-spoke1-dc-iad2.itup.redhat.com/job/aos-cd-builds/job/build%252Focp4-konflux/17335/console

@vfreex
Copy link
Contributor

vfreex commented Sep 24, 2025

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Sep 24, 2025
@locriandev
Copy link
Contributor Author

/hold

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 24, 2025
@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Sep 24, 2025
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Sep 24, 2025

New changes are detected. LGTM label has been removed.

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Sep 24, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from vfreex. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Nov 14, 2025

Walkthrough

Added tenacity retry decorators to Konflux DB access and to two async methods in the Konflux image builder; removed the builder's internal retry loop, added a retry-backed attestation/signature validation step, and made DB writes conditional on dry-run, with outcome-driven error handling.

Changes

Cohort / File(s) Summary
Database retry configuration
artcommon/artcommonlib/konflux/konflux_db.py
Imported tenacity utilities and decorated get_build_records_by_nvrs with @retry(reraise=True, wait=wait_fixed(5), stop=stop_after_attempt(3)).
Image builder refactor & validation
doozer/doozerlib/backend/konflux_image_builder.py
Removed the explicit retry loop around build attempts and now performs a single start_build call. Added async _validate_build_attestation_and_signature and update_konflux_db, both decorated with the same tenacity retry policy. Moved image_pullspec/digest/tag population to only on SUCCESS, added attestation/signature validation that can set outcome to FAILURE, consolidated DB update path honoring dry_run, and raise KonfluxImageBuildError for non-SUCCESS outcomes. Adjusted PipelineRun logging/timing.

Sequence Diagram

sequenceDiagram
    actor Builder
    participant ImageBuilder as KonfluxImageBuilder
    participant PipelineRun
    participant Attestation as Validation
    participant KonfluxDB

    Builder->>ImageBuilder: start_build()

    rect rgb(200,220,255)
    Note over ImageBuilder,PipelineRun: Single start_build invocation (no retry loop)
    ImageBuilder->>PipelineRun: create/start PipelineRun
    PipelineRun-->>ImageBuilder: PipelineRun info
    end

    rect rgb(220,240,220)
    Note over ImageBuilder,Attestation: Validation step (retry-decorated)
    ImageBuilder->>Attestation: _validate_build_attestation_and_signature(image_pullspec, distgit_key)
    Attestation->>Attestation: verify SLSA attestation & source signature (retries)
    alt validation succeeds
        Attestation-->>ImageBuilder: validation OK
    else validation fails
        Attestation-->>ImageBuilder: validation FAILED -> outcome = FAILURE
    end
    end

    rect rgb(240,240,200)
    Note over ImageBuilder,KonfluxDB: Conditional DB update (retry-decorated)
    alt dry_run == false
        ImageBuilder->>KonfluxDB: update_konflux_db(metadata, ..., outcome) (retries)
        KonfluxDB-->>ImageBuilder: updated
    else dry_run == true
        ImageBuilder->>ImageBuilder: skip DB write
    end
    end

    alt outcome == SUCCESS
        ImageBuilder-->>Builder: return success + metadata
    else outcome != SUCCESS
        ImageBuilder-->>Builder: raise KonfluxImageBuildError
    end
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

  • Review focus:
    • doozer/doozerlib/backend/konflux_image_builder.py: control-flow changes, new async validation method, retry decorators, dry-run DB path, and metadata population timing.
    • artcommon/artcommonlib/konflux/konflux_db.py: correct tenacity usage on get_build_records_by_nvrs.
    • Verify async retry semantics, exception handling, and logging positions relative to state updates.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting

📥 Commits

Reviewing files that changed from the base of the PR and between 9a6d295 and 9b2165d.

📒 Files selected for processing (2)
  • artcommon/artcommonlib/konflux/konflux_db.py (2 hunks)
  • doozer/doozerlib/backend/konflux_image_builder.py (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • artcommon/artcommonlib/konflux/konflux_db.py
🧰 Additional context used
🪛 Ruff (0.14.4)
doozer/doozerlib/backend/konflux_image_builder.py

227-227: Do not catch blind exception: Exception

(BLE001)


228-230: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: tests
🔇 Additional comments (3)
doozer/doozerlib/backend/konflux_image_builder.py (3)

203-208: LGTM: Clean wait logic with appropriate logging.

The waiting sequence is clear and well-logged.


233-249: LGTM: Appropriate error handling and dry-run check.

The conditional DB update based on dry-run mode and the outcome-driven error handling are correctly implemented in this segment.


295-295: LGTM: Retry decorator appropriate for transient network failures.

The retry decorator on _validate_build_attestation_and_signature is well-suited for handling transient failures when fetching SLSA attestations and signatures from the registry.


Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
doozer/doozerlib/backend/konflux_image_builder.py (1)

653-719: update_konflux_db retry scope and build_priority casting require fixes

Two behavioral issues confirmed in the implementation:

  1. build_priority may be None and cause int(build_priority) to fail
    When self._config.build_priority defaults to None, the logic at lines 173–179 assigns None directly to build_priority (the else branch executes when not "auto"). This None value flows through to update_konflux_db and reaches line 718 where int(build_priority) raises a TypeError. The retry decorator then wastes 15 seconds retrying this deterministic crash.

    Fix: Add validation in build() to treat None as "auto", or raise a clear configuration error early if build_priority is not set to an acceptable value.

  2. @retry retries all exceptions, including persistent deterministic errors
    The decorator has no retry_on parameter, so it retries any exception—including KeyError from missing Dockerfile labels (lines 674–679) and ValueError from schema validation in add_build(). These failures are deterministic and won't be fixed by retries; retrying only adds delay and can cause multiple inserts if failure occurs after a partial side-effect.

    Fix: Restrict retries to transient failures only (e.g., BigQuery transport errors), not configuration/validation errors. Check add_build() error types and configure @retry to only handle known transient exceptions.

🧹 Nitpick comments (2)
artcommon/artcommonlib/konflux/konflux_db.py (1)

967-1024: Narrow retry conditions for get_build_records_by_nvrs to avoid retrying non-transient errors

Decorating get_build_records_by_nvrs with tenacity is a good fit for transient BigQuery / networking issues, and the aggregation of per-NVR errors into a single IOError is clear.

Right now, the default @retry behavior will treat all exceptions as retriable, including:

  • The local ValueError raised when where contains nvr or outcome.
  • Any IOError coming from strict “record not found” conditions in get_latest_build (if this method is used with strict=True).

That means callers can see unnecessary 10s+ delays before surfacing deterministic errors.

Consider tightening the retry condition, for example by:

  • Using retry=retry_if_exception_type(<transient-exception-types>) (e.g. IO/transport/BigQuery client errors), and/or
  • Separating “not found” from genuine connectivity failures (e.g. with a custom exception type) so only the latter are retried.

This keeps the new resilience benefits while avoiding retries for parameter validation and expected “not found” cases.

doozer/doozerlib/backend/konflux_image_builder.py (1)

295-337: Scope the retry behavior in _validate_build_attestation_and_signature to truly transient failures

Adding _validate_build_attestation_and_signature and decorating it with tenacity is a good way to encapsulate SLSA + signature validation and handle transient issues from fetch_slsa_attestation / get_konflux_data.

A couple of refinements:

  1. Retrying deterministic validation errors

    • The function raises ValueError for an empty attestation and for a missing IMAGE_REF in the tasks.
    • With the default @retry(reraise=True, ...), these ValueErrors are also treated as retriable and will be re-run 2 extra times with 5-second waits, even though they are deterministic once the attestation payload is fetched.

    To keep retries focused on genuinely transient problems (CLI/network/registry hiccups), consider restricting the retry condition to the expected transient exception types (e.g., ChildProcessError from the underlying commands, or specific I/O/transport errors).

  2. Exception logging at callsite

    • In build(), you currently do except Exception as e: and log via logger.error(...). This swallows the traceback and triggers the Ruff hint.
    • Switching to logger.exception(...) there would both address the lint warning and make it easier to debug unexpected exceptions from this helper.

These changes keep the resilience benefits but avoid unnecessary delays when validation fails due to data/contract issues rather than transport flakiness.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting

📥 Commits

Reviewing files that changed from the base of the PR and between cd5d4cc and b0d239e.

📒 Files selected for processing (2)
  • artcommon/artcommonlib/konflux/konflux_db.py (2 hunks)
  • doozer/doozerlib/backend/konflux_image_builder.py (3 hunks)
🧰 Additional context used
🪛 Ruff (0.14.4)
doozer/doozerlib/backend/konflux_image_builder.py

227-227: Do not catch blind exception: Exception

(BLE001)


228-230: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: tests
🔇 Additional comments (3)
artcommon/artcommonlib/konflux/konflux_db.py (1)

20-20: Tenacity import wiring looks fine

Importing retry, stop_after_attempt, and wait_fixed here lines up with the new retry decorator on get_build_records_by_nvrs and keeps retry semantics co-located with the DB client.

doozer/doozerlib/backend/konflux_image_builder.py (2)

35-35: Tenacity import aligns with new retryable operations

The tenacity import here is consistent with the new retry decorators on _validate_build_attestation_and_signature, update_konflux_db, and the internal SBOM helper, which matches the PR goal of moving retries closer to the potentially flaky I/O.


570-605: Tenacity on _get_sbom_with_retry is a good fit for flaky cosign calls

The inner _get_sbom_with_retry helper wrapped with @retry(stop=stop_after_attempt(3), wait=wait_fixed(5)) is an appropriate place to add retries: it’s tightly scoped around the cosign CLI call and SBOM validation and has no additional side effects.

No changes needed here; this is a clean use of tenacity for transient failures.

Comment on lines +181 to +201
pipelinerun_info = await self._start_build(
metadata=metadata,
build_repo=build_repo,
building_arches=building_arches,
output_image=output_image,
additional_tags=additional_tags,
nvr=nvr,
build_priority=build_priority,
dest_dir=dest_dir,
)
pipelinerun_name = pipelinerun_info.name
record["task_id"] = pipelinerun_name
record["task_url"] = self._konflux_client.resource_url(pipelinerun_info.to_dict())
await self.update_konflux_db(
metadata,
build_repo,
pipelinerun_info,
KonfluxBuildOutcome.PENDING,
building_arches,
build_priority,
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

Dry-run still writes a PENDING record and error handling around outcome could be tightened

The refactored single-attempt build sequence (start → wait → compute outcome → validate attestation/signature → write final status) is much clearer and matches the intent to avoid multiple PipelineRun creations.

A couple of behavioral issues to consider:

  1. Dry-run mode still writes to the Konflux DB (PENDING record)

    • You always call await self.update_konflux_db(..., outcome=KonfluxBuildOutcome.PENDING, ...) right after starting the build (Line 194), regardless of self._config.dry_run.
    • Later, in Lines 233–238, you skip the final DB write when dry_run is true, but the initial PENDING write still happens.
    • The log message “Dry run: Would have inserted build record in Konflux DB” suggests the intent is to avoid DB writes entirely in dry-run mode.

    If dry-run is meant to be side-effect free w.r.t BigQuery, you likely want either:

    • To guard the initial PENDING write with the same if self._config.dry_run condition, or
    • To make update_konflux_db itself a no-op (early-return) when self._config.dry_run is true.

    Otherwise, a dry-run will leave a PENDING record in the DB, which may pollute metrics and tooling that depend on build status.

  2. Outcome / error handling and logging on validation failure

    • When _validate_build_attestation_and_signature fails, you set outcome = KonfluxBuildOutcome.FAILURE and proceed to the final DB update, which is the right behavior from a releaseability standpoint.
    • However, for failures here you only log logger.error(...) without a traceback. Given these errors are often subtle (attestation shape, JSON content, signatures), a full stack trace will make debugging much easier.

    Consider switching that logger call to logger.exception(...) so the stack trace is captured, or at least using LOGGER.exception with the same message.

Overall flow looks good; fixing the dry-run DB write and improving logging here will better align behavior with expectations and make failures easier to diagnose.

Also applies to: 233-249


I'll investigate the code to verify the concerns raised in the review comment.


---

🏁 Script executed:

```shell
#!/bin/bash
# Extract the function containing lines 181-201 to see full context
cat doozer/doozerlib/backend/konflux_image_builder.py | head -300 | tail -130

Length of output: 6143


🏁 Script executed:

#!/bin/bash
# Show lines 233-249 explicitly
sed -n '233,249p' doozer/doozerlib/backend/konflux_image_builder.py

Length of output: 764


🏁 Script executed:

#!/bin/bash
# Find update_konflux_db method implementation
ast-grep --pattern $'function update_konflux_db($_, $_) {
  $$$
}'

Length of output: 0


🏁 Script executed:

#!/bin/bash
# Search for _validate_build_attestation_and_signature calls and error handling
rg -n '_validate_build_attestation_and_signature' -A 10 -B 2

Length of output: 2771


🏁 Script executed:

#!/bin/bash
# Search for dry_run checks in this file
rg -n 'dry_run' doozer/doozerlib/backend/konflux_image_builder.py -B 2 -A 2

Length of output: 834



---

🏁 Script executed:

```shell
#!/bin/bash
# Search for update_konflux_db method definition
rg -n 'def update_konflux_db' -A 15

Length of output: 5252


🏁 Script executed:

#!/bin/bash
# Also check if it's an async method
rg -n 'async def update_konflux_db' -A 15

Length of output: 2559


🏁 Script executed:

#!/bin/bash
# Check the full flow from line 194 to 237 - extract exact line numbers
sed -n '190,240p' doozer/doozerlib/backend/konflux_image_builder.py | cat -n

Length of output: 2891


Guard initial PENDING record write with dry-run check, and improve error logging for attestation validation failures

Two confirmed issues:

  1. Dry-run writes inconsistent PENDING records to Konflux DB (line 194)
    The unguarded await self.update_konflux_db(..., KonfluxBuildOutcome.PENDING, ...) at line 194 always executes, even when self._config.dry_run is true. The final DB write (lines 233–238) correctly skips on dry-run, but the initial PENDING write doesn't, leaving stale records. Wrap line 194 with if not self._config.dry_run: or move the check inside update_konflux_db.

  2. Missing stack trace for attestation/signature validation errors (lines 226–230)
    When _validate_build_attestation_and_signature fails, the exception is caught and logged using logger.error(...) with only the stringified error message. Switch to logger.exception(...) to capture the full stack trace, which is essential for debugging signature and attestation shape issues.

🤖 Prompt for AI Agents
In doozer/doozerlib/backend/konflux_image_builder.py around lines 181 to 201,
the initial PENDING DB write at line ~194 is executed even during dry-run and
the attestation/signature validation error logging (around lines 226–230) loses
stack traces; guard the initial update_konflux_db call with a dry-run check
(e.g., if not self._config.dry_run: await self.update_konflux_db(...,
KonfluxBuildOutcome.PENDING, ...)) or move the dry-run check inside
update_konflux_db so no PENDING record is written during dry runs, and replace
logger.error(str(err)) in the exception handler for
_validate_build_attestation_and_signature with logger.exception(...) (or
logger.error(..., exc_info=True)) so the full stack trace is recorded.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

♻️ Duplicate comments (1)
doozer/doozerlib/backend/konflux_image_builder.py (1)

181-201: Guard the initial PENDING DB write with dry-run check to prevent stale records.

The initial call to update_konflux_db with KonfluxBuildOutcome.PENDING (line 194) executes unconditionally, even when self._config.dry_run is true. However, the final DB write at lines 233-238 correctly skips updates during dry-run. This inconsistency leaves stale PENDING records in the Konflux DB during dry-run mode, polluting metrics and health checks.

Wrap the initial DB write with a dry-run check:

 pipelinerun_name = pipelinerun_info.name
 record["task_id"] = pipelinerun_name
 record["task_url"] = self._konflux_client.resource_url(pipelinerun_info.to_dict())
-await self.update_konflux_db(
-    metadata,
-    build_repo,
-    pipelinerun_info,
-    KonfluxBuildOutcome.PENDING,
-    building_arches,
-    build_priority,
-)
+if not self._config.dry_run:
+    await self.update_konflux_db(
+        metadata,
+        build_repo,
+        pipelinerun_info,
+        KonfluxBuildOutcome.PENDING,
+        building_arches,
+        build_priority,
+    )
+else:
+    logger.info("Dry run: Skipping initial PENDING DB write")

Alternatively, move the dry-run check inside update_konflux_db itself to centralize the logic and ensure all DB writes respect the flag.

Based on learnings

🧹 Nitpick comments (1)
doozer/doozerlib/backend/konflux_image_builder.py (1)

295-296: Retry decorator should use exponential backoff with jitter per best practices.

The current implementation uses wait_fixed(5) for all retries. Per Tenacity best practices and AWS guidance, fixed rapid retries can trigger retry storms and thundering-herd problems. Industry standards recommend exponential backoff with jitter (e.g., wait_random_exponential(multiplier=1, max=30)) to avoid server overload, especially for external registry operations.

Consider refactoring to:

@retry(reraise=True, wait=wait_random_exponential(multiplier=1, max=30), stop=stop_after_attempt(3) | stop_after_delay(60))
async def _validate_build_attestation_and_signature(self, image_pullspec: str, distgit_key: str):

This pattern bounds both attempts and total latency while respecting exponential backoff conventions for registry communication.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting

📥 Commits

Reviewing files that changed from the base of the PR and between b0d239e and 9a6d295.

📒 Files selected for processing (2)
  • artcommon/artcommonlib/konflux/konflux_db.py (2 hunks)
  • doozer/doozerlib/backend/konflux_image_builder.py (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • artcommon/artcommonlib/konflux/konflux_db.py
🧰 Additional context used
🪛 Ruff (0.14.4)
doozer/doozerlib/backend/konflux_image_builder.py

227-227: Do not catch blind exception: Exception

(BLE001)


228-230: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: tests
🔇 Additional comments (1)
doozer/doozerlib/backend/konflux_image_builder.py (1)

233-249: LGTM: Correct dry-run handling and outcome processing.

The final DB write correctly guards against dry-run mode, and the outcome handling properly sets build status and raises errors when needed. This pattern should be consistently applied to the initial PENDING write as well (see comment on lines 181-201).

Comment on lines +213 to +232
# Even if the build succeeded, if the SLSA attestation cannot be retrieved, it is unreleasable.
if outcome is KonfluxBuildOutcome.SUCCESS:
results = pipelinerun_info.to_dict().get('status', {}).get('results', [])
image_pullspec = next((r['value'] for r in results if r['name'] == 'IMAGE_URL'), None)
image_digest = next((r['value'] for r in results if r['name'] == 'IMAGE_DIGEST'), None)

image_tag = image_pullspec.split(':')[-1]
record["image_tag"] = image_tag
record["image_pullspec"] = f"{image_pullspec.split(':')[0]}@{image_digest}"

# Validate SLSA attestation and source image signature
try:
await self._validate_build_attestation_and_signature(image_pullspec, metadata.distgit_key)
except Exception as e:
logger.error(
f"Failed to get SLA attestation / source signature from konflux for image {image_pullspec}, marking build as {KonfluxBuildOutcome.FAILURE}. Error: {e}"
)
outcome = KonfluxBuildOutcome.FAILURE
image_tag = image_pullspec.split(':')[-1]
record["image_tag"] = image_tag

if self._config.dry_run:
logger.info("Dry run: Would have inserted build record in Konflux DB")
else:
await self.update_konflux_db(
metadata, build_repo, pipelinerun_info, outcome, building_arches, build_priority
# Validate SLSA attestation and source image signature
try:
await self._validate_build_attestation_and_signature(image_pullspec, metadata.distgit_key)
except Exception as e:
logger.error(
f"Failed to get SLA attestation / source signature from konflux for image {image_pullspec}, marking build as {KonfluxBuildOutcome.FAILURE}. Error: {e}"
)
outcome = KonfluxBuildOutcome.FAILURE

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Add defensive null checks and improve exception logging.

Two issues in the success path:

  1. Missing null checks before .split() calls (lines 216-221)
    Both image_pullspec and image_digest are extracted using next() with default None, but lines 219 and 221 immediately call .split() without validation. If either value is None, this will raise AttributeError with an unclear error message. The update_konflux_db method (lines 732-736) already implements the correct defensive check:

    if not (image_pullspec and image_digest):
        raise ValueError(...)

    Mirror this validation here before any .split() operations.

  2. Exception handling loses stack traces (lines 225-230)
    The broad Exception catch (flagged by Ruff BLE001) using logger.error(...) (flagged by Ruff TRY400) discards the stack trace, making attestation and signature validation failures difficult to debug. Switch to logger.exception(...) to preserve the full traceback.

Apply these fixes:

 if outcome is KonfluxBuildOutcome.SUCCESS:
     results = pipelinerun_info.to_dict().get('status', {}).get('results', [])
     image_pullspec = next((r['value'] for r in results if r['name'] == 'IMAGE_URL'), None)
     image_digest = next((r['value'] for r in results if r['name'] == 'IMAGE_DIGEST'), None)
 
+    if not (image_pullspec and image_digest):
+        raise ValueError(
+            f"[{metadata.distgit_key}] Could not find expected results in konflux "
+            f"pipelinerun {pipelinerun_name}"
+        )
+
     record["image_pullspec"] = f"{image_pullspec.split(':')[0]}@{image_digest}"
 
     image_tag = image_pullspec.split(':')[-1]
     record["image_tag"] = image_tag
 
     # Validate SLSA attestation and source image signature
     try:
         await self._validate_build_attestation_and_signature(image_pullspec, metadata.distgit_key)
-    except Exception as e:
-        logger.error(
+    except Exception:
+        logger.exception(
             f"Failed to get SLA attestation / source signature from konflux for image {image_pullspec}, marking build as {KonfluxBuildOutcome.FAILURE}. Error: {e}"
         )
         outcome = KonfluxBuildOutcome.FAILURE

Based on learnings

🧰 Tools
🪛 Ruff (0.14.4)

227-227: Do not catch blind exception: Exception

(BLE001)


228-230: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

🤖 Prompt for AI Agents
In doozer/doozerlib/backend/konflux_image_builder.py around lines 213 to 232,
add defensive null checks for image_pullspec and image_digest before calling
.split(): if either is falsy, raise a ValueError with a clear message (mirror
the validation in update_konflux_db) so you don't call .split() on None; after
successful validation, compute image_pullspec, image_digest, image_tag and set
record fields as before. Replace the logger.error(...) in the except block with
logger.exception(...) to preserve the full traceback, then set outcome =
KonfluxBuildOutcome.FAILURE as currently done.

Comment on lines +653 to 654
@retry(reraise=True, wait=wait_fixed(5), stop=stop_after_attempt(3))
async def update_konflux_db(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

LGTM: Retry decorator handles transient DB connection failures.

The retry decorator appropriately targets database write operations that may fail due to transient network or connection issues. The method writes to both Konflux DB (line 760) and BigQuery (line 774), and retrying on failure prevents build record loss.

Verify that the underlying DB operations (add_build and insert_rows_json) are idempotent or have proper duplicate detection to prevent multiple records on retry:


🏁 Script executed:

#!/bin/bash
# Check if konflux_db.add_build has duplicate handling
rg -n -B5 -A15 'def add_build' --type py

Length of output: 5627



---

🏁 Script executed:

```shell
#!/bin/bash
# Find bq_client initialization and how insert is called
rg -n -B10 -A5 'self.bq_client\s*=' --type py

Length of output: 1511


🏁 Script executed:

#!/bin/bash
# Check for table schema definition and unique constraints
rg -n -B5 -A10 'SchemaField\|PRIMARY\|UNIQUE\|constraint' --type py | head -100

Length of output: 0


🏁 Script executed:

#!/bin/bash
# Find insert_rows_json usage in update_konflux_db
rg -n -B5 -A10 'insert_rows_json' --type py

Length of output: 3234



---

🏁 Script executed:

```shell
#!/bin/bash
# Check table schema creation - look for unique or primary key constraints
rg -n 'CREATE TABLE\|build_id\|record_id' artcommon/artcommonlib/konflux/konflux_db.py -A20

Length of output: 0


🏁 Script executed:

#!/bin/bash
# Check the KonfluxRecord/KonfluxBuildRecord structure for identifiers
rg -n -A30 'class KonfluxBuildRecord\|class KonfluxRecord' artcommon/artcommonlib/konflux/

Length of output: 0


🏁 Script executed:

#!/bin/bash
# Check line 760 context - what exactly calls add_build and whether it checks for existing records
sed -n '755,765p' doozer/doozerlib/backend/konflux_image_builder.py

Length of output: 615



---

🏁 Script executed:

```shell
#!/bin/bash
# Search for KonfluxBuildRecord definition more broadly
find . -name '*.py' -type f | xargs grep -l 'class KonfluxBuildRecord' 2>/dev/null

Length of output: 89


🏁 Script executed:

#!/bin/bash
# Search for KonfluxRecord definition
find . -name '*.py' -type f | xargs grep -l 'class KonfluxRecord' 2>/dev/null

Length of output: 56


🏁 Script executed:

#!/bin/bash
# Check artcommon/artcommonlib/konflux directory structure
fd . artcommon/artcommonlib/konflux --type f -name '*.py'

Length of output: 230


🏁 Script executed:

#!/bin/bash
# Check if there's a migration file or schema file that defines the table
fd 'konflux.*\.py\|schema' . --type f | grep -v __pycache__

Length of output: 0



---

🏁 Script executed:

```shell
#!/bin/bash
# Examine KonfluxBuildRecord class definition
cat artcommon/artcommonlib/konflux/konflux_build_record.py

Length of output: 15785


🏁 Script executed:

#!/bin/bash
# Check BigQueryClient implementation for insert method
cat artcommon/artcommonlib/bigquery.py | head -150

Length of output: 4840



---

🏁 Script executed:

```shell
#!/bin/bash
# Check for any PRIMARY KEY or UNIQUE constraints on record_id in table creation
rg -n -B10 -A20 'record_id\|PRIMARY\|UNIQUE' artcommon/artcommonlib/konflux/konflux_db.py | grep -A20 "def get_schema\|SchemaField"

Length of output: 0


🏁 Script executed:

#!/bin/bash
# Verify whether record_id field is marked as unique in schema generation
rg -n -C5 'generate_build_schema\|get_schema' artcommon/artcommonlib/konflux/

Length of output: 0



---

🏁 Script executed:

```shell
#!/bin/bash
# Look at the full add_build implementation
sed -n '481,509p' artcommon/artcommonlib/konflux/konflux_db.py

Length of output: 913


🏁 Script executed:

#!/bin/bash
# Search for any duplicate checking or idempotency logic in update_konflux_db
sed -n '653,800p' doozer/doozerlib/backend/konflux_image_builder.py

Length of output: 6672


🏁 Script executed:

#!/bin/bash
# Check if there's any query to check for existing records before insert
rg -n 'SELECT.*record_id\|WHERE.*record_id\|EXISTS' doozer/doozerlib/backend/konflux_image_builder.py

Length of output: 0


🏁 Script executed:

#!/bin/bash
# Check all schema field definitions in konflux_db.py to see if record_id has UNIQUE mode
rg -n 'SchemaField\|mode=' artcommon/artcommonlib/konflux/konflux_db.py

Length of output: 0


Critical issue: Retry decorator will create duplicate database records.

The @retry decorator on update_konflux_db (line 653) is unsafe because:

  1. add_build() uses standard SQL INSERT INTO which does not deduplicate on retry (bigquery.py line 75 acknowledges this limitation)
  2. Each retry creates a new record_id (randomly generated UUID) even for the same build
  3. No duplicate detection or idempotency checks exist before inserting
  4. Each retry attempt will insert a new row with identical build_id but different record_id

Fixes required:

  • Either use BigQuery streaming API with insertId for idempotency (as suggested in bigquery.py line 75), OR
  • Add pre-insert duplicate detection (query for existing build_id/record_id before insert), OR
  • Remove the retry decorator and handle failures differently, OR
  • Make record_id deterministic instead of random so retries are naturally idempotent

The same issue applies to the insert_rows_json() call at line 774 for the taskrun table.

🧰 Tools
🪛 Pylint (4.0.2)

[refactor] 654-654: Too many arguments (7/5)

(R0913)


[refactor] 654-654: Too many positional arguments (7/5)

(R0917)


[refactor] 654-654: Too many local variables (35/15)

(R0914)


[refactor] 654-654: Too many statements (56/50)

(R0915)

🤖 Prompt for AI Agents
In doozer/doozerlib/backend/konflux_image_builder.py around lines 653-654 (and
the insert_rows_json call at ~774), the current @retry on update_konflux_db will
cause duplicate DB rows because add_build() uses non-idempotent INSERTs and
record_id is randomly generated; fix by removing the retry decorator and making
the insert idempotent: generate a deterministic record_id (e.g., UUIDv5 or hash
derived from build_id and table name) so retries insert the same primary key,
and for the taskrun insert_rows_json use BigQuery streaming insertId or perform
a pre-insert existence check (SELECT by build_id) to avoid duplicates; ensure
error handling surfaces failures without automatic blind retries.

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Nov 17, 2025

@locriandev: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/security 9b2165d link false /test security

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants