@zdrapela
Member

@zdrapela zdrapela commented Dec 11, 2025

Description

This PR fixes error-handling issues in the CI/CD pipeline scripts that caused:

  1. Scripts exiting prematurely when Playwright tests failed
  2. The cleanup function running multiple times
  3. kubectl logs commands hanging indefinitely when pods were unresponsive

See this log, where the issues happened: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/redhat-developer_rhdh/3830/pull-ci-redhat-developer-rhdh-main-e2e-ocp-helm/1998787878628364288/artifacts/e2e-ocp-helm/redhat-developer-rhdh-ocp-helm/build-log.txt

Root Cause Analysis

The CI script was configured with set -o errexit and trap cleanup EXIT INT ERR. In addition, configure_external_postgres_db() ran set -euo pipefail, which enabled pipefail for the whole script rather than only inside the function. Together, these settings caused:

  1. Pipeline failures propagating: with pipefail enabled, a failing yarn playwright test | tee pipeline returned a non-zero status, so errexit terminated the entire script
  2. Double cleanup execution: both the ERR and EXIT traps fired on a failure, and each one called cleanup
  3. Hanging log collection: kubectl logs had no timeout, causing hangs of 40+ minutes when pods were stuck
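The first root cause can be reproduced in isolation. The following is a minimal bash sketch, not the actual CI code: configure_db and run_tests are hypothetical stand-ins for configure_external_postgres_db() and the yarn playwright test step. It shows that a shell option set inside a function stays in effect after the function returns and changes the exit status of a later pipeline.

```shell
#!/usr/bin/env bash
# Start from the bash default so the demonstration is repeatable.
set +o pipefail

# Hypothetical stand-in for configure_external_postgres_db(): options set
# inside a function apply to the whole shell, not just the function body.
configure_db() {
  set -o pipefail
}

# Hypothetical stand-in for a failing "yarn playwright test" run.
run_tests() {
  echo "1 test failed"
  return 1
}

# Before the option leaks, the pipeline reports the status of tee (0).
status_before=0
run_tests | tee /dev/null || status_before=$?

configure_db

# After the leak, the pipeline reports the failing status of run_tests (1).
# Under errexit, an unguarded pipeline like this would terminate the script.
status_after=0
run_tests | tee /dev/null || status_after=$?

echo "before=${status_before} after=${status_after}"

# Restore the default for anything that runs afterwards.
set +o pipefail
```

The `|| status=$?` guards are only there so the sketch itself survives when run under errexit; the unguarded form is what the original script effectively had.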

Changes

openshift-ci-tests.sh

  • Simplified the trap to EXIT only (removed INT and ERR)
  • The EXIT trap fires exactly once when the script exits, whether it succeeds or fails, which prevents duplicate cleanup
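The effect of the trap change can be checked with a small experiment. This sketch assumes bash and uses a hypothetical helper, count_cleanups, that runs a tiny failing script in a child shell with the given trap signals and counts how often the handler runs. The inline script is an illustration, not the real openshift-ci-tests.sh.

```shell
#!/usr/bin/env bash
# Run a failing script in a child bash with cleanup trapped on the signals
# passed as arguments, and print how many times cleanup was executed.
count_cleanups() {
  bash -c '
    set -o errexit
    cleanup() { echo "cleanup"; }
    trap cleanup "$@"
    false   # simulated failing command
  ' demo "$@" 2>/dev/null | grep -c "cleanup"
}

old_count=$(count_cleanups EXIT INT ERR)   # trap list before this PR
new_count=$(count_cleanups EXIT)           # trap list after this PR

echo "EXIT INT ERR: ${old_count} run(s); EXIT only: ${new_count} run(s)"
# prints: EXIT INT ERR: 2 run(s); EXIT only: 1 run(s)
```

With ERR in the list, the failing command triggers the ERR trap first, and the errexit-driven exit then triggers the EXIT trap, so cleanup runs twice.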

utils.sh

  • retrieve_pod_logs(): Added a 30-second timeout to the kubectl logs commands to prevent hangs
  • configure_external_postgres_db(): Removed the unnecessary set -euo pipefail, whose options leaked into the global shell state
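One way to bound a log fetch is to wrap the command in timeout(1) from GNU coreutils. The sketch below is an assumption about the general shape, not the code in retrieve_pod_logs(): fetch_with_timeout is a hypothetical helper, and the pod, container, and namespace variables in the comment are placeholders.

```shell
#!/usr/bin/env bash
# Run a command with a time limit and report when the limit was hit.
# GNU timeout exits with status 124 when it has to stop the command.
fetch_with_timeout() {
  local seconds=$1
  shift
  local rc=0
  timeout "${seconds}" "$@" || rc=$?
  if [ "${rc}" -eq 124 ]; then
    echo "timed out after ${seconds}s: $*" >&2
  fi
  return "${rc}"
}

# In CI the call might look like this (placeholder variables):
#   fetch_with_timeout 30 kubectl logs "$pod" -c "$container" -n "$namespace" > "$pod.log"

# Local demonstration with a command that outlasts the limit.
demo_rc=0
fetch_with_timeout 1 sleep 5 || demo_rc=$?
echo "exit status: ${demo_rc}"
```

Capturing the status with `|| rc=$?` keeps a timed-out fetch from terminating a script that runs under errexit, so log collection can move on to the next pod.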

Expected Behavior

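| Scenario              | Before                       | After                             |
|-----------------------|------------------------------|-----------------------------------|
| Playwright tests fail | Script exits immediately     | Script continues, records failure |
| kubectl logs hangs    | Waits indefinitely (40+ min) | Times out after 30 seconds        |
| Cleanup on error      | Runs 2-3 times               | Runs exactly once                 |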
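The "script continues, records failure" behavior could be implemented along these lines. This is a sketch under assumptions: run_tests stands in for yarn playwright test, OVERALL_RESULT is a hypothetical variable name, and the PR does not state how the real script records the failure. It relies on bash's PIPESTATUS array and on pipefail being off.

```shell
#!/usr/bin/env bash
# pipefail stays off, so a failing test run does not fail the pipeline.
set +o pipefail

# Hypothetical stand-in for "yarn playwright test" with failing tests.
run_tests() {
  echo "2 tests failed"
  return 1
}

OVERALL_RESULT=0   # hypothetical name for the recorded result

# The pipeline status is that of tee, but PIPESTATUS[0] still holds the
# exit status of the test command itself. It must be read immediately.
run_tests | tee /dev/null
test_rc=${PIPESTATUS[0]}

if [ "${test_rc}" -ne 0 ]; then
  OVERALL_RESULT=1
  echo "tests failed with status ${test_rc}, continuing"
fi

echo "collecting logs and artifacts (result so far: ${OVERALL_RESULT})"
```

Reading PIPESTATUS directly after the pipeline matters because any later command overwrites it.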

Which issue(s) does this PR fix

  • Fixes intermittent CI failures in which scripts hung or exited prematurely
  • Addresses the failures seen in the build log linked in the Description

PR acceptance criteria

  • GitHub Actions are completed and successful
  • Unit Tests are updated and passing
  • E2E Tests are updated and passing
  • Documentation is updated if necessary (requirement for new features)
  • Add a screenshot if the change is UX/UI related

How to test changes / Special notes to the reviewer

  1. Run the e2e tests and verify that the script continues even when some tests fail
  2. Verify in the logs that cleanup runs only once
  3. Verify that log collection no longer hangs for 40+ minutes

@zdrapela zdrapela changed the title from "chore(ci): fix error handling" to "chore(ci): fix error handling & add timeout" Dec 11, 2025
@gustavolira
Member

/approve
/lgtm

@openshift-ci openshift-ci bot added the lgtm label Dec 11, 2025
@openshift-ci

openshift-ci bot commented Dec 11, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: gustavolira

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@zdrapela
Member Author

/test e2e-ocp-helm

@openshift-merge-bot openshift-merge-bot bot merged commit 8db5ff1 into redhat-developer:main Dec 11, 2025
20 checks passed