WIP: OCPBUGS-79544: Add [Late] test to collect CRI-O goroutine dumps via SIGUSR1 #31013
bitoku wants to merge 1 commit into openshift:main
Pipeline controller notification: this repository is configured in automatic mode.
@bitoku: This pull request references Jira Issue OCPBUGS-79544, which is invalid. The bug has been updated to refer to the pull request using the external bug tracker.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: bitoku. The full list of commands accepted by this bot can be found here. Needs approval from an approver in each of these files.
Walkthrough

Adds a new Ginkgo node test and helper that trigger CRI-O to write goroutine dumps on each cluster node, collect and parse those dumps for long IO-wait goroutines related to image pulls, and fail if any node has missing dumps or stuck image-pull goroutines.
Sequence Diagram

```mermaid
sequenceDiagram
    participant TestRunner as Test Runner
    participant Node as Cluster Node
    participant CRIO as CRI-O Process
    participant Parser as Dump Parser
    TestRunner->>Node: run shell to find CRI-O PID
    Node-->>TestRunner: PID or CRIO_NOT_FOUND
    TestRunner->>CRIO: send SIGUSR1 to PID
    activate CRIO
    CRIO->>CRIO: write /tmp/crio-goroutine-stacks-*.log
    CRIO-->>TestRunner: dump file created (or timeout)
    deactivate CRIO
    TestRunner->>Node: fetch newest dump file
    Node-->>TestRunner: goroutine dump content or DUMP_TIMEOUT
    TestRunner->>Parser: call findStuckImagePulls(dump)
    Parser->>Parser: split into goroutine blocks
    Parser->>Parser: extract IO wait minutes, filter > 30m
    Parser->>Parser: match stack pattern (net.(*conn).Read -> image/v5/docker.(*bodyReader).Read)
    Parser-->>TestRunner: list of stuck goroutines
    alt stuck goroutines found
        TestRunner->>TestRunner: fail test and emit dumps
    else no stuck goroutines
        TestRunner->>TestRunner: pass
    end
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Pre-merge checks: ✅ 9 passed | ❌ 1 failed (warning)
Warning: there were issues while running some tools. golangci-lint (2.11.4) failed with: can't load config: unsupported version of the configuration: "". See https://golangci-lint.run/docs/product/migration-guide for migration instructions.
/test
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@test/extended/node/crio_goroutinedump.go`:
- Around line 84-88: The shellCmd currently grabs the newest
/tmp/crio-goroutine-stacks-*.log which can be a stale file or empty if none
exists; update the logic in the shellCmd string in crio_goroutinedump.go to
record the list (or latest mtime) of existing matching files before sending kill
-USR1, then after signalling poll/wait (with timeout) for a new file whose mtime
is greater than the pre-signal timestamp (or not in the pre-signal list), and
only then cat that new file; ensure the command returns a non-zero exit and
clear error message if no new dump appears within the timeout to avoid silent
bad reads.
- Around line 90-124: The loop currently treats missing CRI-O dumps as warnings
and continues, allowing partial node coverage to pass; modify the logic in the
loop that calls exutil.DebugNodeRetryWithOptionsAndChroot (the block handling
"CRIO_NOT_FOUND" and errors) to instead record failures (e.g., collect node
names into a missingDumps slice) and after the loop assert that missingDumps is
empty (failing the test with a clear message), or alternatively pre-filter the
nodes slice to only include nodes expected to run CRI-O before invoking
exutil.DebugNodeRetryWithOptionsAndChroot; update references to nodes, dumps,
stuckPulls, goroutineHeaderRe and findStuckImagePulls accordingly so
missing/filtered nodes are handled deterministically and the test fails if any
expected node did not produce a dump.
📒 Files selected for processing (1)
test/extended/node/crio_goroutinedump.go
Force-pushed: 8c04a45 to 1cc9cb0 (Compare)
Scheduling required tests.
Risk analysis has seen new tests most likely introduced by this PR.
New Test Risks for sha: 1cc9cb0
New tests seen in this PR at sha: 1cc9cb0
Force-pushed: 1cc9cb0 to 0418318 (Compare)
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@test/extended/node/crio_goroutinedump.go`:
- Around line 86-93: The MARKER temporary file created by MARKER=$(mktemp) can
leak if the script exits early (e.g., kill fails); add a shell trap immediately
after creating MARKER to ensure it's removed on all exit paths (trap 'rm -f
"$MARKER"' EXIT) and optionally on INT and TERM, then proceed with the existing
kill -USR1 $CRIO_PID and wait loop; update the block containing MARKER, kill
-USR1 $CRIO_PID and the find/seq loop to rely on that trap so the marker is
always cleaned up even on early failure.
- Around line 143-147: The current block unconditionally calls g.Fail when dumps
are present which causes successful runs to fail; modify the CRI-O goroutine
dump handling in the section that references dumps and g.Fail so that dumps are
written to GinkgoWriter (or test logs/artifacts) instead of failing the test,
and only call g.Fail if there are actual fault conditions such as failedNodes or
stuckPulls being non-empty; update the fmt.Sprintf/g.Fail call to use
GinkgoWriter.Printf (or similar) to emit the joined dumps and add a conditional
that preserves failing behavior only when failedNodes or stuckPulls indicate a
real error.
- Around line 123-125: The test currently uses o.Expect(...).To(...) inside the
node loop which aborts the spec on the first malformed CRI-O dump; instead,
introduce a failedNodes collection (e.g., []string failedNodes) declared before
iterating nodes, replace the immediate
o.Expect(goroutineHeaderRe.MatchString(output)).To(...) check inside the loop
with an if !goroutineHeaderRe.MatchString(output) branch that appends a
descriptive failure message (including node.Name and output) to failedNodes and
continues, and after the node loop add a single assertion (or
o.Expect(len(failedNodes)).To(o.Equal(0))) that fails the spec if failedNodes is
non-empty and includes the aggregated failedNodes report in the failure message.
📒 Files selected for processing (1)
test/extended/node/crio_goroutinedump.go
```shell
MARKER=$(mktemp) && \
kill -USR1 $CRIO_PID && \
for i in $(seq 1 10); do \
  NEW=$(find /tmp -maxdepth 1 -name 'crio-goroutine-stacks-*.log' -newer "$MARKER" 2>/dev/null | head -1) && \
  if [ -n "$NEW" ]; then rm -f "$MARKER"; cat "$NEW"; exit 0; fi; \
  sleep 1; \
done && \
rm -f "$MARKER" && echo "DUMP_TIMEOUT" && exit 1
```
Ensure marker cleanup on every exit path.
MARKER is removed in success/timeout branches, but if the command exits early after mktemp (e.g., kill failure), it can leak files in /tmp. Add a shell trap right after marker creation.
Suggested fix

```diff
-MARKER=$(mktemp) && \
-kill -USR1 $CRIO_PID && \
+MARKER=$(mktemp) && \
+trap 'rm -f "$MARKER"' EXIT && \
+kill -USR1 $CRIO_PID && \
 for i in $(seq 1 10); do \
   NEW=$(find /tmp -maxdepth 1 -name 'crio-goroutine-stacks-*.log' -newer "$MARKER" 2>/dev/null | head -1) && \
-  if [ -n "$NEW" ]; then rm -f "$MARKER"; cat "$NEW"; exit 0; fi; \
+  if [ -n "$NEW" ]; then cat "$NEW"; exit 0; fi; \
   sleep 1; \
 done && \
-rm -f "$MARKER" && echo "DUMP_TIMEOUT" && exit 1
+echo "DUMP_TIMEOUT" && exit 1
```
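The trap-based cleanup can be exercised outside the cluster. In this hedged sketch the `touch` stands in for CRI-O writing its dump after SIGUSR1, and the demo filename is illustrative; the marker and polling logic mirror the suggested fix.

```shell
# Marker + trap pattern: the marker is removed on every exit path,
# and we only read files strictly newer than it.
MARKER=$(mktemp)
trap 'rm -f "$MARKER"' EXIT
sleep 1                                  # make the simulated dump strictly newer
DUMP="/tmp/demo-goroutine-stacks-$$.log" # stand-in for the real CRI-O dump
touch "$DUMP"
for i in $(seq 1 5); do
  NEW=$(find /tmp -maxdepth 1 -name "demo-goroutine-stacks-$$.log" -newer "$MARKER" 2>/dev/null | head -1)
  if [ -n "$NEW" ]; then rm -f "$DUMP"; echo "FOUND"; exit 0; fi
  sleep 1
done
echo "DUMP_TIMEOUT"
exit 1
```

Because the trap fires on EXIT, the marker is cleaned up whether the loop finds the file, times out, or the script aborts early.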
```go
o.Expect(goroutineHeaderRe.MatchString(output)).To(o.BeTrue(),
	"expected goroutine stacks in CRI-O dump from node %s, got:\n%s", node.Name, output)
```
Avoid hard-failing inside the node loop.
Line 123 aborts the spec on the first malformed output, so later nodes are never checked. Record this as a node failure and continue, then fail once with the aggregated failedNodes report.
Suggested fix

```diff
-	o.Expect(goroutineHeaderRe.MatchString(output)).To(o.BeTrue(),
-		"expected goroutine stacks in CRI-O dump from node %s, got:\n%s", node.Name, output)
+	if !goroutineHeaderRe.MatchString(output) {
+		failedNodes = append(failedNodes,
+			fmt.Sprintf("%s: expected goroutine stacks in CRI-O dump, got:\n%s", node.Name, output))
+		continue
+	}
```
```go
// Always fail so goroutine dumps are visible in test results.
// The test is marked Informing so this won't block merges.
if len(dumps) > 0 {
	g.Fail(fmt.Sprintf("CRI-O goroutine dumps collected:\n%s", strings.Join(dumps, "\n\n")))
}
```
This makes the test always fail on success paths.
Line 146 forces failure whenever dumps are collected, so a healthy run can never pass. Prefer writing dumps to GinkgoWriter (or logs/artifacts) and reserve failure for actual fault conditions (failedNodes / stuckPulls).
Suggested fix

```diff
-	// Always fail so goroutine dumps are visible in test results.
-	// The test is marked Informing so this won't block merges.
-	if len(dumps) > 0 {
-		g.Fail(fmt.Sprintf("CRI-O goroutine dumps collected:\n%s", strings.Join(dumps, "\n\n")))
-	}
+	// Emit dumps for visibility without forcing spec failure.
+	if len(dumps) > 0 {
+		g.By("CRI-O goroutine dumps collected")
+		_, _ = fmt.Fprintf(g.GinkgoWriter, "%s\n", strings.Join(dumps, "\n\n"))
+	}
```
Scheduling required tests.
@bitoku: The following tests failed.
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
Risk analysis has seen new tests most likely introduced by this PR.
New Test Risks for sha: 0418318
New tests seen in this PR at sha: 0418318
Signed-off-by: Ayato Tokubi <atokubi@redhat.com>
Force-pushed: 0418318 to f1853da (Compare)
Actionable comments posted: 1
♻️ Duplicate comments (2)

test/extended/node/crio_goroutinedump.go

178-182: ⚠️ Potential issue | 🟠 Major — Do not force-fail healthy runs just to emit dumps.

Line 180 makes the spec fail whenever dumps exist, so a healthy run can never pass. Emit dumps to GinkgoWriter and keep failure paths tied to real faults (failedNodes / stuckPulls).

Suggested change

```diff
-	// Always fail so goroutine dumps are visible in test results.
-	// The test is marked Informing so this won't block merges.
-	if len(dumps) > 0 {
-		g.Fail(fmt.Sprintf("CRI-O goroutine dumps collected:\n%s", strings.Join(dumps, "\n\n")))
-	}
+	// Emit dumps for visibility without forcing failure on success paths.
+	if len(dumps) > 0 {
+		g.By("CRI-O goroutine dumps collected")
+		_, _ = fmt.Fprintf(g.GinkgoWriter, "%s\n", strings.Join(dumps, "\n\n"))
+	}
```
158-160: ⚠️ Potential issue | 🟠 Major — Avoid failing mid-loop; aggregate malformed dump nodes and fail once.

Line 158 aborts on the first bad dump format, so you lose complete per-node diagnostics. Append this as a node failure and continue, then assert once after the loop.

Suggested change

```diff
-	o.Expect(goroutineHeaderRe.MatchString(r.output)).To(o.BeTrue(),
-		"expected goroutine stacks in CRI-O dump from node %s, output length=%d, got:\n%s", r.name, len(r.output), r.output)
+	if !goroutineHeaderRe.MatchString(r.output) {
+		failedNodes = append(failedNodes,
+			fmt.Sprintf("%s: expected goroutine stacks in CRI-O dump, output length=%d, got:\n%s", r.name, len(r.output), r.output))
+		continue
+	}
```
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@test/extended/node/crio_goroutinedump.go`:
- Around line 115-130: The test spawns a goroutine per node concurrently which
can overload the cluster/APIs; limit concurrent debug sessions by adding a
concurrency limiter (e.g., a buffered semaphore channel or worker pool) around
the loop so only a fixed number (configurable, e.g., 5) of goroutines run at
once; acquire the semaphore before launching the goroutine that calls
exutil.DebugNodeRetryWithOptionsAndChroot and release it (close/receive) in the
goroutine's defer along with wg.Done(), ensuring results[i] still gets set and
g.GinkgoRecover() remains called; reference the existing loop variables (nodes,
i, node.Name), wg, results, nodeResult, shellCmd and
exutil.DebugNodeRetryWithOptionsAndChroot when implementing the limiter.
---
Duplicate comments:
In `@test/extended/node/crio_goroutinedump.go`:
- Around line 178-182: The test currently calls g.Fail when any CRI-O goroutine
dumps exist (variable dumps), forcing healthy runs to fail; change this so dumps
are written to GinkgoWriter instead and only trigger g.Fail when real failure
conditions remain (e.g., failedNodes or stuckPulls). Locate the block using
dumps and g.Fail in crio_goroutinedump.go, replace the unconditional
g.Fail(fmt.Sprintf(...)) with code that prints the joined dumps to GinkgoWriter
(with context header) and leaves existing failure logic that checks failedNodes
and stuckPulls intact so only those conditions cause test failures.
- Around line 158-160: The test currently aborts on the first malformed per-node
CRI-O dump by calling o.Expect(goroutineHeaderRe.MatchString(r.output)) inside
the loop; instead, collect failures and assert once after the loop. Create a
slice (e.g., malformedNodes []string) before iterating, replace the inline
o.Expect check in the loop with an if !goroutineHeaderRe.MatchString(r.output) {
malformedNodes = append(malformedNodes, fmt.Sprintf("node %s: output
length=%d\n%s", r.name, len(r.output), r.output)); continue } and after the loop
assert that malformedNodes is empty (e.g.,
o.Expect(len(malformedNodes)).To(o.Equal(0), "malformed CRI-O dumps:\n%s",
strings.Join(malformedNodes, "\n"))), so you accumulate diagnostics for all
nodes rather than failing mid-loop.
📒 Files selected for processing (1)
test/extended/node/crio_goroutinedump.go
```go
for i, node := range nodes {
	wg.Add(1)
	go func(i int, nodeName string) {
		defer g.GinkgoRecover()
		defer wg.Done()

		g.By(fmt.Sprintf("Sending SIGUSR1 to CRI-O on node %s", nodeName))

		output, err := exutil.DebugNodeRetryWithOptionsAndChroot(
			oc, nodeName, "default",
			"sh", "-c", shellCmd,
		)
		results[i] = nodeResult{name: nodeName, output: output, err: err}
	}(i, node.Name)
}
wg.Wait()
```
Bound parallel debug sessions to avoid cluster/API overload.
Line 115 currently spawns one debug flow per node at once. On larger clusters this can create a burst of debug pods and induce throttling/timeouts, making this test flaky.
Suggested change

```diff
 	results := make([]nodeResult, len(nodes))
 	var wg sync.WaitGroup
+	sem := make(chan struct{}, 10) // tune as needed
 	for i, node := range nodes {
 		wg.Add(1)
 		go func(i int, nodeName string) {
 			defer g.GinkgoRecover()
 			defer wg.Done()
+			sem <- struct{}{}
+			defer func() { <-sem }()
 			g.By(fmt.Sprintf("Sending SIGUSR1 to CRI-O on node %s", nodeName))
```
This test runs after all other tests to capture CRI-O's goroutine state on every node, providing visibility into stuck operations after the suite has exercised CRI-O under load.
The test sends SIGUSR1 to CRI-O on each node, reads the resulting goroutine dump from /tmp/crio-goroutine-stacks-*.log, and fails hard if any goroutine is stuck in an image pull (IO wait > 30 min in docker.(*bodyReader).Read -> net.(*conn).Read).