WIP: OCPBUGS-79544: Add [Late] test to collect CRI-O goroutine dumps via SIGUSR1#31013

Open
bitoku wants to merge 1 commit into openshift:main from bitoku:crio-goroutine-dump-late-test

Conversation

@bitoku
Contributor

@bitoku bitoku commented Apr 15, 2026

This test runs after all other tests to capture CRI-O's goroutine state on every node, providing visibility into stuck operations after the suite has exercised CRI-O under load.

The test sends SIGUSR1 to CRI-O on each node, reads the resulting goroutine dump from /tmp/crio-goroutine-stacks-*.log, and fails hard if any goroutine is stuck in an image pull (IO wait > 30 min in docker.(*bodyReader).Read -> net.(*conn).Read).

Summary by CodeRabbit

  • Tests
    • Added a diagnostic node-level test that collects container runtime goroutine dumps from each node, detects stuck image-pull goroutines with prolonged I/O wait, and fails if nodes do not produce dumps or if stuck goroutines are found. Collected dump output is included in test results to aid investigation; the test is non-blocking/informing.

@openshift-merge-bot
Contributor

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after the lgtm label is added, depending on the repository configuration. The pipeline controller automatically detects which contexts are required and uses /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To manually trigger all second-stage jobs, use the /pipeline required command.

This repository is configured in: automatic mode

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 15, 2026
@openshift-ci-robot openshift-ci-robot added jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Apr 15, 2026
@openshift-ci-robot

@bitoku: This pull request references Jira Issue OCPBUGS-79544, which is invalid:

  • expected the bug to target the "5.0.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

This test runs after all other tests to capture CRI-O's goroutine state on every node, providing visibility into stuck operations after the suite has exercised CRI-O under load.

The test sends SIGUSR1 to CRI-O on each node, reads the resulting goroutine dump from /tmp/crio-goroutine-stacks-*.log, and fails hard if any goroutine is stuck in an image pull (IO wait > 30 min in docker.(*bodyReader).Read -> net.(*conn).Read).

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot requested review from dgrisonnet and rphillips April 15, 2026 13:27
@openshift-ci
Contributor

openshift-ci bot commented Apr 15, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: bitoku
Once this PR has been reviewed and has the lgtm label, please assign mrunalp for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details: Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@coderabbitai

coderabbitai bot commented Apr 15, 2026

Walkthrough

Adds a new Ginkgo node test and helper that trigger CRI‑O to write goroutine dumps on each cluster node, collect and parse those dumps for long IO‑wait goroutines related to image pulls, and fail if any node has missing dumps or stuck image‑pull goroutines.

Changes

Cohort / File(s): CRI-O Goroutine Dump Test — test/extended/node/crio_goroutinedump.go
Summary: New test file. Adds findStuckImagePulls(dump string) []string, which splits CRI‑O goroutine dumps into goroutine blocks, extracts IO wait durations, filters for IO wait > 30 minutes, and matches stacks where net.(*conn).Read appears before image/v5/docker.(*bodyReader).Read. Adds a Ginkgo [sig-node][Late] test that finds the CRI‑O PID on each node, sends SIGUSR1, waits for /tmp/crio-goroutine-stacks-*.log, retrieves dumps (handling PrivateTmp), aggregates per-node results and failure reasons, asserts every node produced a dump, and fails if any stuck image‑pull goroutines are detected (emitting collected dumps).
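The walkthrough describes findStuckImagePulls only at a high level. The sketch below is a minimal, hypothetical reconstruction of that parsing logic — block splitting on blank lines, an IO-wait threshold of 30 minutes, and the stack-order match of net.(*conn).Read before image/v5/docker.(*bodyReader).Read. The regex and helper structure are assumptions for illustration, not the PR's actual code.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// Matches a Go goroutine header such as "goroutine 42 [IO wait, 45 minutes]:"
// and captures the wait duration in minutes (assumed format).
var ioWaitRe = regexp.MustCompile(`^goroutine \d+ \[IO wait, (\d+) minutes\]`)

// findStuckImagePulls returns the goroutine blocks that have been in IO wait
// for more than 30 minutes inside an image-pull read path.
func findStuckImagePulls(dump string) []string {
	var stuck []string
	// Goroutine blocks in a Go stack dump are separated by blank lines.
	for _, block := range strings.Split(dump, "\n\n") {
		m := ioWaitRe.FindStringSubmatch(block)
		if m == nil {
			continue
		}
		minutes, _ := strconv.Atoi(m[1])
		if minutes <= 30 {
			continue
		}
		// Stacks print innermost frame first, so net.(*conn).Read must
		// appear before image/v5/docker.(*bodyReader).Read in the block.
		netIdx := strings.Index(block, "net.(*conn).Read")
		bodyIdx := strings.Index(block, "image/v5/docker.(*bodyReader).Read")
		if netIdx >= 0 && bodyIdx >= 0 && netIdx < bodyIdx {
			stuck = append(stuck, block)
		}
	}
	return stuck
}

func main() {
	dump := `goroutine 42 [IO wait, 45 minutes]:
internal/poll.runtime_pollWait(...)
net.(*conn).Read(...)
github.com/containers/image/v5/docker.(*bodyReader).Read(...)

goroutine 7 [IO wait, 5 minutes]:
net.(*conn).Read(...)
github.com/containers/image/v5/docker.(*bodyReader).Read(...)`
	fmt.Println(len(findStuckImagePulls(dump))) // prints "1": only the 45-minute goroutine qualifies
}
```

Only the first goroutine exceeds the 30-minute threshold, so a single block is reported; the 5-minute reader is treated as a normal in-flight pull.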

Sequence Diagram

sequenceDiagram
    participant TestRunner as Test Runner
    participant Node as Cluster Node
    participant CRIO as CRI‑O Process
    participant Parser as Dump Parser

    TestRunner->>Node: run shell to find CRI‑O PID
    Node-->>TestRunner: PID or CRIO_NOT_FOUND
    TestRunner->>CRIO: send SIGUSR1 to PID
    activate CRIO
    CRIO->>CRIO: write /tmp/crio-goroutine-stacks-*.log
    CRIO-->>TestRunner: dump file created (or timeout)
    deactivate CRIO
    TestRunner->>Node: fetch newest dump file
    Node-->>TestRunner: goroutine dump content or DUMP_TIMEOUT
    TestRunner->>Parser: call findStuckImagePulls(dump)
    Parser->>Parser: split into goroutine blocks
    Parser->>Parser: extract IO wait minutes, filter > 30m
    Parser->>Parser: match stack pattern (net.(*conn).Read -> image/v5/docker.(*bodyReader).Read)
    Parser-->>TestRunner: list of stuck goroutines
    alt stuck goroutines found
        TestRunner->>TestRunner: fail test and emit dumps
    else no stuck goroutines
        TestRunner->>TestRunner: pass
    end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 9 | ❌ 1

❌ Failed checks (1 warning)

Check name: Test Structure And Quality — ⚠️ Warning
Explanation: The test fails unbounded-concurrency requirements: it spawns debug pods for all nodes simultaneously without a semaphore limit (lines 115-126), acknowledged in review comments as causing flakiness on larger clusters. It also lacks an error message on the assertion at line 80.
Resolution: Add a buffered semaphore channel to limit concurrent debug pods (e.g., sem := make(chan struct{}, 10)), add a failure message to the GetAllClusterNodes assertion, and wrap the test logic in a context with timeout to prevent indefinite waits.
✅ Passed checks (9 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and specifically describes the main change: adding a [Late] test for collecting CRI-O goroutine dumps via SIGUSR1, matching the actual changeset which introduces this test.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Stable And Deterministic Test Names ✅ Passed The test names '[sig-node][Late]' and 'CRI-O should report goroutine stacks on all nodes' are static and deterministic with no dynamic content.
Microshift Test Compatibility ✅ Passed The test crio_goroutinedump.go uses only standard Kubernetes APIs and kubectl debug commands compatible with MicroShift, with no OpenShift-specific dependencies or HA assumptions.
Single Node Openshift (Sno) Test Compatibility ✅ Passed Test uses GetAllClusterNodes() without multi-node assertions and functions correctly on SNO with one node.
Topology-Aware Scheduling Compatibility ✅ Passed Test adds a Ginkgo test file that dynamically enumerates all cluster nodes and runs debug pods without assuming any specific HA topology or introducing scheduling constraints.
Ote Binary Stdout Contract ✅ Passed The test file contains no direct stdout writes. All output operations use Ginkgo/Gomega framework functions (g.By, g.Fail, o.Expect) and fmt.Sprintf for string construction only.
Ipv6 And Disconnected Network Test Compatibility ✅ Passed Test file contains no IPv4 assumptions, hardcoded IP addresses, or external connectivity requirements. All operations are cluster-internal and node-local.


Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 golangci-lint (2.11.4)

Error: can't load config: unsupported version of the configuration: "" See https://golangci-lint.run/docs/product/migration-guide for migration instructions


Comment @coderabbitai help to get the list of available commands and usage tips.

@bitoku
Contributor Author

bitoku commented Apr 15, 2026

/test


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/extended/node/crio_goroutinedump.go`:
- Around line 84-88: The shellCmd currently grabs the newest
/tmp/crio-goroutine-stacks-*.log which can be a stale file or empty if none
exists; update the logic in the shellCmd string in crio_goroutinedump.go to
record the list (or latest mtime) of existing matching files before sending kill
-USR1, then after signalling poll/wait (with timeout) for a new file whose mtime
is greater than the pre-signal timestamp (or not in the pre-signal list), and
only then cat that new file; ensure the command returns a non-zero exit and
clear error message if no new dump appears within the timeout to avoid silent
bad reads.
- Around line 90-124: The loop currently treats missing CRI-O dumps as warnings
and continues, allowing partial node coverage to pass; modify the logic in the
loop that calls exutil.DebugNodeRetryWithOptionsAndChroot (the block handling
"CRIO_NOT_FOUND" and errors) to instead record failures (e.g., collect node
names into a missingDumps slice) and after the loop assert that missingDumps is
empty (failing the test with a clear message), or alternatively pre-filter the
nodes slice to only include nodes expected to run CRI-O before invoking
exutil.DebugNodeRetryWithOptionsAndChroot; update references to nodes, dumps,
stuckPulls, goroutineHeaderRe and findStuckImagePulls accordingly so
missing/filtered nodes are handled deterministically and the test fails if any
expected node did not produce a dump.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: f0f2719c-5fb3-41e3-b81a-83372c196fdc

📥 Commits

Reviewing files that changed from the base of the PR and between 9df27cd and 8c04a45.

📒 Files selected for processing (1)
  • test/extended/node/crio_goroutinedump.go

@bitoku bitoku force-pushed the crio-goroutine-dump-late-test branch from 8c04a45 to 1cc9cb0 on April 15, 2026 14:05
@openshift-merge-bot
Contributor

Scheduling required tests:
/test e2e-aws-csi
/test e2e-aws-ovn-fips
/test e2e-aws-ovn-microshift
/test e2e-aws-ovn-microshift-serial
/test e2e-aws-ovn-serial-1of2
/test e2e-aws-ovn-serial-2of2
/test e2e-gcp-csi
/test e2e-gcp-ovn
/test e2e-gcp-ovn-upgrade
/test e2e-metal-ipi-ovn-ipv6
/test e2e-vsphere-ovn
/test e2e-vsphere-ovn-upi

@openshift-trt

openshift-trt bot commented Apr 15, 2026

Risk analysis has seen new tests most likely introduced by this PR.
Please ensure that new tests meet guidelines for naming and stability.

New Test Risks for sha: 1cc9cb0

Job Name New Test Risk
pull-ci-openshift-origin-main-e2e-aws-ovn-microshift High - "[sig-node][Late] CRI-O should report goroutine stacks on all nodes [Suite:openshift/conformance/parallel]" is a new test that failed 1 time(s) against the current commit
pull-ci-openshift-origin-main-e2e-aws-ovn-microshift-serial High - "[sig-node][Late] CRI-O should report goroutine stacks on all nodes [Suite:openshift/conformance/parallel]" is a new test that failed 1 time(s) against the current commit

New tests seen in this PR at sha: 1cc9cb0

  • "[sig-node][Late] CRI-O should report goroutine stacks on all nodes [Suite:openshift/conformance/parallel]" [Total: 9, Pass: 7, Fail: 2, Flake: 0]

@bitoku bitoku force-pushed the crio-goroutine-dump-late-test branch from 1cc9cb0 to 0418318 on April 16, 2026 00:29

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/extended/node/crio_goroutinedump.go`:
- Around line 86-93: The MARKER temporary file created by MARKER=$(mktemp) can
leak if the script exits early (e.g., kill fails); add a shell trap immediately
after creating MARKER to ensure it's removed on all exit paths (trap 'rm -f
"$MARKER"' EXIT) and optionally on INT and TERM, then proceed with the existing
kill -USR1 $CRIO_PID and wait loop; update the block containing MARKER, kill
-USR1 $CRIO_PID and the find/seq loop to rely on that trap so the marker is
always cleaned up even on early failure.
- Around line 143-147: The current block unconditionally calls g.Fail when dumps
are present which causes successful runs to fail; modify the CRI-O goroutine
dump handling in the section that references dumps and g.Fail so that dumps are
written to GinkgoWriter (or test logs/artifacts) instead of failing the test,
and only call g.Fail if there are actual fault conditions such as failedNodes or
stuckPulls being non-empty; update the fmt.Sprintf/g.Fail call to use
GinkgoWriter.Printf (or similar) to emit the joined dumps and add a conditional
that preserves failing behavior only when failedNodes or stuckPulls indicate a
real error.
- Around line 123-125: The test currently uses o.Expect(...).To(...) inside the
node loop which aborts the spec on the first malformed CRI-O dump; instead,
introduce a failedNodes collection (e.g., []string failedNodes) declared before
iterating nodes, replace the immediate
o.Expect(goroutineHeaderRe.MatchString(output)).To(...) check inside the loop
with an if !goroutineHeaderRe.MatchString(output) branch that appends a
descriptive failure message (including node.Name and output) to failedNodes and
continues, and after the node loop add a single assertion (or
o.Expect(len(failedNodes)).To(o.Equal(0))) that fails the spec if failedNodes is
non-empty and includes the aggregated failedNodes report in the failure message.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 365ad2c5-fee6-4164-bdd6-d5f0aabddbc0

📥 Commits

Reviewing files that changed from the base of the PR and between 1cc9cb0 and 0418318.

📒 Files selected for processing (1)
  • test/extended/node/crio_goroutinedump.go

Comment on lines +86 to +93
MARKER=$(mktemp) && \
kill -USR1 $CRIO_PID && \
for i in $(seq 1 10); do \
NEW=$(find /tmp -maxdepth 1 -name 'crio-goroutine-stacks-*.log' -newer "$MARKER" 2>/dev/null | head -1) && \
if [ -n "$NEW" ]; then rm -f "$MARKER"; cat "$NEW"; exit 0; fi; \
sleep 1; \
done && \
rm -f "$MARKER" && echo "DUMP_TIMEOUT" && exit 1`


⚠️ Potential issue | 🟡 Minor

Ensure marker cleanup on every exit path.

MARKER is removed in success/timeout branches, but if the command exits early after mktemp (e.g., kill failure), it can leak files in /tmp. Add a shell trap right after marker creation.

Suggested fix
-MARKER=$(mktemp) && \
-kill -USR1 $CRIO_PID && \
+MARKER=$(mktemp) && \
+trap 'rm -f "$MARKER"' EXIT && \
+kill -USR1 $CRIO_PID && \
 for i in $(seq 1 10); do \
   NEW=$(find /tmp -maxdepth 1 -name 'crio-goroutine-stacks-*.log' -newer "$MARKER" 2>/dev/null | head -1) && \
-  if [ -n "$NEW" ]; then rm -f "$MARKER"; cat "$NEW"; exit 0; fi; \
+  if [ -n "$NEW" ]; then cat "$NEW"; exit 0; fi; \
   sleep 1; \
 done && \
-rm -f "$MARKER" && echo "DUMP_TIMEOUT" && exit 1`
+echo "DUMP_TIMEOUT" && exit 1`
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/extended/node/crio_goroutinedump.go` around lines 86 - 93, The MARKER
temporary file created by MARKER=$(mktemp) can leak if the script exits early
(e.g., kill fails); add a shell trap immediately after creating MARKER to ensure
it's removed on all exit paths (trap 'rm -f "$MARKER"' EXIT) and optionally on
INT and TERM, then proceed with the existing kill -USR1 $CRIO_PID and wait loop;
update the block containing MARKER, kill -USR1 $CRIO_PID and the find/seq loop
to rely on that trap so the marker is always cleaned up even on early failure.

Comment on lines +123 to +125
o.Expect(goroutineHeaderRe.MatchString(output)).To(o.BeTrue(),
"expected goroutine stacks in CRI-O dump from node %s, got:\n%s", node.Name, output)



⚠️ Potential issue | 🟠 Major

Avoid hard-failing inside the node loop.

Line 123 aborts the spec on the first malformed output, so later nodes are never checked. Record this as a node failure and continue, then fail once with the aggregated failedNodes report.

Suggested fix
-				o.Expect(goroutineHeaderRe.MatchString(output)).To(o.BeTrue(),
-					"expected goroutine stacks in CRI-O dump from node %s, got:\n%s", node.Name, output)
+				if !goroutineHeaderRe.MatchString(output) {
+					failedNodes = append(failedNodes,
+						fmt.Sprintf("%s: expected goroutine stacks in CRI-O dump, got:\n%s", node.Name, output))
+					continue
+				}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/extended/node/crio_goroutinedump.go` around lines 123 - 125, The test
currently uses o.Expect(...).To(...) inside the node loop which aborts the spec
on the first malformed CRI-O dump; instead, introduce a failedNodes collection
(e.g., []string failedNodes) declared before iterating nodes, replace the
immediate o.Expect(goroutineHeaderRe.MatchString(output)).To(...) check inside
the loop with an if !goroutineHeaderRe.MatchString(output) branch that appends a
descriptive failure message (including node.Name and output) to failedNodes and
continues, and after the node loop add a single assertion (or
o.Expect(len(failedNodes)).To(o.Equal(0))) that fails the spec if failedNodes is
non-empty and includes the aggregated failedNodes report in the failure message.

Comment on lines +143 to +147
// Always fail so goroutine dumps are visible in test results.
// The test is marked Informing so this won't block merges.
if len(dumps) > 0 {
g.Fail(fmt.Sprintf("CRI-O goroutine dumps collected:\n%s", strings.Join(dumps, "\n\n")))
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

This makes the test always fail on success paths.

Line 146 forces failure whenever dumps are collected, so a healthy run can never pass. Prefer writing dumps to GinkgoWriter (or logs/artifacts) and reserve failure for actual fault conditions (failedNodes / stuckPulls).

Suggested fix
-			// Always fail so goroutine dumps are visible in test results.
-			// The test is marked Informing so this won't block merges.
-			if len(dumps) > 0 {
-				g.Fail(fmt.Sprintf("CRI-O goroutine dumps collected:\n%s", strings.Join(dumps, "\n\n")))
-			}
+			// Emit dumps for visibility without forcing spec failure.
+			if len(dumps) > 0 {
+				g.By("CRI-O goroutine dumps collected")
+				_, _ = fmt.Fprintf(g.GinkgoWriter, "%s\n", strings.Join(dumps, "\n\n"))
+			}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/extended/node/crio_goroutinedump.go` around lines 143 - 147, The current
block unconditionally calls g.Fail when dumps are present which causes
successful runs to fail; modify the CRI-O goroutine dump handling in the section
that references dumps and g.Fail so that dumps are written to GinkgoWriter (or
test logs/artifacts) instead of failing the test, and only call g.Fail if there
are actual fault conditions such as failedNodes or stuckPulls being non-empty;
update the fmt.Sprintf/g.Fail call to use GinkgoWriter.Printf (or similar) to
emit the joined dumps and add a conditional that preserves failing behavior only
when failedNodes or stuckPulls indicate a real error.

@openshift-merge-bot
Contributor

Scheduling required tests:
/test e2e-aws-csi
/test e2e-aws-ovn-fips
/test e2e-aws-ovn-microshift
/test e2e-aws-ovn-microshift-serial
/test e2e-aws-ovn-serial-1of2
/test e2e-aws-ovn-serial-2of2
/test e2e-gcp-csi
/test e2e-gcp-ovn
/test e2e-gcp-ovn-upgrade
/test e2e-metal-ipi-ovn-ipv6
/test e2e-vsphere-ovn
/test e2e-vsphere-ovn-upi

@openshift-ci
Contributor

openshift-ci bot commented Apr 16, 2026

@bitoku: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-vsphere-ovn-upi 0418318 link true /test e2e-vsphere-ovn-upi
ci/prow/e2e-gcp-ovn 0418318 link true /test e2e-gcp-ovn

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-trt

openshift-trt bot commented Apr 16, 2026

Risk analysis has seen new tests most likely introduced by this PR.
Please ensure that new tests meet guidelines for naming and stability.

New Test Risks for sha: 0418318

Job Name New Test Risk
pull-ci-openshift-origin-main-e2e-aws-ovn-fips High - "[sig-node][Late] CRI-O should report goroutine stacks on all nodes [Suite:openshift/conformance/parallel]" is a new test that failed 1 time(s) against the current commit
pull-ci-openshift-origin-main-e2e-aws-ovn-microshift High - "[sig-node][Late] CRI-O should report goroutine stacks on all nodes [Suite:openshift/conformance/parallel]" is a new test that failed 1 time(s) against the current commit
pull-ci-openshift-origin-main-e2e-aws-ovn-microshift-serial High - "[sig-node][Late] CRI-O should report goroutine stacks on all nodes [Suite:openshift/conformance/parallel]" is a new test that failed 1 time(s) against the current commit
pull-ci-openshift-origin-main-e2e-aws-ovn-serial-1of2 High - "[sig-node][Late] CRI-O should report goroutine stacks on all nodes [Suite:openshift/conformance/parallel]" is a new test that failed 1 time(s) against the current commit
pull-ci-openshift-origin-main-e2e-aws-ovn-serial-2of2 High - "[sig-node][Late] CRI-O should report goroutine stacks on all nodes [Suite:openshift/conformance/parallel]" is a new test that failed 1 time(s) against the current commit
pull-ci-openshift-origin-main-e2e-gcp-ovn High - "[sig-node][Late] CRI-O should report goroutine stacks on all nodes [Suite:openshift/conformance/parallel]" is a new test that failed 1 time(s) against the current commit
pull-ci-openshift-origin-main-e2e-metal-ipi-ovn-ipv6 High - "[sig-node][Late] CRI-O should report goroutine stacks on all nodes [Suite:openshift/conformance/parallel]" is a new test that failed 1 time(s) against the current commit
pull-ci-openshift-origin-main-e2e-vsphere-ovn High - "[sig-node][Late] CRI-O should report goroutine stacks on all nodes [Suite:openshift/conformance/parallel]" is a new test that failed 1 time(s) against the current commit
pull-ci-openshift-origin-main-e2e-vsphere-ovn-upi High - "[sig-node][Late] CRI-O should report goroutine stacks on all nodes [Suite:openshift/conformance/parallel]" is a new test that failed 1 time(s) against the current commit

New tests seen in this PR at sha: 0418318

  • "[sig-node][Late] CRI-O should report goroutine stacks on all nodes [Suite:openshift/conformance/parallel]" [Total: 9, Pass: 0, Fail: 9, Flake: 0]

Signed-off-by: Ayato Tokubi <atokubi@redhat.com>
@bitoku bitoku force-pushed the crio-goroutine-dump-late-test branch from 0418318 to f1853da on April 16, 2026 08:37

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (2)
test/extended/node/crio_goroutinedump.go (2)

178-182: ⚠️ Potential issue | 🟠 Major

Do not force-fail healthy runs just to emit dumps.

Line 180 makes the spec fail whenever dumps exist, so a healthy run can never pass. Emit dumps to GinkgoWriter and keep failure paths tied to real faults (failedNodes / stuckPulls).

Suggested change
-			// Always fail so goroutine dumps are visible in test results.
-			// The test is marked Informing so this won't block merges.
-			if len(dumps) > 0 {
-				g.Fail(fmt.Sprintf("CRI-O goroutine dumps collected:\n%s", strings.Join(dumps, "\n\n")))
-			}
+			// Emit dumps for visibility without forcing failure on success paths.
+			if len(dumps) > 0 {
+				g.By("CRI-O goroutine dumps collected")
+				_, _ = fmt.Fprintf(g.GinkgoWriter, "%s\n", strings.Join(dumps, "\n\n"))
+			}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/extended/node/crio_goroutinedump.go` around lines 178 - 182, The test
currently calls g.Fail when any CRI-O goroutine dumps exist (variable dumps),
forcing healthy runs to fail; change this so dumps are written to GinkgoWriter
instead and only trigger g.Fail when real failure conditions remain (e.g.,
failedNodes or stuckPulls). Locate the block using dumps and g.Fail in
crio_goroutinedump.go, replace the unconditional g.Fail(fmt.Sprintf(...)) with
code that prints the joined dumps to GinkgoWriter (with context header) and
leaves existing failure logic that checks failedNodes and stuckPulls intact so
only those conditions cause test failures.

158-160: ⚠️ Potential issue | 🟠 Major

Avoid failing mid-loop; aggregate malformed dump nodes and fail once.

Line 158 aborts on the first bad dump format, so you lose complete per-node diagnostics. Append this as a node failure and continue, then assert once after the loop.

Suggested change
-				o.Expect(goroutineHeaderRe.MatchString(r.output)).To(o.BeTrue(),
-					"expected goroutine stacks in CRI-O dump from node %s, output length=%d, got:\n%s", r.name, len(r.output), r.output)
+				if !goroutineHeaderRe.MatchString(r.output) {
+					failedNodes = append(failedNodes,
+						fmt.Sprintf("%s: expected goroutine stacks in CRI-O dump, output length=%d, got:\n%s", r.name, len(r.output), r.output))
+					continue
+				}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/extended/node/crio_goroutinedump.go` around lines 158 - 160, The test
currently aborts on the first malformed per-node CRI-O dump by calling
o.Expect(goroutineHeaderRe.MatchString(r.output)) inside the loop; instead,
collect failures and assert once after the loop. Create a slice (e.g.,
malformedNodes []string) before iterating, replace the inline o.Expect check in
the loop with an if !goroutineHeaderRe.MatchString(r.output) { malformedNodes =
append(malformedNodes, fmt.Sprintf("node %s: output length=%d\n%s", r.name,
len(r.output), r.output)); continue } and after the loop assert that
malformedNodes is empty (e.g., o.Expect(len(malformedNodes)).To(o.Equal(0),
"malformed CRI-O dumps:\n%s", strings.Join(malformedNodes, "\n"))), so you
accumulate diagnostics for all nodes rather than failing mid-loop.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/extended/node/crio_goroutinedump.go`:
- Around line 115-130: The test spawns a goroutine per node concurrently which
can overload the cluster/APIs; limit concurrent debug sessions by adding a
concurrency limiter (e.g., a buffered semaphore channel or worker pool) around
the loop so only a fixed number (configurable, e.g., 5) of goroutines run at
once; acquire the semaphore before launching the goroutine that calls
exutil.DebugNodeRetryWithOptionsAndChroot and release it (close/receive) in the
goroutine's defer along with wg.Done(), ensuring results[i] still gets set and
g.GinkgoRecover() remains called; reference the existing loop variables (nodes,
i, node.Name), wg, results, nodeResult, shellCmd and
exutil.DebugNodeRetryWithOptionsAndChroot when implementing the limiter.

---

Duplicate comments:
In `@test/extended/node/crio_goroutinedump.go`:
- Around line 178-182: The test currently calls g.Fail when any CRI-O goroutine
dumps exist (variable dumps), forcing healthy runs to fail; change this so dumps
are written to GinkgoWriter instead and only trigger g.Fail when real failure
conditions remain (e.g., failedNodes or stuckPulls). Locate the block using
dumps and g.Fail in crio_goroutinedump.go, replace the unconditional
g.Fail(fmt.Sprintf(...)) with code that prints the joined dumps to GinkgoWriter
(with context header) and leaves existing failure logic that checks failedNodes
and stuckPulls intact so only those conditions cause test failures.
- Around line 158-160: The test currently aborts on the first malformed per-node
CRI-O dump by calling o.Expect(goroutineHeaderRe.MatchString(r.output)) inside
the loop; instead, collect failures and assert once after the loop. Create a
slice (e.g., malformedNodes []string) before iterating, replace the inline
o.Expect check in the loop with an if !goroutineHeaderRe.MatchString(r.output) {
malformedNodes = append(malformedNodes, fmt.Sprintf("node %s: output
length=%d\n%s", r.name, len(r.output), r.output)); continue } and after the loop
assert that malformedNodes is empty (e.g.,
o.Expect(len(malformedNodes)).To(o.Equal(0), "malformed CRI-O dumps:\n%s",
strings.Join(malformedNodes, "\n"))), so you accumulate diagnostics for all
nodes rather than failing mid-loop.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 6a35fe37-1646-4179-ab57-d6e15aa73f26

📥 Commits

Reviewing files that changed from the base of the PR and between 0418318 and f1853da.

📒 Files selected for processing (1)
  • test/extended/node/crio_goroutinedump.go

Comment on lines +115 to +130
for i, node := range nodes {
wg.Add(1)
go func(i int, nodeName string) {
defer g.GinkgoRecover()
defer wg.Done()

g.By(fmt.Sprintf("Sending SIGUSR1 to CRI-O on node %s", nodeName))

output, err := exutil.DebugNodeRetryWithOptionsAndChroot(
oc, nodeName, "default",
"sh", "-c", shellCmd,
)
results[i] = nodeResult{name: nodeName, output: output, err: err}
}(i, node.Name)
}
wg.Wait()


⚠️ Potential issue | 🟠 Major

Bound parallel debug sessions to avoid cluster/API overload.

Line 115 currently spawns one debug flow per node at once. On larger clusters this can create a burst of debug pods and induce throttling/timeouts, making this test flaky.

Suggested change
 			results := make([]nodeResult, len(nodes))
 			var wg sync.WaitGroup
+			sem := make(chan struct{}, 10) // tune as needed
 			for i, node := range nodes {
 				wg.Add(1)
 				go func(i int, nodeName string) {
 					defer g.GinkgoRecover()
 					defer wg.Done()
+					sem <- struct{}{}
+					defer func() { <-sem }()
 
 					g.By(fmt.Sprintf("Sending SIGUSR1 to CRI-O on node %s", nodeName))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/extended/node/crio_goroutinedump.go` around lines 115 - 130, The test
spawns a goroutine per node concurrently which can overload the cluster/APIs;
limit concurrent debug sessions by adding a concurrency limiter (e.g., a
buffered semaphore channel or worker pool) around the loop so only a fixed
number (configurable, e.g., 5) of goroutines run at once; acquire the semaphore
before launching the goroutine that calls
exutil.DebugNodeRetryWithOptionsAndChroot and release it (close/receive) in the
goroutine's defer along with wg.Done(), ensuring results[i] still gets set and
g.GinkgoRecover() remains called; reference the existing loop variables (nodes,
i, node.Name), wg, results, nodeResult, shellCmd and
exutil.DebugNodeRetryWithOptionsAndChroot when implementing the limiter.
