WIP: OCPBUGS-79544: Add [Late] test to collect CRI-O goroutine dumps via SIGUSR1 #31013
bitoku wants to merge 1 commit into openshift:main
Pipeline controller notification: this repository is configured in automatic mode.
@bitoku: This pull request references Jira Issue OCPBUGS-79544, which is invalid. The bug has been updated to refer to the pull request using the external bug tracker.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: bitoku. The full list of commands accepted by this bot can be found here. Needs approval from an approver in each of these files.
Walkthrough

Adds a new Ginkgo node test and helper that trigger CRI-O to write goroutine dumps on each cluster node, collect and parse those dumps for long IO-wait goroutines related to image pulls, and fail if any node has missing dumps or stuck image-pull goroutines.
Sequence Diagram

```mermaid
sequenceDiagram
    participant TestRunner as Test Runner
    participant Node as Cluster Node
    participant CRIO as CRI-O Process
    participant Parser as Dump Parser
    TestRunner->>Node: run shell to find CRI-O PID
    Node-->>TestRunner: PID or CRIO_NOT_FOUND
    TestRunner->>CRIO: send SIGUSR1 to PID
    activate CRIO
    CRIO->>CRIO: write /tmp/crio-goroutine-stacks-*.log
    CRIO-->>TestRunner: dump file created (or timeout)
    deactivate CRIO
    TestRunner->>Node: fetch newest dump file
    Node-->>TestRunner: goroutine dump content or DUMP_TIMEOUT
    TestRunner->>Parser: call findStuckImagePulls(dump)
    Parser->>Parser: split into goroutine blocks
    Parser->>Parser: extract IO wait minutes, filter > 30m
    Parser->>Parser: match stack pattern (net.(*conn).Read -> image/v5/docker.(*bodyReader).Read)
    Parser-->>TestRunner: list of stuck goroutines
    alt stuck goroutines found
        TestRunner->>TestRunner: fail test and emit dumps
    else no stuck goroutines
        TestRunner->>TestRunner: pass
    end
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Pre-merge checks: ✅ 9 passed | ❌ 1 failed (warning)
Warning: there were issues while running some tools. golangci-lint (2.11.4) failed with: can't load config: unsupported version of the configuration: "". See https://golangci-lint.run/docs/product/migration-guide for migration instructions.
/test
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@test/extended/node/crio_goroutinedump.go`:
- Around line 84-88: The shellCmd currently grabs the newest
/tmp/crio-goroutine-stacks-*.log which can be a stale file or empty if none
exists; update the logic in the shellCmd string in crio_goroutinedump.go to
record the list (or latest mtime) of existing matching files before sending kill
-USR1, then after signalling poll/wait (with timeout) for a new file whose mtime
is greater than the pre-signal timestamp (or not in the pre-signal list), and
only then cat that new file; ensure the command returns a non-zero exit and
clear error message if no new dump appears within the timeout to avoid silent
bad reads.
- Around line 90-124: The loop currently treats missing CRI-O dumps as warnings
and continues, allowing partial node coverage to pass; modify the logic in the
loop that calls exutil.DebugNodeRetryWithOptionsAndChroot (the block handling
"CRIO_NOT_FOUND" and errors) to instead record failures (e.g., collect node
names into a missingDumps slice) and after the loop assert that missingDumps is
empty (failing the test with a clear message), or alternatively pre-filter the
nodes slice to only include nodes expected to run CRI-O before invoking
exutil.DebugNodeRetryWithOptionsAndChroot; update references to nodes, dumps,
stuckPulls, goroutineHeaderRe and findStuckImagePulls accordingly so
missing/filtered nodes are handled deterministically and the test fails if any
expected node did not produce a dump.
📒 Files selected for processing (1)
test/extended/node/crio_goroutinedump.go
Force-pushed: 8c04a45 to 1cc9cb0 (Compare)
Scheduling required tests.
Risk analysis has seen new tests most likely introduced by this PR.
New Test Risks for sha: 1cc9cb0
New tests seen in this PR at sha: 1cc9cb0
Force-pushed: 1cc9cb0 to 0418318 (Compare)
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@test/extended/node/crio_goroutinedump.go`:
- Around line 86-93: The MARKER temporary file created by MARKER=$(mktemp) can
leak if the script exits early (e.g., kill fails); add a shell trap immediately
after creating MARKER to ensure it's removed on all exit paths (trap 'rm -f
"$MARKER"' EXIT) and optionally on INT and TERM, then proceed with the existing
kill -USR1 $CRIO_PID and wait loop; update the block containing MARKER, kill
-USR1 $CRIO_PID and the find/seq loop to rely on that trap so the marker is
always cleaned up even on early failure.
- Around line 143-147: The current block unconditionally calls g.Fail when dumps
are present which causes successful runs to fail; modify the CRI-O goroutine
dump handling in the section that references dumps and g.Fail so that dumps are
written to GinkgoWriter (or test logs/artifacts) instead of failing the test,
and only call g.Fail if there are actual fault conditions such as failedNodes or
stuckPulls being non-empty; update the fmt.Sprintf/g.Fail call to use
GinkgoWriter.Printf (or similar) to emit the joined dumps and add a conditional
that preserves failing behavior only when failedNodes or stuckPulls indicate a
real error.
- Around line 123-125: The test currently uses o.Expect(...).To(...) inside the
node loop which aborts the spec on the first malformed CRI-O dump; instead,
introduce a failedNodes collection (e.g., []string failedNodes) declared before
iterating nodes, replace the immediate
o.Expect(goroutineHeaderRe.MatchString(output)).To(...) check inside the loop
with an if !goroutineHeaderRe.MatchString(output) branch that appends a
descriptive failure message (including node.Name and output) to failedNodes and
continues, and after the node loop add a single assertion (or
o.Expect(len(failedNodes)).To(o.Equal(0))) that fails the spec if failedNodes is
non-empty and includes the aggregated failedNodes report in the failure message.
📒 Files selected for processing (1)
test/extended/node/crio_goroutinedump.go
```shell
MARKER=$(mktemp) && \
kill -USR1 $CRIO_PID && \
for i in $(seq 1 10); do \
  NEW=$(find /tmp -maxdepth 1 -name 'crio-goroutine-stacks-*.log' -newer "$MARKER" 2>/dev/null | head -1) && \
  if [ -n "$NEW" ]; then rm -f "$MARKER"; cat "$NEW"; exit 0; fi; \
  sleep 1; \
done && \
rm -f "$MARKER" && echo "DUMP_TIMEOUT" && exit 1
```
Ensure marker cleanup on every exit path.
MARKER is removed in success/timeout branches, but if the command exits early after mktemp (e.g., kill failure), it can leak files in /tmp. Add a shell trap right after marker creation.
Suggested fix

```diff
-MARKER=$(mktemp) && \
-kill -USR1 $CRIO_PID && \
+MARKER=$(mktemp) && \
+trap 'rm -f "$MARKER"' EXIT && \
+kill -USR1 $CRIO_PID && \
 for i in $(seq 1 10); do \
   NEW=$(find /tmp -maxdepth 1 -name 'crio-goroutine-stacks-*.log' -newer "$MARKER" 2>/dev/null | head -1) && \
-  if [ -n "$NEW" ]; then rm -f "$MARKER"; cat "$NEW"; exit 0; fi; \
+  if [ -n "$NEW" ]; then cat "$NEW"; exit 0; fi; \
   sleep 1; \
 done && \
-rm -f "$MARKER" && echo "DUMP_TIMEOUT" && exit 1
+echo "DUMP_TIMEOUT" && exit 1
```
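The trap-based cleanup can be exercised outside the cluster. In this hedged sketch the `touch` stands in for CRI-O writing its dump after SIGUSR1, and the demo filename is illustrative; the marker and polling logic mirror the suggested fix.

```shell
# Marker + trap pattern: the marker is removed on every exit path,
# and we only read files strictly newer than it.
MARKER=$(mktemp)
trap 'rm -f "$MARKER"' EXIT
sleep 1                                  # make the simulated dump strictly newer
DUMP="/tmp/demo-goroutine-stacks-$$.log" # stand-in for the real CRI-O dump
touch "$DUMP"
for i in $(seq 1 5); do
  NEW=$(find /tmp -maxdepth 1 -name "demo-goroutine-stacks-$$.log" -newer "$MARKER" 2>/dev/null | head -1)
  if [ -n "$NEW" ]; then rm -f "$DUMP"; echo "FOUND"; exit 0; fi
  sleep 1
done
echo "DUMP_TIMEOUT"
exit 1
```

Because the trap fires on EXIT, the marker is cleaned up whether the loop finds the file, times out, or the script aborts early.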
```go
o.Expect(goroutineHeaderRe.MatchString(output)).To(o.BeTrue(),
	"expected goroutine stacks in CRI-O dump from node %s, got:\n%s", node.Name, output)
```
Avoid hard-failing inside the node loop.
Line 123 aborts the spec on the first malformed output, so later nodes are never checked. Record this as a node failure and continue, then fail once with the aggregated failedNodes report.
Suggested fix

```diff
-	o.Expect(goroutineHeaderRe.MatchString(output)).To(o.BeTrue(),
-		"expected goroutine stacks in CRI-O dump from node %s, got:\n%s", node.Name, output)
+	if !goroutineHeaderRe.MatchString(output) {
+		failedNodes = append(failedNodes,
+			fmt.Sprintf("%s: expected goroutine stacks in CRI-O dump, got:\n%s", node.Name, output))
+		continue
+	}
```
```go
// Always fail so goroutine dumps are visible in test results.
// The test is marked Informing so this won't block merges.
if len(dumps) > 0 {
	g.Fail(fmt.Sprintf("CRI-O goroutine dumps collected:\n%s", strings.Join(dumps, "\n\n")))
}
```
This makes the test always fail on success paths.
Line 146 forces failure whenever dumps are collected, so a healthy run can never pass. Prefer writing dumps to GinkgoWriter (or logs/artifacts) and reserve failure for actual fault conditions (failedNodes / stuckPulls).
Suggested fix

```diff
-	// Always fail so goroutine dumps are visible in test results.
-	// The test is marked Informing so this won't block merges.
-	if len(dumps) > 0 {
-		g.Fail(fmt.Sprintf("CRI-O goroutine dumps collected:\n%s", strings.Join(dumps, "\n\n")))
-	}
+	// Emit dumps for visibility without forcing spec failure.
+	if len(dumps) > 0 {
+		g.By("CRI-O goroutine dumps collected")
+		_, _ = fmt.Fprintf(g.GinkgoWriter, "%s\n", strings.Join(dumps, "\n\n"))
+	}
```
Scheduling required tests.
@bitoku: The following tests failed.
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
Risk analysis has seen new tests most likely introduced by this PR.
New Test Risks for sha: 0418318
New tests seen in this PR at sha: 0418318
Signed-off-by: Ayato Tokubi <atokubi@redhat.com>
Force-pushed: 0418318 to f1853da (Compare)
Actionable comments posted: 1
♻️ Duplicate comments (2)

test/extended/node/crio_goroutinedump.go

178-182: ⚠️ Potential issue | 🟠 Major — Do not force-fail healthy runs just to emit dumps.

Line 180 makes the spec fail whenever dumps exist, so a healthy run can never pass. Emit dumps to GinkgoWriter and keep failure paths tied to real faults (failedNodes / stuckPulls).

Suggested change

```diff
-	// Always fail so goroutine dumps are visible in test results.
-	// The test is marked Informing so this won't block merges.
-	if len(dumps) > 0 {
-		g.Fail(fmt.Sprintf("CRI-O goroutine dumps collected:\n%s", strings.Join(dumps, "\n\n")))
-	}
+	// Emit dumps for visibility without forcing failure on success paths.
+	if len(dumps) > 0 {
+		g.By("CRI-O goroutine dumps collected")
+		_, _ = fmt.Fprintf(g.GinkgoWriter, "%s\n", strings.Join(dumps, "\n\n"))
+	}
```
158-160: ⚠️ Potential issue | 🟠 Major — Avoid failing mid-loop; aggregate malformed dump nodes and fail once.

Line 158 aborts on the first bad dump format, so you lose complete per-node diagnostics. Append this as a node failure and continue, then assert once after the loop.

Suggested change

```diff
-	o.Expect(goroutineHeaderRe.MatchString(r.output)).To(o.BeTrue(),
-		"expected goroutine stacks in CRI-O dump from node %s, output length=%d, got:\n%s", r.name, len(r.output), r.output)
+	if !goroutineHeaderRe.MatchString(r.output) {
+		failedNodes = append(failedNodes,
+			fmt.Sprintf("%s: expected goroutine stacks in CRI-O dump, output length=%d, got:\n%s", r.name, len(r.output), r.output))
+		continue
+	}
```
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@test/extended/node/crio_goroutinedump.go`:
- Around line 115-130: The test spawns a goroutine per node concurrently which
can overload the cluster/APIs; limit concurrent debug sessions by adding a
concurrency limiter (e.g., a buffered semaphore channel or worker pool) around
the loop so only a fixed number (configurable, e.g., 5) of goroutines run at
once; acquire the semaphore before launching the goroutine that calls
exutil.DebugNodeRetryWithOptionsAndChroot and release it (close/receive) in the
goroutine's defer along with wg.Done(), ensuring results[i] still gets set and
g.GinkgoRecover() remains called; reference the existing loop variables (nodes,
i, node.Name), wg, results, nodeResult, shellCmd and
exutil.DebugNodeRetryWithOptionsAndChroot when implementing the limiter.
---
Duplicate comments:
In `@test/extended/node/crio_goroutinedump.go`:
- Around line 178-182: The test currently calls g.Fail when any CRI-O goroutine
dumps exist (variable dumps), forcing healthy runs to fail; change this so dumps
are written to GinkgoWriter instead and only trigger g.Fail when real failure
conditions remain (e.g., failedNodes or stuckPulls). Locate the block using
dumps and g.Fail in crio_goroutinedump.go, replace the unconditional
g.Fail(fmt.Sprintf(...)) with code that prints the joined dumps to GinkgoWriter
(with context header) and leaves existing failure logic that checks failedNodes
and stuckPulls intact so only those conditions cause test failures.
- Around line 158-160: The test currently aborts on the first malformed per-node
CRI-O dump by calling o.Expect(goroutineHeaderRe.MatchString(r.output)) inside
the loop; instead, collect failures and assert once after the loop. Create a
slice (e.g., malformedNodes []string) before iterating, replace the inline
o.Expect check in the loop with an if !goroutineHeaderRe.MatchString(r.output) {
malformedNodes = append(malformedNodes, fmt.Sprintf("node %s: output
length=%d\n%s", r.name, len(r.output), r.output)); continue } and after the loop
assert that malformedNodes is empty (e.g.,
o.Expect(len(malformedNodes)).To(o.Equal(0), "malformed CRI-O dumps:\n%s",
strings.Join(malformedNodes, "\n"))), so you accumulate diagnostics for all
nodes rather than failing mid-loop.
📒 Files selected for processing (1)
test/extended/node/crio_goroutinedump.go
```go
for i, node := range nodes {
	wg.Add(1)
	go func(i int, nodeName string) {
		defer g.GinkgoRecover()
		defer wg.Done()

		g.By(fmt.Sprintf("Sending SIGUSR1 to CRI-O on node %s", nodeName))

		output, err := exutil.DebugNodeRetryWithOptionsAndChroot(
			oc, nodeName, "default",
			"sh", "-c", shellCmd,
		)
		results[i] = nodeResult{name: nodeName, output: output, err: err}
	}(i, node.Name)
}
wg.Wait()
```
Bound parallel debug sessions to avoid cluster/API overload.
Line 115 currently spawns one debug flow per node at once. On larger clusters this can create a burst of debug pods and induce throttling/timeouts, making this test flaky.
Suggested change

```diff
 	results := make([]nodeResult, len(nodes))
 	var wg sync.WaitGroup
+	sem := make(chan struct{}, 10) // tune as needed
 	for i, node := range nodes {
 		wg.Add(1)
 		go func(i int, nodeName string) {
 			defer g.GinkgoRecover()
 			defer wg.Done()
+			sem <- struct{}{}
+			defer func() { <-sem }()
 			g.By(fmt.Sprintf("Sending SIGUSR1 to CRI-O on node %s", nodeName))
```
This test runs after all other tests to capture CRI-O's goroutine state on every node, providing visibility into stuck operations after the suite has exercised CRI-O under load.
The test sends SIGUSR1 to CRI-O on each node, reads the resulting goroutine dump from /tmp/crio-goroutine-stacks-*.log, and fails hard if any goroutine is stuck in an image pull (IO wait > 30 min in docker.(*bodyReader).Read -> net.(*conn).Read).