Skip to content

fix(test): eliminate race in TestCheckIdleTaskList/Active_adding_task#7891

Open
officialasishkumar wants to merge 3 commits intocadence-workflow:masterfrom
officialasishkumar:fix/flaky-idle-tasklist-check-test
Open

fix(test): eliminate race in TestCheckIdleTaskList/Active_adding_task#7891
officialasishkumar wants to merge 3 commits intocadence-workflow:masterfrom
officialasishkumar:fix/flaky-idle-tasklist-check-test

Conversation

@officialasishkumar
Copy link
Copy Markdown
Contributor

What changed?

Increased the idle-check interval in TestCheckIdleTaskList from 10ms to 100ms and scaled all sleep durations proportionally. Closes #7867.

Why?

The test used a 10ms IdleTasklistCheckInterval with an 8ms sleep before calling AddTask. The liveness checker fires at TTL/2 = 5ms intervals, so the second check at ~10ms raced against AddTask.MarkAlive() with only a ~2ms window. Under CI load, MarkAlive() often lost the race, causing the task list to shut down before the task could be added, failing with task list shutting down.

The timing analysis:

  • Before: 10ms TTL, liveness checks at 5/10/15ms. Sleep 8ms then call AddTask which must reach MarkAlive() within ~2ms before the 10ms check fires. Under heavy CI load, this 2ms window is insufficient.
  • After: 100ms TTL, liveness checks at 50/100/150ms. Sleep 80ms then call AddTask which now has ~20ms before the 100ms check fires — 10× the previous margin.

The test logic, coverage ratios, and assertions remain identical; only the absolute timing values have changed.

How did you test it?

go test -v -run TestCheckIdleTaskList -count=5 ./service/matching/tasklist/...
go test -race -v -run TestCheckIdleTaskList -count=3 ./service/matching/tasklist/...

All subtests (Idle_task-list, Active_poll-er, Active_adding_task) passed on all 5 iterations without the race detector and 3 iterations with it. Previously, the Active_adding_task subtest would intermittently fail.

Potential risks

N/A — test-only change. The production code and idle-check behavior are unmodified. The increased sleep durations add ~0.3s per subtest, which is negligible for CI.

Release notes

N/A — test-only fix.

Documentation Changes

N/A

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses a flaky timing-sensitive test in service/matching/tasklist by increasing the configured idle-check interval and scaling related sleeps to provide more scheduling margin under CI load (closing #7867).

Changes:

  • Increased IdleTasklistCheckInterval in TestCheckIdleTaskList from 10ms to 100ms.
  • Scaled all sleeps in the affected subtests proportionally to preserve the same relative timing/logic.
  • Added an inline comment explaining the timing/race rationale for the change.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

require.EqualValues(t, 0, atomic.LoadInt32(&tlm.stopped), "idle check interval had not passed yet")
time.Sleep(20 * time.Millisecond)
time.Sleep(200 * time.Millisecond)
require.EqualValues(t, 1, atomic.LoadInt32(&tlm.stopped), "idle check interval should have pass")
Copy link

Copilot AI Apr 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assertion message grammar: "idle check interval should have pass" should be "idle check interval should have passed" for clarity when the test fails.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, fixed all three occurrences.


time.Sleep(20 * time.Millisecond)
time.Sleep(200 * time.Millisecond)
require.EqualValues(t, 1, atomic.LoadInt32(&tlm.stopped), "idle check interval should have pass")
Copy link

Copilot AI Apr 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assertion message grammar: "idle check interval should have pass" should be "idle check interval should have passed" for clarity when the test fails.

Copilot uses AI. Check for mistakes.

time.Sleep(20 * time.Millisecond)
time.Sleep(200 * time.Millisecond)
require.EqualValues(t, 1, atomic.LoadInt32(&tlm.stopped), "idle check interval should have pass")
Copy link

Copilot AI Apr 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assertion message grammar: "idle check interval should have pass" should be "idle check interval should have passed" for clarity when the test fails.

Suggested change
require.EqualValues(t, 1, atomic.LoadInt32(&tlm.stopped), "idle check interval should have pass")
require.EqualValues(t, 1, atomic.LoadInt32(&tlm.stopped), "idle check interval should have passed")

Copilot uses AI. Check for mistakes.

t.Run("Idle task-list", func(t *testing.T) {
ctrl := gomock.NewController(t)
tlm := createTestTaskListManagerWithConfig(t, testlogger.New(t), ctrl, cfg, clock.NewRealTimeSource())
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use MockedTimeSource and make the tests deterministic?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Replaced all three subtests with MockedTimeSource and mockClock.Advance() instead of real-time sleeps. The liveness shutdown is now verified with require.Eventually/require.Never, so the tests are fully deterministic.

The test used a 10ms idle-check interval with an 8ms sleep before
calling AddTask. The liveness checker fires at TTL/2 = 5ms intervals,
so the second check at ~10ms raced against AddTask.MarkAlive() with
only a ~2ms window. Under CI load, MarkAlive() often lost the race,
causing the task list to shut down before the task could be added.

Widen the idle-check interval to 100ms and scale all sleep durations
proportionally (80ms pre-AddTask, 60ms post-AddTask, 200ms final
wait). This gives a 20ms window for MarkAlive() — 10× the previous
margin — while preserving the same test logic and coverage ratios.

Closes cadence-workflow#7867

Signed-off-by: Asish Kumar <officialasishkumar@gmail.com>
Fixed "should have pass" to "should have passed" in three assertion
messages in TestCheckIdleTaskList.

Signed-off-by: Asish Kumar <officialasishkumar@gmail.com>
Replace clock.NewRealTimeSource() and time.Sleep-based assertions with
clock.MockedTimeSource and Advance() calls so the liveness idle-check
tests are fully deterministic and not subject to scheduling jitter.

Signed-off-by: Asish Kumar <officialasishkumar@gmail.com>
@officialasishkumar officialasishkumar force-pushed the fix/flaky-idle-tasklist-check-test branch from 9cc111f to 9f29e33 Compare April 8, 2026 16:31
@gitar-bot
Copy link
Copy Markdown

gitar-bot bot commented Apr 8, 2026

CI failed: Integration test suite encountered timing issues with background goroutines logging after test teardown, likely exacerbated by time-source changes in the test fix.

Overview

Analyzed 1 integration test log from job 70462425304. The test suite failed with FAIL status accompanied by repeated "COULD FAIL TEST" warnings from the global ratelimiter's background update loop. These warnings indicate that async background tasks continued logging after test completion, signaling a race condition or improper shutdown sequencing during test cleanup.

Failures

Async Task Logger Timing Race (confidence: medium)

  • Type: flaky_test
  • Affected jobs: 70462425304
  • Related to change: unclear
  • Root cause: The global ratelimiter's backgroundUpdateLoop is logging debug messages after test teardown has begun. The test's fallback logger lifecycle safeguard detects these late logs and warns "COULD FAIL TEST". This occurs across multiple test suites (TestClientIntegrationSuite, TestDecisionTimeoutMaxAttemptsIntegrationSuite, TestSizeLimitIntegrationSuite, etc.) and multiple rate limit collections (workerTaskList, asyncTaskList, userTaskList).
  • Suggested fix: Ensure all background goroutines in collection.go are properly stopped before test cleanup completes. The PR introduces MockedTimeSource for determinism—verify that time advancement in tests allows background timers to complete cleanly. Add explicit context cancellation or wait groups in test teardown to guarantee background loops exit before the test framework validates logger lifecycle.

Summary

  • Change-related failures: 1 flaky test failure (likely related; PR modifies time-based test behavior)
  • Infrastructure/flaky failures: Timing-sensitive background task shutdown during test teardown
  • Recommended action: Review test cleanup sequencing in task_list_manager_test.go and related integration tests. Ensure MockedTimeSource integration properly advances time so background loops can exit gracefully. Add explicit synchronization points to wait for backgroundUpdateLoop completion before test finishes.
Code Review ✅ Approved

Eliminates race condition in TestCheckIdleTaskList/Active_adding_task test. No issues found.

Rules ✅ All requirements met

Repository Rules

GitHub Issue Linking Requirement: The PR description contains 'Closes #7867', a valid issue link in the cadence-workflow organization that satisfies the issue linking requirement despite exceeding 50 lines changed and having a 'fix' prefix.
PR Description Quality Standards: The PR description meets all quality standards: it provides clear WHAT changed, detailed WHY explanation with timing analysis and root cause, specific copyable test commands for HOW, appropriate N/A marks for risks/release notes/documentation, and links the issue number as required by template guidance.

1 rule not applicable. Show all rules by commenting gitar display:verbose.

Tip

Comment Gitar fix CI or enable auto-apply: gitar auto-apply:on

Options

Auto-apply is off → Gitar will not commit updates to this branch.
Display: compact → Showing less information.

Comment with these commands to change:

Auto-apply Compact
gitar auto-apply:on         
gitar display:verbose         

Was this helpful? React with 👍 / 👎 | Gitar

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Flaky test: TestCheckIdleTaskList/Active_adding_task in service/matching/tasklist

5 participants