fix(test): eliminate race in TestCheckIdleTaskList/Active_adding_task by officialasishkumar · Pull Request #7891 · cadence-workflow/cadence

officialasishkumar · 2026-04-04T09:27:22Z

What changed?

Increased the idle-check interval in TestCheckIdleTaskList from 10ms to 100ms and scaled all sleep durations proportionally. Closes #7867.

Why?

The test used a 10ms IdleTasklistCheckInterval with an 8ms sleep before calling AddTask. The liveness checker fires at TTL/2 = 5ms intervals, so the second check at ~10ms raced against AddTask.MarkAlive() with only a ~2ms window. Under CI load, MarkAlive() often lost the race, causing the task list to shut down before the task could be added, failing with task list shutting down.

The timing analysis:

Before: 10ms TTL, liveness checks at 5/10/15ms. Sleep 8ms then call AddTask which must reach MarkAlive() within ~2ms before the 10ms check fires. Under heavy CI load, this 2ms window is insufficient.
After: 100ms TTL, liveness checks at 50/100/150ms. Sleep 80ms then call AddTask which now has ~20ms before the 100ms check fires — 10× the previous margin.

The test logic, coverage ratios, and assertions remain identical; only the absolute timing values have changed.

How did you test it?

go test -v -run TestCheckIdleTaskList -count=5 ./service/matching/tasklist/...
go test -race -v -run TestCheckIdleTaskList -count=3 ./service/matching/tasklist/...

All subtests (Idle_task-list, Active_poll-er, Active_adding_task) passed on all 5 iterations without the race detector and 3 iterations with it. Previously, the Active_adding_task subtest would intermittently fail.

Potential risks

N/A — test-only change. The production code and idle-check behavior are unmodified. The increased sleep durations add ~0.3s per subtest, which is negligible for CI.

Release notes

N/A — test-only fix.

Documentation Changes

N/A

Copilot

Pull request overview

This PR addresses a flaky timing-sensitive test in service/matching/tasklist by increasing the configured idle-check interval and scaling related sleeps to provide more scheduling margin under CI load (closing #7867).

Changes:

Increased IdleTasklistCheckInterval in TestCheckIdleTaskList from 10ms to 100ms.
Scaled all sleeps in the affected subtests proportionally to preserve the same relative timing/logic.
Added an inline comment explaining the timing/race rationale for the change.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-04-04T09:29:14Z

 		require.EqualValues(t, 0, atomic.LoadInt32(&tlm.stopped), "idle check interval had not passed yet")
-		time.Sleep(20 * time.Millisecond)
+		time.Sleep(200 * time.Millisecond)
 		require.EqualValues(t, 1, atomic.LoadInt32(&tlm.stopped), "idle check interval should have pass")


Assertion message grammar: "idle check interval should have pass" should be "idle check interval should have passed" for clarity when the test fails.

Good catch, fixed all three occurrences.

Copilot · 2026-04-04T09:29:14Z


-		time.Sleep(20 * time.Millisecond)
+		time.Sleep(200 * time.Millisecond)
 		require.EqualValues(t, 1, atomic.LoadInt32(&tlm.stopped), "idle check interval should have pass")


Assertion message grammar: "idle check interval should have pass" should be "idle check interval should have passed" for clarity when the test fails.

Copilot · 2026-04-04T09:29:15Z


-		time.Sleep(20 * time.Millisecond)
+		time.Sleep(200 * time.Millisecond)
 		require.EqualValues(t, 1, atomic.LoadInt32(&tlm.stopped), "idle check interval should have pass")


Assertion message grammar: "idle check interval should have pass" should be "idle check interval should have passed" for clarity when the test fails.

Suggested change

require.EqualValues(t, 1, atomic.LoadInt32(&tlm.stopped), "idle check interval should have pass")

require.EqualValues(t, 1, atomic.LoadInt32(&tlm.stopped), "idle check interval should have passed")

Shaddoll · 2026-04-07T17:37:28Z


 	t.Run("Idle task-list", func(t *testing.T) {
 		ctrl := gomock.NewController(t)
 		tlm := createTestTaskListManagerWithConfig(t, testlogger.New(t), ctrl, cfg, clock.NewRealTimeSource())


Can we use MockedTimeSource and make the tests deterministic?

Done. Replaced all three subtests with MockedTimeSource and mockClock.Advance() instead of real-time sleeps. The liveness shutdown is now verified with require.Eventually/require.Never, so the tests are fully deterministic.

The test used a 10ms idle-check interval with an 8ms sleep before calling AddTask. The liveness checker fires at TTL/2 = 5ms intervals, so the second check at ~10ms raced against AddTask.MarkAlive() with only a ~2ms window. Under CI load, MarkAlive() often lost the race, causing the task list to shut down before the task could be added. Widen the idle-check interval to 100ms and scale all sleep durations proportionally (80ms pre-AddTask, 60ms post-AddTask, 200ms final wait). This gives a 20ms window for MarkAlive() — 10× the previous margin — while preserving the same test logic and coverage ratios. Closes cadence-workflow#7867 Signed-off-by: Asish Kumar <officialasishkumar@gmail.com>

Fixed "should have pass" to "should have passed" in three assertion messages in TestCheckIdleTaskList. Signed-off-by: Asish Kumar <officialasishkumar@gmail.com>

Replace clock.NewRealTimeSource() and time.Sleep-based assertions with clock.MockedTimeSource and Advance() calls so the liveness idle-check tests are fully deterministic and not subject to scheduling jitter. Signed-off-by: Asish Kumar <officialasishkumar@gmail.com>

gitar-bot · 2026-04-08T16:34:28Z

CI failed: Integration test suite encountered timing issues with background goroutines logging after test teardown, likely exacerbated by time-source changes in the test fix.

Overview

Analyzed 1 integration test log from job 70462425304. The test suite failed with FAIL status accompanied by repeated "COULD FAIL TEST" warnings from the global ratelimiter's background update loop. These warnings indicate that async background tasks continued logging after test completion, signaling a race condition or improper shutdown sequencing during test cleanup.

Failures

Async Task Logger Timing Race (confidence: medium)

Type: flaky_test
Affected jobs: 70462425304
Related to change: unclear
Root cause: The global ratelimiter's backgroundUpdateLoop is logging debug messages after test teardown has begun. The test's fallback logger lifecycle safeguard detects these late logs and warns "COULD FAIL TEST". This occurs across multiple test suites (TestClientIntegrationSuite, TestDecisionTimeoutMaxAttemptsIntegrationSuite, TestSizeLimitIntegrationSuite, etc.) and multiple rate limit collections (workerTaskList, asyncTaskList, userTaskList).
Suggested fix: Ensure all background goroutines in collection.go are properly stopped before test cleanup completes. The PR introduces MockedTimeSource for determinism—verify that time advancement in tests allows background timers to complete cleanly. Add explicit context cancellation or wait groups in test teardown to guarantee background loops exit before the test framework validates logger lifecycle.

Summary

Change-related failures: 1 flaky test failure (likely related; PR modifies time-based test behavior)
Infrastructure/flaky failures: Timing-sensitive background task shutdown during test teardown
Recommended action: Review test cleanup sequencing in task_list_manager_test.go and related integration tests. Ensure MockedTimeSource integration properly advances time so background loops can exit gracefully. Add explicit synchronization points to wait for backgroundUpdateLoop completion before test finishes.

Code Review ✅ Approved

Eliminates race condition in TestCheckIdleTaskList/Active_adding_task test. No issues found.

Rules ✅ All requirements met

Repository Rules

✅ GitHub Issue Linking Requirement: The PR description contains 'Closes #7867', a valid issue link in the cadence-workflow organization that satisfies the issue linking requirement despite exceeding 50 lines changed and having a 'fix' prefix.

✅ PR Description Quality Standards: The PR description meets all quality standards: it provides clear WHAT changed, detailed WHY explanation with timing analysis and root cause, specific copyable test commands for HOW, appropriate N/A marks for risks/release notes/documentation, and links the issue number as required by template guidance.

_{1 rule not applicable. Show all rules by commenting gitar display:verbose.}

Tip

Comment Gitar fix CI or enable auto-apply: gitar auto-apply:on

Options

Auto-apply is off → Gitar will not commit updates to this branch.
Display: compact → Showing less information.

Comment with these commands to change:

`Auto-apply`	`Compact`
`gitar auto-apply:on`	`gitar display:verbose`

_{Was this helpful? React with 👍 / 👎 | Gitar}

Copilot AI review requested due to automatic review settings April 4, 2026 09:27

Copilot started reviewing on behalf of officialasishkumar April 4, 2026 09:27 View session

gitar-bot bot assigned natemort and c-warren Apr 4, 2026

Copilot AI reviewed Apr 4, 2026

View reviewed changes

Shaddoll reviewed Apr 7, 2026

View reviewed changes

officialasishkumar added 3 commits April 8, 2026 16:30

fix(test): correct grammar in idle check assertion messages

0a905fd

Fixed "should have pass" to "should have passed" in three assertion messages in TestCheckIdleTaskList. Signed-off-by: Asish Kumar <officialasishkumar@gmail.com>

officialasishkumar force-pushed the fix/flaky-idle-tasklist-check-test branch from 9cc111f to 9f29e33 Compare April 8, 2026 16:31

Shaddoll approved these changes Apr 10, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(test): eliminate race in TestCheckIdleTaskList/Active_adding_task#7891

fix(test): eliminate race in TestCheckIdleTaskList/Active_adding_task#7891
officialasishkumar wants to merge 3 commits intocadence-workflow:masterfrom
officialasishkumar:fix/flaky-idle-tasklist-check-test

officialasishkumar commented Apr 4, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Apr 4, 2026

Uh oh!

officialasishkumar Apr 4, 2026

Uh oh!

Copilot AI Apr 4, 2026

Uh oh!

Copilot AI Apr 4, 2026

Uh oh!

Shaddoll Apr 7, 2026

Uh oh!

officialasishkumar Apr 8, 2026

Uh oh!

gitar-bot bot commented Apr 8, 2026 •

edited

Loading

Overview

Failures

Async Task Logger Timing Race (confidence: medium)

Summary

Repository Rules

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

	require.EqualValues(t, 1, atomic.LoadInt32(&tlm.stopped), "idle check interval should have pass")
	require.EqualValues(t, 1, atomic.LoadInt32(&tlm.stopped), "idle check interval should have passed")

Conversation

officialasishkumar commented Apr 4, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Copilot AI Apr 4, 2026

Choose a reason for hiding this comment

Uh oh!

officialasishkumar Apr 4, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 4, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 4, 2026

Choose a reason for hiding this comment

Uh oh!

Shaddoll Apr 7, 2026

Choose a reason for hiding this comment

Uh oh!

officialasishkumar Apr 8, 2026

Choose a reason for hiding this comment

Uh oh!

gitar-bot bot commented Apr 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Overview

Failures

Async Task Logger Timing Race (confidence: medium)

Summary

Repository Rules

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

gitar-bot bot commented Apr 8, 2026 •

edited

Loading