fix(test): eliminate race in TestCheckIdleTaskList/Active_adding_task#7891
Conversation
There was a problem hiding this comment.
Pull request overview
This PR addresses a flaky timing-sensitive test in service/matching/tasklist by increasing the configured idle-check interval and scaling related sleeps to provide more scheduling margin under CI load (closing #7867).
Changes:
- Increased
IdleTasklistCheckIntervalinTestCheckIdleTaskListfrom 10ms to 100ms. - Scaled all sleeps in the affected subtests proportionally to preserve the same relative timing/logic.
- Added an inline comment explaining the timing/race rationale for the change.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| require.EqualValues(t, 0, atomic.LoadInt32(&tlm.stopped), "idle check interval had not passed yet") | ||
| time.Sleep(20 * time.Millisecond) | ||
| time.Sleep(200 * time.Millisecond) | ||
| require.EqualValues(t, 1, atomic.LoadInt32(&tlm.stopped), "idle check interval should have pass") |
There was a problem hiding this comment.
Assertion message grammar: "idle check interval should have pass" should be "idle check interval should have passed" for clarity when the test fails.
There was a problem hiding this comment.
Good catch, fixed all three occurrences.
|
|
||
| time.Sleep(20 * time.Millisecond) | ||
| time.Sleep(200 * time.Millisecond) | ||
| require.EqualValues(t, 1, atomic.LoadInt32(&tlm.stopped), "idle check interval should have pass") |
There was a problem hiding this comment.
Assertion message grammar: "idle check interval should have pass" should be "idle check interval should have passed" for clarity when the test fails.
|
|
||
| time.Sleep(20 * time.Millisecond) | ||
| time.Sleep(200 * time.Millisecond) | ||
| require.EqualValues(t, 1, atomic.LoadInt32(&tlm.stopped), "idle check interval should have pass") |
There was a problem hiding this comment.
Assertion message grammar: "idle check interval should have pass" should be "idle check interval should have passed" for clarity when the test fails.
| require.EqualValues(t, 1, atomic.LoadInt32(&tlm.stopped), "idle check interval should have pass") | |
| require.EqualValues(t, 1, atomic.LoadInt32(&tlm.stopped), "idle check interval should have passed") |
|
|
||
| t.Run("Idle task-list", func(t *testing.T) { | ||
| ctrl := gomock.NewController(t) | ||
| tlm := createTestTaskListManagerWithConfig(t, testlogger.New(t), ctrl, cfg, clock.NewRealTimeSource()) |
There was a problem hiding this comment.
Can we use MockedTimeSource and make the tests deterministic?
There was a problem hiding this comment.
Done. Replaced all three subtests with MockedTimeSource and mockClock.Advance() instead of real-time sleeps. The liveness shutdown is now verified with require.Eventually/require.Never, so the tests are fully deterministic.
The test used a 10ms idle-check interval with an 8ms sleep before calling AddTask. The liveness checker fires at TTL/2 = 5ms intervals, so the second check at ~10ms raced against AddTask.MarkAlive() with only a ~2ms window. Under CI load, MarkAlive() often lost the race, causing the task list to shut down before the task could be added. Widen the idle-check interval to 100ms and scale all sleep durations proportionally (80ms pre-AddTask, 60ms post-AddTask, 200ms final wait). This gives a 20ms window for MarkAlive() — 10× the previous margin — while preserving the same test logic and coverage ratios. Closes cadence-workflow#7867 Signed-off-by: Asish Kumar <officialasishkumar@gmail.com>
Fixed "should have pass" to "should have passed" in three assertion messages in TestCheckIdleTaskList. Signed-off-by: Asish Kumar <officialasishkumar@gmail.com>
Replace clock.NewRealTimeSource() and time.Sleep-based assertions with clock.MockedTimeSource and Advance() calls so the liveness idle-check tests are fully deterministic and not subject to scheduling jitter. Signed-off-by: Asish Kumar <officialasishkumar@gmail.com>
9cc111f to
9f29e33
Compare
CI failed: Integration test suite encountered timing issues with background goroutines logging after test teardown, likely exacerbated by time-source changes in the test fix.OverviewAnalyzed 1 integration test log from job 70462425304. The test suite failed with FAIL status accompanied by repeated "COULD FAIL TEST" warnings from the global ratelimiter's background update loop. These warnings indicate that async background tasks continued logging after test completion, signaling a race condition or improper shutdown sequencing during test cleanup. FailuresAsync Task Logger Timing Race (confidence: medium)
Summary
Code Review ✅ ApprovedEliminates race condition in TestCheckIdleTaskList/Active_adding_task test. No issues found. Rules ✅ All requirements metRepository Rules
1 rule not applicable. Show all rules by commenting Tip Comment OptionsAuto-apply is off → Gitar will not commit updates to this branch. Comment with these commands to change:
Was this helpful? React with 👍 / 👎 | Gitar |
What changed?
Increased the idle-check interval in
TestCheckIdleTaskListfrom 10ms to 100ms and scaled all sleep durations proportionally. Closes #7867.Why?
The test used a 10ms
IdleTasklistCheckIntervalwith an 8ms sleep before callingAddTask. The liveness checker fires atTTL/2 = 5msintervals, so the second check at ~10ms raced againstAddTask.MarkAlive()with only a ~2ms window. Under CI load,MarkAlive()often lost the race, causing the task list to shut down before the task could be added, failing withtask list shutting down.The timing analysis:
AddTaskwhich must reachMarkAlive()within ~2ms before the 10ms check fires. Under heavy CI load, this 2ms window is insufficient.AddTaskwhich now has ~20ms before the 100ms check fires — 10× the previous margin.The test logic, coverage ratios, and assertions remain identical; only the absolute timing values have changed.
How did you test it?
All subtests (
Idle_task-list,Active_poll-er,Active_adding_task) passed on all 5 iterations without the race detector and 3 iterations with it. Previously, theActive_adding_tasksubtest would intermittently fail.Potential risks
N/A — test-only change. The production code and idle-check behavior are unmodified. The increased sleep durations add ~0.3s per subtest, which is negligible for CI.
Release notes
N/A — test-only fix.
Documentation Changes
N/A