feat(simulation): add cached scheduled queue simulation scenarios#7957

Closed
arzonus wants to merge 4 commits into cadence-workflow:master from arzonus:queuev2-simulation-scenarios

Conversation


@arzonus arzonus commented Apr 15, 2026

What changed?

Adds two end-to-end simulation scenarios for the cached timer queue reader (relates to #7953).

Part of a stack — merge after PR #7956. Requires the simulation infra from PR #7952 (NumWorkflowSleeps, SleepBetweenWorkflowStarts, configurable timeout) to be merged first.

Why?

The cached scheduled queue needs end-to-end validation in a real cluster before being enabled in production. Simulation tests spin up a local Docker cluster and run real workflows, which catches integration issues that unit tests cannot (e.g. task loss, double-processing, cache/DB divergence under load).

Two scenarios cover the two planned rollout stages:

  • queuev2_scheduled_cache_shadow — runs 1,000 workflows with 10 timer sleeps each under shadow mode. The cache is populated and compared against DB reads, but correctness is unaffected. Mismatches surface as warnings before enabling the cache.
  • queuev2_scheduled_cache_enabled — same workload under enabled mode. Timer task reads are served from the in-memory cache where the look-ahead window covers the task.

How did you test it?

./simulation/history/run.sh --scenario queuev2_scheduled_cache_shadow --dockerfile-suffix .local
./simulation/history/run.sh --scenario queuev2_scheduled_cache_enabled --dockerfile-suffix .local

Both passed. All created transfer tasks were executed. Unexecuted timer tasks (types 3 and 4 — next-day retention timers) are expected and are not a failure condition.

Potential risks

None — changes are confined to simulation test infrastructure and YAML config files. No production code is affected.

Release notes

N/A — simulation test infrastructure only.

Documentation Changes

N/A

Comment thread service/history/queuev2/queue_mem.go
@arzonus arzonus force-pushed the queuev2-simulation-scenarios branch from 2af9911 to 4bbc9c4 on April 15, 2026 at 09:44
Comment thread service/history/queuev2/queue_reader_cached.go Outdated
@arzonus arzonus force-pushed the queuev2-simulation-scenarios branch from 4bbc9c4 to 78ae8f7 on April 15, 2026 at 09:56
Comment thread service/history/queuev2/queue_reader_cached.go
@arzonus arzonus force-pushed the queuev2-simulation-scenarios branch from 78ae8f7 to d57a7b3 on April 15, 2026 at 10:03
Comment thread service/history/queuev2/queue_reader_cached.go
Comment on lines +242 to +249
func (q *cachedQueueReader) updateExclusiveUpperBound(key persistence.HistoryTaskKey) {
q.logger.Debug("upper bound advancing",
tag.Dynamic("prevUpperBound", q.exclusiveUpperBound),
tag.Dynamic("newUpperBound", key),
)
q.exclusiveUpperBound = key
q.notifyPrefetch()
}
💡 Bug: notifyPrefetch called while holding q.mu may delay prefetch

updateExclusiveUpperBound (line 242) sends on q.prefetchCh via notifyPrefetch. The comment on line 241 says 'Caller must hold q.mu', and indeed it is called from prefetch() and putTasks() under q.mu.Lock(). In prefetchLoop, when prefetchCh fires (line 191), nextPrefetchDelay() tries to acquire q.mu.RLock() (line 255). Because Go's sync.RWMutex is writer-preferring, if any other goroutine is waiting for q.mu.Lock(), the RLock in nextPrefetchDelay will block. This isn't a deadlock (the lock holder is the sender, not the receiver), but it's worth noting that notifyPrefetch is non-blocking (buffered channel, line 234-237), so the signal is just queued for the next loop iteration. No functional bug, but the pattern is subtle and a comment would help.

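For illustration, the non-blocking notify pattern this comment describes (a buffered channel of size 1 that coalesces signals, so a send while a signal is already pending is dropped rather than blocking the caller) can be sketched as follows. The `notify` helper here is a hypothetical standalone version, not the actual `notifyPrefetch` from the PR:

```go
package main

import "fmt"

// notify performs a non-blocking send on a size-1 buffered channel.
// If a signal is already pending, the new signal coalesces with it and
// the function returns immediately; the caller never blocks, even while
// holding a lock.
func notify(ch chan struct{}) bool {
	select {
	case ch <- struct{}{}:
		return true // buffer was empty; signal queued for the loop
	default:
		return false // a signal is already pending; coalesce
	}
}

func main() {
	ch := make(chan struct{}, 1)
	fmt.Println(notify(ch)) // true: signal queued
	fmt.Println(notify(ch)) // false: coalesced with the pending signal
	<-ch                    // the loop drains the single pending signal
	fmt.Println(notify(ch)) // true: buffer empty again
}
```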

Comment thread service/history/queuev2/queue_mem.go
@arzonus arzonus force-pushed the queuev2-simulation-scenarios branch 4 times, most recently from d508d52 to 0bc2c3b on April 15, 2026 at 14:00
arzonus added 4 commits April 15, 2026 16:12
Signed-off-by: Seva Kaloshin <seva.kaloshin@gmail.com>
…ction

Signed-off-by: Seva Kaloshin <seva.kaloshin@gmail.com>
Signed-off-by: Seva Kaloshin <seva.kaloshin@gmail.com>
Signed-off-by: Seva Kaloshin <seva.kaloshin@gmail.com>
@arzonus arzonus force-pushed the queuev2-simulation-scenarios branch from 0bc2c3b to 48ebf0a on April 15, 2026 at 14:14
gitar-bot bot commented Apr 15, 2026

CI failed: Integration tests are failing due to race conditions in the global ratelimiter's background update loop, which logs warnings after test teardown.

Overview

Multiple integration test suites are failing due to 'logged too late' errors in the global-ratelimiter component. These errors appear consistently during test teardown across various suites, indicating a potential race condition where background goroutines outlive the test execution scope.

Failures

Global Ratelimiter Background Log Mismatch (confidence: high)

  • Type: flaky_test
  • Affected jobs: Various integration test suites (including TestTaskListIsolationSuite, TestSizeLimitIntegrationSuite, TestTaskListSuite, etc.)
  • Related to change: Yes
  • Root cause: The collection.backgroundUpdateLoop in the global ratelimiter continues to log after a test has finished, triggering the testlogger package's enforcement against late logging.
  • Suggested fix: Since this behavior is exacerbated by recent changes, investigate if the new CachedScheduledQueue or CachedQueueReader logic is interacting with, or creating, long-lived components that share the global ratelimiter. Ensure proper cleanup or shutdown of the ratelimiter background loop within the test's tear-down phase to ensure all background tasks are quiesced before the test completes.

Summary

  • Change-related failures: 1, specifically integration test timeouts/crashes due to background logging races.
  • Infrastructure/flaky failures: 0 (The issue is likely exacerbated by, rather than independent of, the new simulation components).
  • Recommended action: Review the initialization and lifecycle management of the global-ratelimiter in the new simulation scenarios. Ensure that any newly instantiated components that rely on this rate limiter are properly closed or canceled in the test TearDown methods.
Code Review: 👍 Approved with suggestions (5 resolved / 6 findings)

Implements cached scheduled queue simulation scenarios and resolves performance issues related to scan efficiency, cache hit boundaries, and warmup synchronization. Ensure notifyPrefetch is called outside the q.mu lock to avoid potential prefetch delays.

💡 Bug: notifyPrefetch called while holding q.mu may delay prefetch

📄 service/history/queuev2/queue_reader_cached.go:242-249 📄 service/history/queuev2/queue_reader_cached.go:254-268

updateExclusiveUpperBound (line 242) sends on q.prefetchCh via notifyPrefetch. The comment on line 241 says 'Caller must hold q.mu', and indeed it is called from prefetch() and putTasks() under q.mu.Lock(). In prefetchLoop, when prefetchCh fires (line 191), nextPrefetchDelay() tries to acquire q.mu.RLock() (line 255). Because Go's sync.RWMutex is writer-preferring, if any other goroutine is waiting for q.mu.Lock(), the RLock in nextPrefetchDelay will block. This isn't a deadlock (the lock holder is the sender, not the receiver), but it's worth noting that notifyPrefetch is non-blocking (buffered channel, line 234-237), so the signal is just queued for the next loop iteration. No functional bug, but the pattern is subtle and a comment would help.

✅ 5 resolved
Performance: InMemQueue.GetTasks scans all matching tasks before truncating

📄 service/history/queuev2/queue_mem.go:128-141
The GetTasks loop collects every predicate-matching task in the range into a slice, then truncates to PageSize afterward. If the cache contains many tasks (up to MaxSize, default 1000) and PageSize is small (e.g. 100), the intermediate allocation is unnecessary. Adding a break once len(tasks) > req.PageSize would short-circuit the scan and avoid extra predicate evaluations and allocations.
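The short-circuit this finding suggests can be sketched as below. This is a hypothetical simplified version (`task`, `getPage`, and the integer keys stand in for the real `persistence` types): the loop stops as soon as one page plus one extra task has been collected, the extra task signaling that another page exists.

```go
package main

import "fmt"

// task is a stand-in for the real queue task type, keyed by an int.
type task struct{ key int }

// getPage scans tasks in [min, max) that satisfy match, but breaks as soon
// as pageSize+1 results are collected, instead of gathering every match and
// truncating afterward.
func getPage(all []task, min, max, pageSize int, match func(task) bool) []task {
	var out []task
	for _, t := range all {
		if t.key < min || t.key >= max || !match(t) {
			continue
		}
		out = append(out, t)
		if len(out) > pageSize { // one extra result signals a next page
			break // short-circuit: skip remaining predicate evaluations
		}
	}
	return out
}

func main() {
	all := make([]task, 1000) // cache at its default MaxSize
	for i := range all {
		all[i] = task{key: i}
	}
	page := getPage(all, 0, 1000, 100, func(task) bool { return true })
	fmt.Println(len(page)) // 101: one full page plus the "has more" sentinel
}
```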

Bug: LookAHead cache hit ignores inclusiveLowerBound, may skip tasks

📄 service/history/queuev2/queue_reader_cached.go:644-646
In cachedQueueReader.LookAHead, the coverage check on line 645-646 only verifies that exclusiveUpperBound is initialized and that the request's InclusiveMinTaskKey is at-or-before the upper bound. Unlike GetTask (which calls isRangeCovered checking both bounds), LookAHead does not verify that InclusiveMinTaskKey >= inclusiveLowerBound.

If the request's InclusiveMinTaskKey falls below the cache's inclusiveLowerBound (e.g., because time-based eviction advanced the lower bound ahead of the processor's actual progress), the cache will report "covered" and return the first task it has (at or after inclusiveLowerBound). But the DB may still contain un-acked tasks in [InclusiveMinTaskKey, inclusiveLowerBound) that were evicted. This would cause the timer gate to be armed too far into the future, delaying processing of those tasks until the next full poll cycle.

In practice the eviction safe window makes this unlikely, but the asymmetry with GetTask is a latent correctness issue.
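The symmetric coverage check the finding asks for can be sketched with integer keys standing in for `persistence.HistoryTaskKey`. The point is that a cached window only covers a request when the requested minimum key lies within both bounds, not merely below the upper one:

```go
package main

import "fmt"

// isCovered mirrors the two-sided check GetTask performs: a request is only
// served from cache when its minimum key falls inside the cached window
// [inclusiveLowerBound, exclusiveUpperBound). Checking only the upper bound
// would wrongly report coverage for keys evicted below the lower bound.
func isCovered(minKey, inclusiveLowerBound, exclusiveUpperBound int) bool {
	return minKey >= inclusiveLowerBound && minKey < exclusiveUpperBound
}

func main() {
	lower, upper := 50, 200 // cache window after time-based eviction
	fmt.Println(isCovered(100, lower, upper)) // true: inside the window
	fmt.Println(isCovered(10, lower, upper))  // false: below the evicted lower bound
}
```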

Edge Case: Inject proceeds before Start if called early (warmup bypassed)

📄 service/history/queuev2/queue_reader_cached.go:482-496
In cachedQueueReader.Inject, the warmup check compares clock.Now() against startTime + WarmupGracePeriod. Before Start() is called, startTime is the zero time.Time{}, so startTime.Add(gracePeriod) resolves to a date in year 0001—always in the past. This means Inject will accept tasks into the cache before the background prefetch loop is running.

In practice this is unlikely to matter because cachedScheduledQueue.Start() calls reader.Start() before scheduledQueue.Start(), and NotifyNewTask is invoked by the shard engine after the queue is started. However, if a caller were to invoke NotifyNewTask before Start() completes, tasks would be inserted into the cache without any background eviction or prefetch running, potentially violating the cache window invariants.

Consider adding a status check in Inject (similar to Start/Stop) so it's a no-op when the daemon is not started, or document this as an accepted precondition.

Edge Case: Inject bypasses warmup when called before Start()

📄 service/history/queuev2/queue_reader_cached.go:482-496
The warmup check in Inject (line 491) compares q.clock.Now().Before(q.startTime.Add(q.options.WarmupGracePeriod())). Before Start() is called, q.startTime is the zero-value time.Time{}, so q.startTime.Add(30s) evaluates to a time in year 1. Since Now() is always after year 1, inWarmup evaluates to false and Inject proceeds to insert tasks into the cache, exactly the opposite of the intended protection.

In the current wiring (newCachedScheduledQueue → cachedScheduledQueue.Start), Start is called before the queue processes events, so the NotifyNewTask → Inject path is unlikely to fire before Start. However, the CachedQueueReader interface is public and there is no status check guarding Inject, so any future caller that injects tasks before starting the reader will silently bypass warmup.

This was flagged in the previous round and is still present.

Performance: InMemQueue.PutTasks is O(n*m) for unsorted batch inserts

📄 service/history/queuev2/queue_mem.go:98-102 📄 service/history/queuev2/queue_mem.go:64-78
PutTasks (line 98) inserts tasks one-by-one via putTask, which does a binary-search + slice-copy for each non-append-path task. When the incoming batch is not sorted by key (e.g. mixed timer types), each mid-slice insertion shifts O(n) elements, giving O(n*m) total work where m is batch size and n is queue length. For the current default cap of 1000 this is fine, but if MaxSize is increased this could become a bottleneck. A sort-then-merge approach would be O((n+m) log(n+m)).
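The sort-then-merge alternative can be sketched with plain int keys standing in for task keys: sort the incoming batch once (O(m log m)), then merge it with the already-sorted queue in a single pass (O(n+m)), instead of paying an O(n) slice shift per mid-slice insertion. `mergeBatch` is a hypothetical simplification of what `PutTasks` could do:

```go
package main

import (
	"fmt"
	"sort"
)

// mergeBatch inserts an unsorted batch into an already-sorted slice by
// sorting the batch once and then merging the two sorted slices in a
// single linear pass, avoiding per-element mid-slice shifts.
func mergeBatch(existing, batch []int) []int {
	sort.Ints(batch)
	out := make([]int, 0, len(existing)+len(batch))
	i, j := 0, 0
	for i < len(existing) && j < len(batch) {
		if existing[i] <= batch[j] {
			out = append(out, existing[i])
			i++
		} else {
			out = append(out, batch[j])
			j++
		}
	}
	out = append(out, existing[i:]...)
	out = append(out, batch[j:]...)
	return out
}

func main() {
	existing := []int{10, 20, 30, 40}
	batch := []int{35, 5, 25} // unsorted batch, e.g. mixed timer types
	fmt.Println(mergeBatch(existing, batch)) // [5 10 20 25 30 35 40]
}
```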


Rules ✅ All requirements met

Repository Rules

GitHub Issue Linking Requirement: The PR contains a reference to issue #7953, satisfying the requirement to link to a cadence-workflow issue.
PR Description Quality Standards: The PR description strictly follows the template structure and provides all required information, including technical context, motivation, clear test commands, and risk assessment.

1 rule not applicable.



arzonus commented Apr 16, 2026

Closing in favour of a single consolidated draft PR. All changes are preserved in the feature branch.
