
feat(queuev2): add InMemQueue, cache config, and task key helpers#7954

Closed

arzonus wants to merge 1 commit into cadence-workflow:master from arzonus:queuev2-foundation

Conversation

arzonus (Contributor) commented Apr 15, 2026

What changed?

Adds the foundational building blocks for the cached timer queue reader (relates to #7953):

  • InMemQueue — a sorted in-memory task store backed by binary-search insertion, with PutTasks, GetTasks, LookAHead, LTrim, RTrimBySize, and Clear. Deduplicates on insert by task key.
  • Nine new dynamic config properties for the cache (TimerProcessorCacheMaxSize, CacheMaxLookAheadWindow, CachePrefetchTriggerWindow, CacheWarmupGracePeriod, CacheEvictionSafeWindow, CacheMinPrefetchInterval, CacheTimeEvictionInterval, CachedQueueReaderMode, EnableCachedScheduledQueue). All are global (no shard filter).
  • HistoryTaskKey.Less and HistoryTaskKey.Equal helpers used by InMemQueue for ordering and deduplication.
  • New metric definitions for cache hit/miss counters and a size histogram.
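
To make the ordering and deduplication behaviour concrete, here is a minimal sketch of the sorted-insert idea described above. The names (taskKey, inMemQueue, PutTask) and the key fields are hypothetical stand-ins for the PR's HistoryTaskKey and InMemQueue, not the actual implementation; the point is only how binary-search insertion plus a key-equality check gives sorted order with dedup on insert.

```go
package main

import (
	"fmt"
	"sort"
)

// taskKey is a hypothetical stand-in for HistoryTaskKey: timers order
// primarily by fire time, then by task ID.
type taskKey struct {
	FireTime int64
	TaskID   int64
}

// Less defines the total order used for sorted insertion.
func (k taskKey) Less(o taskKey) bool {
	if k.FireTime != o.FireTime {
		return k.FireTime < o.FireTime
	}
	return k.TaskID < o.TaskID
}

// Equal is used to deduplicate on insert.
func (k taskKey) Equal(o taskKey) bool { return k == o }

// inMemQueue keeps keys sorted; PutTask finds the insertion point with
// sort.Search and skips keys that are already present.
type inMemQueue struct {
	keys []taskKey
}

func (q *inMemQueue) PutTask(k taskKey) {
	i := sort.Search(len(q.keys), func(i int) bool { return !q.keys[i].Less(k) })
	if i < len(q.keys) && q.keys[i].Equal(k) {
		return // duplicate key: drop it
	}
	q.keys = append(q.keys, taskKey{}) // grow by one
	copy(q.keys[i+1:], q.keys[i:])     // shift the tail right
	q.keys[i] = k                      // place the new key
}

func main() {
	q := &inMemQueue{}
	for _, k := range []taskKey{{10, 2}, {5, 1}, {10, 2}, {7, 3}} {
		q.PutTask(k)
	}
	fmt.Println(q.keys) // sorted, with the duplicate {10,2} inserted once
}
```
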

Why?

The timer queue reader currently issues a DB round-trip for every GetTask call. The goal is to eliminate those reads for tasks that fall within a pre-fetched look-ahead window. This PR lays the groundwork — the InMemQueue is the in-memory store the cached reader will use, and the config properties control its behaviour. Nothing is wired into production code yet.

How did you test it?

go test -race -count=1 ./service/history/queuev2/... ./service/history/config/... ./common/persistence/...
make pr GEN_DIR=service/history/queuev2
make build

All passed.

Potential risks

None — no changes to existing production code paths. All new types and config keys.

Release notes

N/A — internal change, not user-facing.

Documentation Changes

N/A

arzonus force-pushed the queuev2-foundation branch from 45aa0a2 to ae626f8 on April 15, 2026 at 09:44
arzonus force-pushed the queuev2-foundation branch 6 times, most recently from a3bb473 to ab9fe3a, on April 15, 2026 at 14:00
Signed-off-by: Seva Kaloshin <seva.kaloshin@gmail.com>
arzonus force-pushed the queuev2-foundation branch from ab9fe3a to 42af244 on April 15, 2026 at 14:14
gitar-bot commented Apr 15, 2026

CI failed: The `TestReplicationSimulation` integration test failed due to a `deadline-exceeded` error during a regional failover operation, likely caused by latency or synchronization issues introduced by the new `queuev2` components.

Overview

The TestReplicationSimulation integration test consistently failed during the regional failover simulation. This failure occurs when attempting to start a workflow immediately following a change to cluster attributes, suggesting a potential regression in request handling or service responsiveness in the new queuev2 implementation.

Failures

Replication Simulation Failure (confidence: medium)

  • Type: test
  • Affected jobs: 71469692572
  • Related to change: yes
  • Root cause: The system failed to handle a workflow start request following a failover, resulting in a deadline-exceeded error and RST_STREAM (CANCEL). This points to either a deadlock, increased latency, or incomplete state propagation within the InMemQueue or cache configuration components introduced in this PR.
  • Suggested fix: Investigate the interaction between InMemQueue and the replication control plane. Verify if the failover status propagation is being blocked by the new queue logic and consider increasing the timeout for simulation steps if the behavior is due to expected initialization overhead.

Summary

  • Change-related failures: 1, specifically within the replication integration test suite.
  • Infrastructure/flaky failures: 0.
  • Recommended action: Review the queuev2 integration points. Use local simulation logs to determine if the InMemQueue is correctly handling requests during cluster attribute transitions and ensure that no blocking calls were introduced in the request path.
Code Review: ✅ Approved (3 resolved / 3 findings)

InMemQueue implementation now safely handles zero-page-size requests, optimizes memory efficiency by truncating before collection, and prevents memory leaks by clearing interface references. All identified logic and memory management issues are resolved.

✅ 3 resolved
Bug: GetTasks panics when PageSize is 0 and tasks match

📄 service/history/queuev2/queue_mem.go:139-141
In GetTasks, after the scan loop, when len(tasks) > req.PageSize is true (e.g. 1 > 0), tasks is truncated to tasks[:0] (empty slice). Then tasks[len(tasks)-1] evaluates to tasks[-1], causing an index-out-of-range panic.

While callers are expected to pass a positive PageSize, PageSize is a plain int field defaulting to 0. A defensive check avoids a production panic if a caller ever passes 0.
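
A minimal sketch of the defensive check, under assumed names (getTasksPage and an int64 key are stand-ins, not the PR's actual signature): guard the non-positive page size before truncating, so the last-element read can never index an empty slice.

```go
package main

import "fmt"

// getTasksPage is a hypothetical sketch of the defensive truncation:
// a non-positive pageSize (or an empty result set) returns early
// instead of evaluating tasks[len(tasks)-1] on an empty slice.
func getTasksPage(tasks []int64, pageSize int) (page []int64, lastKey int64, ok bool) {
	if pageSize <= 0 || len(tasks) == 0 {
		return nil, 0, false // no page, and no last key to report
	}
	if len(tasks) > pageSize {
		tasks = tasks[:pageSize] // truncate to the requested page
	}
	return tasks, tasks[len(tasks)-1], true
}

func main() {
	// PageSize 0 no longer panics; it simply yields no page.
	if _, _, ok := getTasksPage([]int64{1, 2, 3}, 0); !ok {
		fmt.Println("pageSize 0 handled without panic")
	}
	page, last, _ := getTasksPage([]int64{1, 2, 3}, 2)
	fmt.Println(page, last)
}
```
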

Performance: GetTasks collects all matching tasks before truncating to PageSize

📄 service/history/queuev2/queue_mem.go:128-141
The scan loop (lines 129-136) appends every predicate-matching task to tasks across the entire [min, max) key range, then truncates to PageSize afterward. If the range contains many more matching tasks than PageSize, this unnecessarily allocates and fills a large slice.

The simpleQueueReader.GetTask in queue_reader.go delegates page-size limiting to the DB, so it doesn't have this issue. The in-memory variant should break early too.
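
The early-exit pattern the finding asks for can be sketched as follows (collectUpTo and the int keys are illustrative stand-ins for the PR's scan loop, not its real code): stop appending the moment the page is full, so matches beyond PageSize are never collected.

```go
package main

import "fmt"

// collectUpTo scans in key order and stops as soon as pageSize
// predicate matches are gathered, rather than filtering the whole
// range and truncating afterwards.
func collectUpTo(keys []int, match func(int) bool, pageSize int) []int {
	out := make([]int, 0, pageSize) // never allocate beyond one page
	for _, k := range keys {
		if !match(k) {
			continue
		}
		out = append(out, k)
		if len(out) == pageSize {
			break // remaining matches are never visited or stored
		}
	}
	return out
}

func main() {
	even := func(k int) bool { return k%2 == 0 }
	fmt.Println(collectUpTo([]int{1, 2, 3, 4, 5, 6, 7, 8}, even, 2))
}
```
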

Performance: RTrimBySize retains GC references to trimmed interface values

📄 service/history/queuev2/queue_mem.go:198-201
RTrimBySize truncates the slice with q.tasks = q.tasks[:maxSize] (line 200), but since persistence.Task is an interface type, the backing array still holds references to the removed task objects beyond maxSize. These objects won't be garbage-collected until the backing array itself is reallocated.

Contrast with LTrim (line 179) which correctly copies into a fresh slice, releasing old references. For consistency and to avoid unbounded memory retention after repeated trims, RTrimBySize should nil out the tail elements or allocate a new slice.
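
The nil-out variant of the fix can be sketched like this (rTrimBySize over a plain []interface{} is a stand-in for the method on the real task slice): clear the tail elements before reslicing so the backing array no longer pins the trimmed objects.

```go
package main

import "fmt"

// rTrimBySize shrinks a slice of interface values to maxSize, nilling
// the trimmed tail so the shared backing array stops retaining
// references to the removed objects.
func rTrimBySize(tasks []interface{}, maxSize int) []interface{} {
	if len(tasks) <= maxSize {
		return tasks
	}
	for i := maxSize; i < len(tasks); i++ {
		tasks[i] = nil // release the reference so the GC can collect it
	}
	return tasks[:maxSize]
}

func main() {
	tasks := []interface{}{"a", "b", "c", "d"}
	trimmed := rTrimBySize(tasks, 2)
	// The original full-length view shows the tail was cleared.
	fmt.Println(len(trimmed), tasks[2], tasks[3])
}
```

Allocating a fresh slice (as LTrim does) achieves the same release; nilling in place avoids the copy when the queue is trimmed repeatedly.
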

Rules ✅ All requirements met

Repository Rules

GitHub Issue Linking Requirement: The PR description contains a valid issue link (#7953) which references a cadence-workflow issue.
PR Description Quality Standards: The PR description contains all required sections with substantive content, clearly explains the 'Why' for the changes, and includes specific test commands.

1 rule not applicable.


arzonus (Contributor, Author) commented Apr 16, 2026

Closing in favour of a single consolidated draft PR. All changes are preserved in the feature branch.
