fix: trim orphaned workflow timer tasks on workflow close and deletion #7941
fimanishi wants to merge 6 commits into cadence-workflow:master
CI failed: Two CI failures: (1) integration tests flaking due to background goroutines logging after test completion in the global rate limiter, and (2) linting check failing because generated files were modified by the build process but not committed.
Overview
Analyzed 2 unique error templates across 2 CI logs. One failure is a change-related linting issue requiring committed generated file updates; the other is a flaky infrastructure/timing issue in integration tests unrelated to this PR's changes.
Failures
Integration Test Flakiness from Rate Limiter Goroutine Leaks (confidence: high)
Generated Files Not Committed (confidence: high)
Summary
Code Review ✅ Approved (2 resolved / 2 findings)
Fixes orphaned workflow timer task cleanup by addressing a nil function pointer panic in the timer executor feature-flag check and adding defensive copies to prevent goroutine race conditions when reading mutable state.
✅ 2 resolved
✅ Bug: Nil function pointer panic in timer executor feature-flag check
✅ Edge Case: Goroutine reads mutable state slice without defensive copy
Rules ✅ All requirements met
Signed-off-by: fimanishi <fimanishi@gmail.com>
fb4ac53 to 8a77d18
- Add nil check for EnableOrphanedTimerCleanup in timer_task_executor_base.go, consistent with the other two call sites in context.go and execution_manager.go
- Defensive copy of workflowTimerTaskInfos slice before spawning goroutine in deleteWorkflowTimerTasksBestEffortAsync to prevent a potential data race if mutation-time tracking is added in the future
Signed-off-by: fimanishi <fimanishi@gmail.com>
…est.go Signed-off-by: fimanishi <fimanishi@gmail.com>
What changed?
Added tracking and cleanup of orphaned workflow-level timer tasks.
At workflow creation, timer task IDs and timestamps are captured and stored as a blob on the execution record. When the workflow closes, tracked timers are deleted in two places: immediately at close time (async, best-effort) and again at retention-based deletion (with retry) as a safety net. Timers scheduled to fire within a configurable threshold (default 24h) are skipped — they'll fire naturally. Both paths are gated behind a feature flag (system.enableOrphanedWorkflowTimerCleanup, default false).
Fixes #7568
Why?
When a workflow closes before its timers fire, those timer task rows remain in the database until the scheduled time — potentially hours or days later. In Cassandra, this is particularly harmful for cron workflows: all runs share the same partition key, so orphaned timers accumulate in a single partition and degrade database performance. High-frequency workflows with long timeouts cause the same accumulation across partitions.
How did you test it?
go test -race ./common/persistence/...
go test -race ./common/persistence/nosql/nosqlplugin/cassandra/...
go test -race ./service/history/execution/...
go test -race ./service/history/task/...
go test -race ./tools/common/schema/...
go test -race ./service/history/...
Potential risks
Detailed Description
New Cassandra schema migration (v0.47) adds two nullable blob columns to the executions table.
These columns store a serialized list of workflow-level timer task references (task ID, visibility timestamp, timeout type) on the execution row.
New WorkflowTimerTaskInfos field added to WorkflowMutableState.
New DeleteTimerTask method added to ExecutionManager and ExecutionStore interfaces.
Impact Analysis
Testing Plan
Rollout Plan
Release notes
Workflow timer tasks are now cleaned up when a workflow closes early, preventing orphaned timer accumulation in the database.
Documentation Changes
system.enableOrphanedWorkflowTimerCleanup
Enables cleanup of orphaned workflow timer tasks when a workflow closes before its timers fire. Requires Cassandra schema v0.47. Safe to enable/disable at any time.
history.orphanedTimerDeletionMinTTL
Timers scheduled to fire within this window are skipped; they fire and clean up naturally. Only applies when system.enableOrphanedWorkflowTimerCleanup is enabled.