Skip to content

fix: trim orphaned workflow timer tasks on workflow close and deletion#7941

Draft
fimanishi wants to merge 6 commits intocadence-workflow:masterfrom
fimanishi:trimming-timers-on-workflow-close
Draft

fix: trim orphaned workflow timer tasks on workflow close and deletion#7941
fimanishi wants to merge 6 commits intocadence-workflow:masterfrom
fimanishi:trimming-timers-on-workflow-close

Conversation

@fimanishi
Copy link
Copy Markdown
Member

@fimanishi fimanishi commented Apr 10, 2026

What changed?
Added tracking and cleanup of orphaned workflow-level timer tasks.

At workflow creation, timer task IDs and timestamps are captured and stored as a blob on the execution record. When the workflow closes, tracked timers are deleted in two places: immediately at close time (async, best-effort) and again at retention-based deletion (with retry) as a safety net. Timers scheduled to fire within a configurable threshold (default 24h) are skipped — they'll fire naturally. Both paths are gated behind a feature flag (system.enableOrphanedWorkflowTimerCleanup, default false).

Fixes #7568

Why?
When a workflow closes before its timers fire, those timer task rows remain in the database until the scheduled time — potentially hours or days later. In Cassandra, this is particularly harmful for cron workflows: all runs share the same partition key, so orphaned timers accumulate in a single partition and degrade database performance. High-frequency workflows with long timeouts cause the same accumulation across partitions.

How did you test it?
go test -race ./common/persistence/...

go test -race ./common/persistence/nosql/nosqlplugin/cassandra/...

go test -race ./service/history/execution/...

go test -race ./service/history/task/...

go test -race ./tools/common/schema/...

go test -race ./service/history/...

Potential risks

  • Critical path: the close-time cleanup runs in UpdateWorkflowExecutionWithNew, which is on the hot path for every workflow state transition. The goroutine is only spawned when the feature flag is enabled and there are tracked tasks, but the flag check and task list check are synchronous on every workflow completion.
  • Best-effort only: both deletion paths can fail silently (errors are logged). A failure leaves the timer orphaned — which is the pre-existing behavior, not a regression.
  • Creation-time tracking only: timers added during workflow updates are not tracked. Timers created at workflow start (e.g. start-to-close timeout) are the primary target and are covered.

Detailed Description

New Cassandra schema migration (v0.47) adds two nullable blob columns to the executions table:

  • workflow_timer_tasks blob
  • workflow_timer_tasks_encoding text

These store a serialized list of workflow-level timer task references (task ID, visibility timestamp, timeout type) on the execution row. New WorkflowTimerTaskInfos field added to WorkflowMutableState. New DeleteTimerTask method added to ExecutionManager and ExecutionStore interfaces.

Impact Analysis

  • Backward Compatibility: Old servers reading rows written by new servers will ignore the new columns — no impact. Old servers writing rows will not populate the new columns, leaving them null — handled gracefully by the new code via nil checks.
  • Forward Compatibility: New servers reading old rows will find null values in the new columns and treat them as empty task lists — no cleanup will run, which is the same behavior as before this change.

Testing Plan

  • Unit Tests: Yes — persistence layer, serialization, timer executor, and execution context are covered.
  • Persistence Tests: No — no Cassandra persistence tests covering the new columns end-to-end.
  • Integration Tests: No.
  • Compatibility Tests: No.

Rollout Plan

  • Apply the Cassandra schema migration (v0.47) before deploying the new server binary. The new columns are additive and the old binary will continue to function normally against the updated schema.
  • Deploy the new server binary. The feature is inactive until the flag is enabled.
  • Enable system.enableOrphanedWorkflowTimerCleanup incrementally (per-cluster or canary) once the new binary is stable.
  • Rollback: Safe at any point. Disabling the flag stops all new cleanup activity. Rolling back the binary is safe — old servers ignore the new columns. The schema migration cannot be rolled back, but the new columns have no impact on old code.
  • Kill switch: system.enableOrphanedWorkflowTimerCleanup (default false).

Release notes
Workflow timer tasks are now cleaned up when a workflow closes early, preventing orphaned timer accumulation in the database.

Documentation Changes
system.enableOrphanedWorkflowTimerCleanup

  • Type: Bool
  • Default: false

Enables cleanup of orphaned workflow timer tasks when a workflow closes before its timers fire. Requires Cassandra schema v0.47. Safe to enable/disable at any time.

history.orphanedTimerDeletionMinTTL

  • Type: Duration
  • Default: 24h

Timers scheduled to fire within this window are skipped — they'll fire and clean up naturally. Only applies when system.enableOrphanedTimerCleanup is enabled.

@gitar-bot
Copy link
Copy Markdown

gitar-bot bot commented Apr 10, 2026

CI failed: Two CI failures: (1) integration tests flaking due to background goroutines logging after test completion in the global rate limiter, and (2) linting check failing because generated files were modified by the build process but not committed.

Overview

Analyzed 2 unique error templates across 2 CI logs. One failure is a change-related linting issue requiring committed generated file updates; the other is a flaky infrastructure/timing issue in integration tests unrelated to this PR's changes.

Failures

Integration Test Flakiness from Rate Limiter Goroutine Leaks (confidence: high)

  • Type: flaky_test
  • Affected jobs: 70870768781
  • Related to change: no
  • Root cause: Background goroutines in the global rate limiter collection (collection.go:335) continue executing after test completion, logging 'logged too late' warnings. Multiple test suites (TestTaskListIsolationSuite, TestDecisionTimeoutMaxAttemptsIntegrationSuite, TestSizeLimitIntegrationSuite, TestTaskListSuite, TestWorkflowIDRateLimitIntegrationSuite, TestIntegrationSuite) are affected. The backgroundUpdateLoop is not properly synchronized with test teardown, causing race conditions.
  • Suggested fix: Improve test teardown synchronization to drain all background goroutines before test completion. Add timeout-based context cancellation checks in Collection cleanup or ensure proper shutdown signaling for the backgroundUpdateLoop.

Generated Files Not Committed (confidence: high)

  • Type: build
  • Affected jobs: (linting/coverage job)
  • Related to change: yes
  • Root cause: After running make tidy, make fmt, and make lint, generated files were modified but not committed. The build process regenerates files (mocks, proto compilations, execution managers, data managers) that differ from what's committed in the PR.
  • Suggested fix: Run make tidy && make go-generate && make fmt && make lint locally, then commit all generated file changes (BufferedEvents, WorkflowTimerTasks serialization, mock updates, etc.) to the PR.

Summary

  • Change-related failures: 1 — linting check failure from uncommitted generated file changes
  • Infrastructure/flaky failures: 1 — integration test flakiness from background goroutine synchronization issue (pre-existing, unrelated to PR changes)
  • Recommended action: (1) Commit generated file changes to fix the build/lint failure. (2) The test flakiness appears to be an existing infrastructure issue; investigate separately or mark as known flaky in test suite configuration.
Code Review ✅ Approved 2 resolved / 2 findings

Fixes orphaned workflow timer task cleanup by addressing a nil function pointer panic in timer executor feature-flag check and adding defensive copies to prevent goroutine race conditions when reading mutable state.

✅ 2 resolved
Bug: Nil function pointer panic in timer executor feature-flag check

📄 service/history/task/timer_task_executor_base.go:370 📄 common/persistence/sql/sql_execution_store.go:1356-1364 📄 common/persistence/execution_manager.go:675-689
In timer_task_executor_base.go:370, EnableOrphanedTimerCleanup is called directly without a nil guard:

if !t.shard.GetConfig().EnableOrphanedTimerCleanup() {

However, EnableOrphanedTimerCleanup is a function pointer on the Config struct and can be nil (e.g., the test at context_test.go:1774 returns &config.Config{} with all zero fields). This will panic at runtime when the retention-based deletion path fires for any workflow.

The other two call sites correctly guard against nil:

  • context.go:941: cfg.EnableOrphanedTimerCleanup == nil || !cfg.EnableOrphanedTimerCleanup()
  • execution_manager.go:678: m.dc.EnableOrphanedTimerCleanup == nil || !m.dc.EnableOrphanedTimerCleanup()
Edge Case: Goroutine reads mutable state slice without defensive copy

📄 service/history/execution/context.go:944 📄 service/history/execution/context.go:957-971 📄 service/history/execution/mutable_state_builder_methods_timer.go:286-288
GetPendingWorkflowTimerTaskInfos() returns a direct reference to the internal e.workflowTimerTaskInfos slice. In deleteWorkflowTimerTasksBestEffortAsync, this slice is captured at line 944 and then read by a background goroutine at line 961. After UpdateWorkflowExecutionWithNew returns, the caller releases the context lock, and the mutable state can be cleared or reused. While currently safe because Clear() nils the mutableState pointer without mutating the slice backing array, this is fragile — any future code that appends to or modifies workflowTimerTaskInfos (e.g., when mutation-time tracking is added per the TODO at line 1480) could introduce a data race.

Rules ✅ All requirements met

Repository Rules

PR Description Quality Standards: The PR description meets all quality standards with clear WHAT/WHY/HOW sections, test commands, risk documentation, release notes, and technical rationale.

2 rules not applicable. Show all rules by commenting gitar display:verbose.

Tip

Comment Gitar fix CI or enable auto-apply: gitar auto-apply:on

Options

Auto-apply is off → Gitar will not commit updates to this branch.
Display: compact → Showing less information.

Comment with these commands to change:

Auto-apply Compact
gitar auto-apply:on         
gitar display:verbose         

Was this helpful? React with 👍 / 👎 | Gitar

Signed-off-by: fimanishi <fimanishi@gmail.com>
@fimanishi fimanishi force-pushed the trimming-timers-on-workflow-close branch from fb4ac53 to 8a77d18 Compare April 10, 2026 22:14
Comment thread service/history/task/timer_task_executor_base.go Outdated
Comment thread service/history/execution/context.go
- Add nil check for EnableOrphanedTimerCleanup in timer_task_executor_base.go
  consistent with the other two call sites in context.go and execution_manager.go
- Defensive copy of workflowTimerTaskInfos slice before spawning goroutine
  in deleteWorkflowTimerTasksBestEffortAsync to prevent potential data race
  if mutation-time tracking is added in the future

Signed-off-by: fimanishi <fimanishi@gmail.com>
Signed-off-by: fimanishi <fimanishi@gmail.com>
Signed-off-by: fimanishi <fimanishi@gmail.com>
Signed-off-by: fimanishi <fimanishi@gmail.com>
…est.go

Signed-off-by: fimanishi <fimanishi@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Timers / large-partition vulnerability with workflows with long timeouts & cron

1 participant