feat: Terminate corrupted workflow if repair is not possible#7878

Open
fimanishi wants to merge 9 commits into cadence-workflow:master from fimanishi:terminate-corrupted-workflows-if-no-repair-possible

Conversation

Member

@fimanishi fimanishi commented Apr 1, 2026

What changed?
Add workflow termination logic to terminate a workflow when corruption is detected but repair fails due to non-transient errors. There are two termination modes. Graceful termination is attempted first: it adds a termination event to history before terminating the workflow. If the graceful path fails, force termination terminates the workflow without adding the termination event.
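The two-mode fallback can be sketched roughly as follows. The function and error names here are illustrative stand-ins, not the real workflow_repairer.go implementation, which goes through shard and persistence APIs:

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical stand-in for a failure of the graceful path.
var errHistoryUnreadable = errors.New("history unreadable")

// terminateWithHistoryEvent stands in for the graceful path: it appends a
// termination event to history before closing the execution.
func terminateWithHistoryEvent(historyReadable bool) error {
	if !historyReadable {
		return errHistoryUnreadable
	}
	fmt.Println("appended termination event; workflow terminated")
	return nil
}

// forceCloseWorkflow stands in for the fallback: close the execution record
// without writing a termination event to history.
func forceCloseWorkflow() error {
	fmt.Println("workflow force-closed without termination event")
	return nil
}

// terminateCorruptedWorkflow tries graceful termination first and falls back
// to a force close, mirroring the order described in the PR.
func terminateCorruptedWorkflow(historyReadable bool) error {
	if err := terminateWithHistoryEvent(historyReadable); err == nil {
		return nil
	}
	return forceCloseWorkflow()
}

func main() {
	_ = terminateCorruptedWorkflow(true)
	_ = terminateCorruptedWorkflow(false)
}
```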

Why?
If we can't repair a corrupted workflow, attempts to update and consequently repair the workflow would keep failing. We need a way to stop these attempts. Terminating the workflow provides one while not affecting the workflow replayer/shadow logic.

How did you test it?
go test -race -run TestWorkflowRepairer ./service/history/execution/...
go test -race -run TestNewConfig ./service/history/config/...

Local setup test
./cadence-server --zone xdc_cluster0 start
./cadence-server --zone xdc_cluster1 start
./cadence-server --zone xdc_cluster2 start

Register domain
Create workflow
Corrupt the workflow by, for example, changing the next_event_id column and deleting a history_node entry
Verify that the workflow was terminated

Potential risks
There is a potential risk of terminating workflows that should not be terminated. The code tries to detect only real corruption and, when corruption is detected, avoids terminating workflows if the repair path returned transient errors. The feature is also gated behind a feature flag to allow gradual rollout.

Release notes
Feature to terminate corrupted workflows that cannot be repaired, with fallback to force-close if proper termination fails, and skip termination on transient errors to avoid data loss.

Documentation Changes
N/A


Summary by Gitar

  • Corruption handling:
    • Added forced termination of workflows when repair fails if EnableCorruptionForcedTermination is enabled
    • Refactored repair logic into attemptRepairByType() for better extensibility
  • New error types:
    • Added ErrWorkflowTerminatedDueToCorruption and ErrRepairAndTerminationFailed error constants
  • Configuration:
    • Added EnableCorruptionForcedTermination dynamic config property (domain-filtered, default false)
  • Metrics:
    • Added WorkflowCorruptionTerminationAttempted, WorkflowCorruptionTerminationSuccess, WorkflowCorruptionTerminationFailure, WorkflowCorruptionForcedCloseAttempted, WorkflowCorruptionForcedCloseSuccess
  • Defaults updated:
    • Changed MutableStateChecksumGenProbability and MutableStateChecksumVerifyProbability defaults from 0 to 100

This will update automatically on new commits.

@fimanishi fimanishi force-pushed the terminate-corrupted-workflows-if-no-repair-possible branch from f2c343b to ae12587 Compare April 3, 2026 18:37
@fimanishi fimanishi force-pushed the terminate-corrupted-workflows-if-no-repair-possible branch from 731586e to c4d64c2 Compare April 3, 2026 23:16
@fimanishi fimanishi marked this pull request as ready for review April 6, 2026 23:59
@gitar-bot

gitar-bot bot commented Apr 7, 2026

CI failed: Test timeout in BatchWorkflowV2 workflow test — the test is blocking on a selector operation and exceeding the 3-second timeout, causing a panic. This appears to be a flaky timing issue in the test suite rather than a deterministic failure related to the workflow termination changes.

Overview

One unique failure pattern detected across 1 log analyzed. A test timeout occurred in the BatchWorkflowV2 workflow test during execution, causing the test suite to fail with exit code 2. The failure is characterized by the workflow being blocked on a selector operation and not completing within the allocated time window.

Failures

BatchWorkflowV2 Test Timeout (confidence: medium)

  • Type: flaky_test
  • Affected jobs: 69905176748
  • Related to change: unclear
  • Root cause: Test BatchWorkflowV2 in workflow_v2.go:109 exceeds the 3-second timeout while blocked on a selector operation. The workflow appears to be waiting for activity results or signals but is not completing within the allocated time. This is a timing-sensitive issue rather than a deterministic logic error.
  • Suggested fix: (1) Increase test timeout duration to accommodate workflow completion latency; (2) Review BatchWorkflowV2 test setup to ensure all mocked activities complete promptly; (3) Check if recent changes to workflow repair logic introduced blocking behavior that delays test completion; (4) Add explicit timeouts or cancellation logic to prevent indefinite blocking in selector operations.

Summary

  • Change-related failures: Unclear — the timeout could be exacerbated by the workflow termination/repair logic changes, but the root cause appears to be a timing-sensitive test issue.
  • Infrastructure/flaky failures: 1 — test timeout in BatchWorkflowV2, likely a pre-existing flaky test rather than a direct consequence of the PR changes.
  • Recommended action: Investigate whether the test timeout is new (introduced by this PR) or pre-existing. If new, profile the workflow execution to identify any blocking or slowdown caused by the repair/termination logic. If pre-existing, consider increasing test timeouts or improving test reliability separately.
Code Review ✅ Approved 9 resolved / 9 findings

Adds logic to terminate corrupted workflows when repair is not possible, addressing nine issues including missing metrics, transient error handling, stale event IDs, and incomplete workflow cleanup. All findings have been resolved.

✅ 9 resolved
Bug: forceCloseWorkflow overwrites VersionHistories with empty data

📄 service/history/execution/workflow_repairer.go:561-575
The forceCloseWorkflow function creates a minimal WorkflowMutation that omits VersionHistories. The serialization layer converts this to an empty DataBlob{} (via ToNilSafeDataBlob), which means the existing VersionHistories in the DB will be overwritten with empty data. While this won't cause a panic, it corrupts replication metadata — any subsequent replication or version-history-dependent operation on this workflow will fail or behave incorrectly. Since this is a last-resort path for already-corrupted workflows, the impact is somewhat mitigated, but it could make post-mortem investigation or replication cleanup harder.
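One possible shape of the fix, sketched with stand-in structs; the field names approximate the persistence types named in the finding, not the real API:

```go
package main

import "fmt"

// Stand-in for the persistence VersionHistories blob (illustrative only).
type VersionHistories struct{ Histories []string }

// Stand-in for the execution info carried in a mutation.
type ExecutionInfo struct{ State string }

// Stand-in for persistence.WorkflowMutation.
type WorkflowMutation struct {
	ExecutionInfo    *ExecutionInfo
	VersionHistories *VersionHistories
}

// buildForceCloseMutation carries the previously loaded VersionHistories into
// the minimal mutation so the conditional update does not overwrite the
// stored replication metadata with an empty blob.
func buildForceCloseMutation(existing *VersionHistories) *WorkflowMutation {
	return &WorkflowMutation{
		ExecutionInfo:    &ExecutionInfo{State: "completed"},
		VersionHistories: existing, // preserve, never reset to nil/empty
	}
}

func main() {
	vh := &VersionHistories{Histories: []string{"branch-0"}}
	m := buildForceCloseMutation(vh)
	fmt.Println(m.VersionHistories == vh)
}
```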

Bug: Checksum default values changed from 0 to 100 — enables globally

📄 common/dynamicconfig/dynamicproperties/constants.go:4136 📄 common/dynamicconfig/dynamicproperties/constants.go:4142
The default values for MutableStateChecksumGenProbability and MutableStateChecksumVerifyProbability were changed from 0 (disabled) to 100 (always enabled). This means every mutable state load will now generate and verify checksums by default across all domains and all deployments. This is a significant behavioral change that:

  1. Increases CPU/latency on every mutable state operation (checksum generation + verification).
  2. Activates the entire corruption-detection pipeline, which was previously opt-in.
  3. Combined with EnableCorruptionForcedTermination, could lead to workflows being terminated if transient checksum mismatches occur.

If this is intentional, it should be called out in the PR description and release notes. If not, it may cause unexpected behavior in production.

Quality: Missing WorkflowCorruptionForcedCloseFailure metric

📄 service/history/execution/workflow_repairer.go:485-488 📄 common/metrics/defs.go:2804-2808
The termination path has a complete set of metrics (Attempted/Success/Failure), but the forced-close fallback path only has Attempted and Success — there is no WorkflowCorruptionForcedCloseFailure metric. When forceCloseWorkflow fails (line 485-488), no specific failure counter is emitted, making it harder to monitor this failure mode in dashboards and alerts.

Bug: isRepairErrorTransient missing ConditionFailedError and ShardOwnershipLostError

📄 service/history/execution/workflow_repairer.go:234-241
The isRepairErrorTransient function only checks for context.DeadlineExceeded, context.Canceled, persistence.TimeoutError, and types.ServiceBusyError. It does not classify persistence.ConditionFailedError or persistence.ShardOwnershipLostError as transient.

ConditionFailedError occurs when a conditional DB write fails due to a NextEventID or RunID mismatch — a transient concurrency conflict that resolves on retry. ShardOwnershipLostError occurs when the shard's RangeID has changed (e.g., after a shard re-acquisition), also a transient condition.

Both errors can be returned during the rebuild path (e.g., from StateRebuilder.Rebuild which reads history, or from UpdateWorkflowExecution if the repair path persists). Since these are not classified as transient, they will cause terminateCorruptedWorkflow to be called — permanently terminating a workflow that could have been repaired on the next retry. This is a data-loss scenario for workflows that are not actually permanently corrupted.

Edge Case: Orphaned history events if UpdateWorkflowExecution fails after append

📄 service/history/execution/workflow_repairer.go:492-506 📄 service/history/execution/workflow_repairer.go:555-569
In terminateWithHistoryEvent, history events are appended first (lines 556-563) and then UpdateWorkflowExecution is called (lines 566-570). These are independent DB operations with no transactional guarantee.

If AppendHistoryV2Events succeeds but UpdateWorkflowExecution fails, the termination event is persisted in history but the execution state is not updated. The code then falls through to forceCloseWorkflow, which writes the execution as terminated without a corresponding history event (it skips history entirely). This creates an inconsistent state: the history tree has an orphaned termination event at an EventID that the execution record doesn't reference, and on the next read the history may appear corrupted (extra event beyond NextEventID).

The normal workflow engine path (context.go) follows the same append-then-update pattern and accepts this risk, so this may be considered acceptable. However, in the force-close fallback you could at least avoid the inconsistency by not falling back to forceCloseWorkflow when history was already appended, or by returning the UpdateWorkflowExecution error directly to let the caller retry (the appended events are idempotent on retry since NextEventID hasn't moved).

...and 4 more resolved from earlier reviews

Rules ❌ No requirements met

Repository Rules

GitHub Issue Linking Requirement: Add an issue link to the PR description in one of the accepted formats (e.g., 'Fixes #XXXX' or 'Addresses cadence-workflow/cadence#XXXX').
PR Description Quality Standards: Add a GitHub issue link to the 'What changed?' section to reference the relevant issue for this feature, as required by the PR template.

1 rule not applicable. Show all rules by commenting gitar display:verbose.


return r.repairViaRebuild(repairCtx, mutableState, persistedChecksum)
repairErr := r.attemptRepairByType(repairCtx, mutableState, corruptionType, persistedChecksum)
if repairErr != nil {
if r.shard.GetConfig().EnableCorruptionForcedTermination(domainName) && !isRepairErrorTransient(repairErr) {
Member

Use an allowlist of errors instead of a denylist to prevent unexpected terminations.

return err
}

// Use TransactionPolicyPassive to avoid generating transfer/timer tasks locally.
Member

It only avoids generating replication tasks. Besides, we need timer tasks to delete the workflow.

The following comments need to be updated as well.

}

_, err = r.shard.UpdateWorkflowExecution(ctx, &persistence.UpdateWorkflowExecutionRequest{
Mode: persistence.UpdateWorkflowModeIgnoreCurrent,
Member

Note: if the workflow is the current workflow, the current execution record is not updated, which may prevent a new workflow from being started.

// GetCompletionEvent can return it without a history read. This is the same mechanism used for
// workflows written before the batch ID scheme was introduced.
closeTime := time.Now()
info.CompletionEvent = &types.HistoryEvent{
Member

It feels risky to use a deprecated field. Can we remove this?

// Used as a last resort when proper termination fails (e.g. history is unreadable).
// expectedNextEventID must be the NextEventID from the DB before any in-memory mutations by
// terminateWithHistoryEvent, so the persistence condition check matches the current DB state.
func (r *workflowRepairerImpl) forceCloseWorkflow(
Member

In what scenario could this case apply? When history is not readable, that doesn't seem to prevent us from adding the termination event; that logic doesn't seem to read history.

