feat: Terminate corrupted workflow if repair is not possible#7878
feat: Terminate corrupted workflow if repair is not possible#7878fimanishi wants to merge 9 commits intocadence-workflow:masterfrom
Conversation
f2c343b to
ae12587
Compare
Signed-off-by: fimanishi <fimanishi@gmail.com>
Signed-off-by: fimanishi <fimanishi@gmail.com>
…to transient issues Signed-off-by: fimanishi <fimanishi@gmail.com>
Signed-off-by: fimanishi <fimanishi@gmail.com>
…llback Signed-off-by: fimanishi <fimanishi@gmail.com>
Signed-off-by: fimanishi <fimanishi@gmail.com>
…utionTask from failing Signed-off-by: fimanishi <fimanishi@gmail.com>
Signed-off-by: fimanishi <fimanishi@gmail.com>
Signed-off-by: fimanishi <fimanishi@gmail.com>
731586e to
c4d64c2
Compare
CI failed: Test timeout in BatchWorkflowV2 workflow test — the test is blocking on a selector operation and exceeding the 3-second timeout, causing a panic. This appears to be a flaky timing issue in the test suite rather than a deterministic failure related to the workflow termination changes.OverviewOne unique failure pattern detected across 1 log analyzed. A test timeout occurred in the BatchWorkflowV2 workflow test during execution, causing the test suite to fail with exit code 2. The failure is characterized by the workflow being blocked on a selector operation and not completing within the allocated time window. FailuresBatchWorkflowV2 Test Timeout (confidence: medium)
Summary
Code Review ✅ Approved 9 resolved / 9 findingsAdds logic to terminate corrupted workflows when repair is not possible, addressing nine issues including missing metrics, transient error handling, stale event IDs, and incomplete workflow cleanup. All findings have been resolved. ✅ 9 resolved✅ Bug: forceCloseWorkflow overwrites VersionHistories with empty data
✅ Bug: Checksum default values changed from 0 to 100 — enables globally
✅ Quality: Missing WorkflowCorruptionForcedCloseFailure metric
✅ Bug: isRepairErrorTransient missing ConditionFailedError and ShardOwnershipLostError
✅ Edge Case: Orphaned history events if UpdateWorkflowExecution fails after append
...and 4 more resolved from earlier reviews Rules ❌ No requirements metRepository Rules
1 rule not applicable. Show all rules by commenting Tip Comment OptionsAuto-apply is off → Gitar will not commit updates to this branch. Comment with these commands to change:
Was this helpful? React with 👍 / 👎 | Gitar |
| return r.repairViaRebuild(repairCtx, mutableState, persistedChecksum) | ||
| repairErr := r.attemptRepairByType(repairCtx, mutableState, corruptionType, persistedChecksum) | ||
| if repairErr != nil { | ||
| if r.shard.GetConfig().EnableCorruptionForcedTermination(domainName) && !isRepairErrorTransient(repairErr) { |
There was a problem hiding this comment.
use an allowlist of errors instead of denylist to prevent unexpected terminations
| return err | ||
| } | ||
|
|
||
| // Use TransactionPolicyPassive to avoid generating transfer/timer tasks locally. |
There was a problem hiding this comment.
It only avoid generating replication tasks. Besides, we need timer tasks to delete the workflow.
The following comments need to be updated as well.
| } | ||
|
|
||
| _, err = r.shard.UpdateWorkflowExecution(ctx, &persistence.UpdateWorkflowExecutionRequest{ | ||
| Mode: persistence.UpdateWorkflowModeIgnoreCurrent, |
There was a problem hiding this comment.
Note: if the workflow is current workflow, the current execution record is not updated and may prevent new workflow being started
| // GetCompletionEvent can return it without a history read. This is the same mechanism used for | ||
| // workflows written before the batch ID scheme was introduced. | ||
| closeTime := time.Now() | ||
| info.CompletionEvent = &types.HistoryEvent{ |
There was a problem hiding this comment.
It feels risky to use a deprecated field. Can we remove this?
| // Used as a last resort when proper termination fails (e.g. history is unreadable). | ||
| // expectedNextEventID must be the NextEventID from the DB before any in-memory mutations by | ||
| // terminateWithHistoryEvent, so the persistence condition check matches the current DB state. | ||
| func (r *workflowRepairerImpl) forceCloseWorkflow( |
There was a problem hiding this comment.
In what scenario, this case could apply? When history is not readable, it doesn't seem to prevent us from adding termination event, that logic doesn't seem to read history.
What changed?
Add workflow termination logic to terminate workflow if corruption is detected but repair fails due to non transient errors. There are two termination modes. Graceful is attempted first. It tries to add the termination event before terminating the workflow. If graceful fails, force termination only terminates the workflow without adding the termination event.
Why?
If we can't repair a corrupted workflow, attempts to update and consequentially repair the workflow would continue to happen unsuccessfully. We need a way to stop these attempts. Terminating it provides a way to do it while not affecting the workflow replayer/shadow logic.
How did you test it?
go test -race -run TestWorkflowRepairer ./service/history/execution/...
go test -race -run TestNewConfig ./service/history/config/...
Local setup test
./cadence-server --zone xdc_cluster0 start
./cadence-server --zone xdc_cluster1 start
./cadence-server --zone xdc_cluster2 start
Register domain
Create workflow
Corrupt workflow by changing for example, next_event_id column and deleting a history_node entry
Verify that the workflow was terminated
Potential risks
There is a potential risk to terminate workflows that should not terminate. The code tries its best to detect real corruptions and to avoid, when a corruption is detected, to terminate workflows when the repair path returned transient errors. This feature is also protected behind a feature flag to allow for gradual deployment.
Release notes
Feature to terminate corrupted workflows that cannot be repaired, with fallback to force-close if proper termination fails, and skip termination on transient errors to avoid data loss.
Documentation Changes
N/A
Summary by Gitar
EnableCorruptionForcedTerminationis enabledattemptRepairByType()for better extensibilityErrWorkflowTerminatedDueToCorruptionandErrRepairAndTerminationFailederror constantsEnableCorruptionForcedTerminationdynamic config property (domain-filtered, default false)WorkflowCorruptionTerminationAttempted,WorkflowCorruptionTerminationSuccess,WorkflowCorruptionTerminationFailure,WorkflowCorruptionForcedCloseAttempted,WorkflowCorruptionForcedCloseSuccessMutableStateChecksumGenProbabilityandMutableStateChecksumVerifyProbabilitydefaults from 0 to 100This will update automatically on new commits.