K8SPSMDB-1643: prevent concurrent backup and restore operations #2318
Vinh1507 wants to merge 1 commit into percona:main
commit: 8c4144e
mayankshah1607 left a comment:
Thanks for the quick fix. I believe we need to perform the same checks for physical restore as well.
        return status, nil
    }

    hasActiveLocks, err := pbmc.HasLocks(ctx)
Let's add a new backup.IsBackupLock predicate, like we do with the previous pbmc.HasLocks call.

Yeah, otherwise restores would be blocked whenever the oplog slicer for PiTR is running.
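
The predicate idea in this thread can be illustrated with a minimal sketch. Everything below is hypothetical and self-contained: the Command values, the LockHeader type, and this HasLocks signature are stand-ins chosen for illustration, not the operator's or PBM's actual definitions. The sketch assumes that HasLocks without predicates counts every lock, and that passing a predicate narrows the check to matching locks only.

```go
package main

// Command is a hypothetical stand-in for a PBM command type.
type Command string

const (
	CmdBackup Command = "backup" // assumed value, for illustration only
	CmdPITR   Command = "pitr"   // assumed value, for illustration only
)

// LockHeader is a trimmed-down, hypothetical lock description.
type LockHeader struct {
	Type Command
}

// LockPredicate selects the locks a caller cares about.
type LockPredicate func(LockHeader) bool

// IsPITRLock matches locks held by the PiTR oplog slicer.
func IsPITRLock(l LockHeader) bool { return l.Type == CmdPITR }

// IsBackupLock matches locks held by a running backup.
func IsBackupLock(l LockHeader) bool { return l.Type == CmdBackup }

// HasLocks reports whether any lock matches at least one predicate.
// With no predicates, any existing lock counts.
func HasLocks(locks []LockHeader, preds ...LockPredicate) bool {
	for _, l := range locks {
		if len(preds) == 0 {
			return true
		}
		for _, p := range preds {
			if p(l) {
				return true
			}
		}
	}
	return false
}
```

Under these assumptions, HasLocks(locks) returns true while only the PiTR slicer holds a lock, whereas HasLocks(locks, IsBackupLock) returns false in that case, so the restore would not be held back by the slicer.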
    leaseActive, err := k8s.IsLeaseActive(ctx, r.client, naming.BackupLeaseName(cluster.Name), cluster.Namespace)
    if err != nil {
        return status, errors.Wrap(err, "check backup lease")
    }
    if leaseActive {
        log.Info("Waiting for active backup to complete before starting restore.")
        status.State = psmdbv1.RestoreStateWaiting
        return status, nil
    }

    hasActiveLocks, err := pbmc.HasLocks(ctx)
    if err != nil {
        return status, errors.Wrap(err, "checking pbm locks")
    }
    if hasActiveLocks {
        log.Info("Waiting for active PBM operation to complete.")
        status.State = psmdbv1.RestoreStateWaiting
        return status, nil
    }
Like @mayankshah1607 said, this should be moved to the high-level restore Reconcile function so that it applies to all kinds of restores.
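
One possible shape for that refactoring is sketched below. It is only an illustration under stated assumptions: State, Guard, CheckGuards, and this Reconcile signature are hypothetical names, not the operator's API. The idea is that the shared guards run once, before the type-specific restore (logical or physical) is started.

```go
package main

import "errors"

// State is a hypothetical stand-in for the restore status state.
type State string

const (
	StateWaiting State = "waiting"
	StateRunning State = "running"
)

// Guard reports whether a restore must wait, and why.
type Guard func() (blocked bool, reason string, err error)

// CheckGuards runs the guards in order and stops at the first one
// that blocks or fails.
func CheckGuards(guards ...Guard) (bool, string, error) {
	for _, g := range guards {
		blocked, reason, err := g()
		if err != nil {
			return false, "", err
		}
		if blocked {
			return true, reason, nil
		}
	}
	return false, "", nil
}

// Reconcile applies the shared guards and only then calls start,
// which stands for the logical or physical restore entry point.
func Reconcile(start func() error, guards ...Guard) (State, error) {
	blocked, _, err := CheckGuards(guards...)
	if err != nil {
		return "", err
	}
	if blocked {
		return StateWaiting, nil
	}
	if start == nil {
		return "", errors.New("no restore handler")
	}
	if err := start(); err != nil {
		return "", err
	}
	return StateRunning, nil
}
```

With this layout, a lease check and a lock check could each be wrapped as a Guard and passed to Reconcile, so neither restore type can skip them.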
        return status, errors.Wrap(err, "check backup lease")
    }
    if leaseActive {
        log.Info("Waiting for active backup to complete before starting restore.")
I would put the lease name in this log message.
CHANGE DESCRIPTION
Problem:
When a restore CR is created while a backup is running (or just starting), the operator blocks indefinitely on "Waiting for restore metadata".
Cause:
The original lock guard only checked for a PITR lock (IsPITRLock). A running backup holds a different lock type, so the check always returned false during an active backup and CmdRestore was sent unconditionally. PBM returns a ConcurrentOpError, writes nothing to pbmRestores, and the operator waits forever.
Solution:
Three guards are applied in sequence in pkg/controller/perconaservermongodbrestore/logical.go before CmdRestore is sent:
CHECKLIST
Jira
Needs Doc) and QA (Needs QA)?
Tests
compare/*-oc.yml)?
Config/Logging/Testability
coordination.k8s.io/v1 Lease, which has been stable since Kubernetes 1.14.
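
To make the lease-based guard concrete, here is a minimal sketch of what an "is the lease active" check might look like. The Lease struct is a hypothetical, trimmed-down stand-in for the holderIdentity, renewTime, and leaseDurationSeconds fields of a coordination.k8s.io/v1 Lease spec, and this IsLeaseActive is an assumption about the semantics, not the k8s.IsLeaseActive helper used in the PR.

```go
package main

import "time"

// Lease is a hypothetical stand-in for the relevant fields of a
// coordination.k8s.io/v1 Lease spec.
type Lease struct {
	HolderIdentity       string
	RenewTime            time.Time
	LeaseDurationSeconds int32
}

// IsLeaseActive reports whether the lease has a holder and has not
// yet expired at the given time. A nil lease is treated as inactive.
func IsLeaseActive(l *Lease, now time.Time) bool {
	if l == nil || l.HolderIdentity == "" {
		return false
	}
	expiry := l.RenewTime.Add(time.Duration(l.LeaseDurationSeconds) * time.Second)
	return now.Before(expiry)
}
```

Passing the current time in as a parameter keeps the check easy to test. A restore reconciler following this sketch would stay in the waiting state for as long as the function returns true for the backup lease.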