Report
Summary
When a new backup is in progress and a restore (from a previously completed backup) is created concurrently, both operations get stuck permanently. The restore enters an infinite `Waiting for restore metadata` loop, and the in-progress backup is incorrectly demoted to the `Waiting` state, losing track of the underlying PBM agent that is still running.
Environment
- percona-server-mongodb-operator
- Affected controllers: `perconaservermongodbbackup`, `perconaservermongodbrestore`
Steps to Reproduce
- Have a completed backup `backup-1` (state: `Ready`).
- Create a new backup `backup-2` — wait ~1 second until its PBM agent starts and acquires the PBM lock.
- While `backup-2` is still running, create `restore-1` referencing `backup-1`.
Observed Behavior
- `restore-1` gets stuck in state `Requested`, logging `Waiting for restore metadata` every 5 seconds, indefinitely.
- `backup-2` is demoted from `Running`/`Requested` to `Waiting`, even though the PBM agent is still actively uploading data. When the PBM agent eventually finishes, the operator never learns about it.
Expected Behavior
- `restore-1` should wait (state `Waiting`) until no active PBM backup lock is held, then proceed.
- `backup-2` should not have its operator state overridden while the underlying PBM operation is running.
Root Cause Analysis
Bug 1 — Restore sends `CmdRestore` while a backup PBM lock is held

In `reconcileLogicalRestore` (`logical.go:51`), before dispatching `CmdRestore`, the controller only checks for an active PITR lock:

```go
isBlockedByPITR, err := pbmc.HasLocks(ctx, backup.IsPITRLock)
```

There is no check for an active `CmdBackup` lock. Because `backup-1` (the source of the restore) is already in `Ready` state, the restore passes the `bcp.Status.State == BackupStateReady` guard and dispatches `CmdRestore` immediately.

On the PBM agent side, `a.Restore()` runs synchronously and attempts to acquire a PBM lock. Since `backup-2` already holds `pbmLock(CmdBackup)`, the restore fails to acquire its own lock and exits without writing any restore metadata to `pbmRestores`. The `CmdRestore` command is consumed and never retried.
On the next reconcile, `GetRestoreMeta(cr.Status.PBMname)` returns `nil` indefinitely:

```go
if meta == nil || meta.Name == "" {
	log.Info("Waiting for restore metadata", ...)
	return status, nil // state stays Requested, loops every 5s forever
}
```

No timeout or exit condition exists for the nil-metadata case — the restore stays in `Requested` state forever.
Bug 2 — In-progress backup state is overridden to `Waiting` by a concurrent restore CR

`HasActiveJobs` in the backup controller checks the K8s restore CR list first (`backup.go:58-64`):

```go
if r.Status.State != RestoreStateReady &&
	r.Status.State != RestoreStateError &&
	r.Status.State != RestoreStateWaiting {
	return true, nil // restore-1 is Requested → triggers this
}
```
`RestoreStateRequested` is not in the exclusion list, so the mere presence of `restore-1` (stuck in `Requested`) causes `HasActiveJobs` to return `true` for `backup-2`. The backup reconciler then unconditionally overrides the backup state:

```go
if cjobs {
	status.State = psmdbv1.BackupStateWaiting
	return status, nil
}
```

This downgrades `backup-2` from `Running`/`Requested` to `Waiting` — permanently, since `restore-1` never exits its stuck `Requested` state. The PBM agent finishes the backup successfully in the background, but the operator never polls `GetBackupMeta` again to record that result.
Impact
The two bugs reinforce each other into a permanent deadlock:

- `restore-1` is stuck because it sent `CmdRestore` while a backup lock was active.
- `backup-2` is stuck because `HasActiveJobs` keeps returning `true` due to `restore-1` being stuck.
- Neither operation ever transitions to `Ready` or `Error` without manual intervention.
Proposed Fixes
Fix — Block restore dispatch when a non-PITR PBM lock is active

In `reconcileLogicalRestore`, before calling `runRestore`, add a check for any active backup/restore PBM lock:

```go
isBlocked, err := pbmc.HasLocks(ctx, backup.NotPITRLock)
if err != nil {
	return status, errors.Wrap(err, "checking pbm locks")
}
if isBlocked {
	log.Info("Waiting for active PBM operation to complete before starting restore.")
	status.State = psmdbv1.RestoreStateWaiting
	return status, nil
}
```

This ensures `CmdRestore` is dispatched only when no conflicting PBM lock is held, preventing the unrecoverable nil-metadata state.
Versions
- Kubernetes
- Operator 1.22.0
- Database: MongoDB
Anything else?
No response