Deadlock: Concurrent backup and restore operations cause both to hang indefinitely #2317

@Vinh1507

Report

Summary

When a new backup is in progress and a restore (from a previously completed backup) is created concurrently, both operations get stuck permanently. The restore loops forever on "Waiting for restore metadata", and the in-progress backup is incorrectly demoted to the Waiting state, so the operator loses track of the underlying PBM agent, which is still running.

Environment

  • percona-server-mongodb-operator
  • Affected controllers: perconaservermongodbbackup, perconaservermongodbrestore

Steps to Reproduce

  1. Have a completed backup backup-1 (state: Ready).
  2. Create a new backup backup-2 and wait about 1 second, until its PBM agent starts and acquires the PBM lock.
  3. While backup-2 is still running, create restore-1 referencing backup-1.

Observed Behavior

  • restore-1 gets stuck in state Requested, logging "Waiting for restore metadata" every 5 seconds indefinitely.
  • backup-2 is demoted from Running/Requested to the Waiting state, even though the PBM agent is still actively uploading data. When the PBM agent eventually finishes, the operator never learns about it.

Expected Behavior

  • restore-1 should wait (state Waiting) until no active PBM backup lock is held, then proceed.
  • backup-2 should not have its operator state overridden while the underlying PBM operation is running.

Root Cause Analysis

Bug 1 — Restore sends CmdRestore while a backup PBM lock is held

In reconcileLogicalRestore (logical.go:51), before dispatching CmdRestore, the controller only checks for an active PITR lock:

isBlockedByPITR, err := pbmc.HasLocks(ctx, backup.IsPITRLock)

There is no check for an active CmdBackup lock. Because backup-1 (the source of the restore) is already in Ready state, the restore passes the bcp.Status.State == BackupStateReady guard and dispatches CmdRestore immediately.

On the PBM agent side, a.Restore() runs synchronously and attempts to acquire a PBM lock. Since backup-2 already holds pbmLock(CmdBackup), the restore operation fails to acquire its lock and exits without writing any restore metadata to pbmRestores. The CmdRestore command is consumed and never retried.
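The agent-side failure mode described above can be illustrated with a toy model. This is only a sketch under stated assumptions: `opLock`, `agent`, and `handleRestore` are hypothetical names invented for illustration and are not PBM's actual types or functions. It shows that a command consumed while the lock is held leaves no restore metadata behind.

```go
package main

import "fmt"

// opLock is a toy stand-in for the single cluster-wide PBM operation lock.
// (Hypothetical type; not the real PBM lock implementation.)
type opLock struct{ holder string }

// tryAcquire takes the lock only if nobody currently holds it.
func (l *opLock) tryAcquire(op string) bool {
	if l.holder != "" {
		return false
	}
	l.holder = op
	return true
}

// agent models an agent that consumes each command exactly once.
type agent struct {
	lock        *opLock
	restoreMeta map[string]string // restore name -> status
}

// handleRestore mimics the reported failure mode: on a lock conflict it
// returns without recording any metadata, and the command is not re-queued.
func (a *agent) handleRestore(name string) {
	if !a.lock.tryAcquire("restore") {
		return
	}
	a.restoreMeta[name] = "running"
}

func main() {
	lock := &opLock{}
	lock.tryAcquire("backup") // backup-2 is already running

	a := &agent{lock: lock, restoreMeta: map[string]string{}}
	a.handleRestore("restore-1")

	_, ok := a.restoreMeta["restore-1"]
	fmt.Println("restore metadata written:", ok)
}
```

In this model the restore command is lost as soon as it is handled, which is why the operator has nothing to observe afterwards.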

On the next reconcile, GetRestoreMeta(cr.Status.PBMname) returns nil indefinitely:

if meta == nil || meta.Name == "" {
    log.Info("Waiting for restore metadata", ...)
    return status, nil  // state stays Requested, loops every 5s forever
}

No timeout or exit condition exists for the nil-metadata case — the restore stays in Requested state forever.

Bug 2 — In-progress backup state is overridden to Waiting by a concurrent restore CR

HasActiveJobs in the backup controller checks the K8s restore CR list first (backup.go:58-64):

if r.Status.State != RestoreStateReady &&
   r.Status.State != RestoreStateError &&
   r.Status.State != RestoreStateWaiting {
    return true, nil  // restore-1 is Requested → triggers this
}

RestoreStateRequested is not in the exclusion list, so the mere presence of restore-1 (stuck in Requested) causes HasActiveJobs to return true for backup-2. The backup reconciler then unconditionally overrides the backup state:

if cjobs {
    status.State = psmdbv1.BackupStateWaiting
    return status, nil
}

This downgrades backup-2 from Running/Requested to Waiting — permanently, since restore-1 never exits its stuck Requested state. The PBM agent finishes the backup successfully in the background, but the operator never polls GetBackupMeta again to record that result.
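The effect of the exclusion list can be checked in isolation. The sketch below reproduces only the shape of the condition quoted above; the `restoreState` type, the constant names, and their string values are assumptions made for the example rather than the operator's definitions.

```go
package main

import "fmt"

// restoreState and the constants below are hypothetical stand-ins for the
// restore states named in this report.
type restoreState string

const (
	stateRequested restoreState = "requested"
	stateWaiting   restoreState = "waiting"
	stateReady     restoreState = "ready"
	stateError     restoreState = "error"
)

// countsAsActive mirrors the exclusion-list shape quoted above: every state
// other than ready, error, or waiting is treated as an active job.
func countsAsActive(s restoreState) bool {
	return s != stateReady && s != stateError && s != stateWaiting
}

func main() {
	for _, s := range []restoreState{stateRequested, stateWaiting, stateReady, stateError} {
		fmt.Printf("%-9s active=%v\n", s, countsAsActive(s))
	}
}
```

Only the requested state is reported as active here, so a restore that never leaves Requested keeps blocking every backup reconcile.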

Impact

The two bugs reinforce each other into a permanent deadlock:

  • restore-1 is stuck because it sent CmdRestore while a backup lock was active.
  • backup-2 is stuck because HasActiveJobs keeps returning true due to restore-1 being stuck.
  • Neither operation ever transitions to Ready or Error without manual intervention.
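How the two bugs lock each other in place can be seen in a small simulation. This is a deliberately simplified model, not the operator's reconcile code: the `cluster` struct and both reconcile functions are hypothetical, and the backup's "otherwise ready" branch stands in for the normal metadata polling path.

```go
package main

import "fmt"

// cluster holds the toy state shared by both reconcilers.
type cluster struct {
	restoreState   string // state of restore-1
	backupState    string // state of backup-2
	restoreMetaSet bool   // whether the agent ever wrote restore metadata
}

// reconcileRestore leaves the restore untouched until metadata appears.
func reconcileRestore(c *cluster) {
	if !c.restoreMetaSet {
		return // corresponds to "Waiting for restore metadata"
	}
	c.restoreState = "ready"
}

// reconcileBackup demotes the backup whenever the restore looks active.
// The final branch is a simplification of the normal completion path.
func reconcileBackup(c *cluster) {
	active := c.restoreState != "ready" &&
		c.restoreState != "error" &&
		c.restoreState != "waiting"
	if active {
		c.backupState = "waiting"
		return
	}
	c.backupState = "ready"
}

func main() {
	// The restore command was lost, so restoreMetaSet stays false.
	c := &cluster{restoreState: "requested", backupState: "running"}
	for i := 0; i < 5; i++ {
		reconcileRestore(c)
		reconcileBackup(c)
	}
	fmt.Println(c.restoreState, c.backupState) // prints "requested waiting"
}
```

No number of reconcile rounds changes the outcome, because each controller's exit condition depends on the other one making progress.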

Proposed Fixes

Fix — Block restore dispatch when a non-PITR PBM lock is active

In reconcileLogicalRestore, before calling runRestore, add a check for any active backup/restore PBM lock:

isBlocked, err := pbmc.HasLocks(ctx, backup.NotPITRLock)
if err != nil {
    return status, errors.Wrap(err, "checking pbm locks")
}
if isBlocked {
    log.Info("Waiting for active PBM operation to complete before starting restore.")
    status.State = psmdbv1.RestoreStateWaiting
    return status, nil
}

This ensures CmdRestore is only dispatched when no conflicting PBM lock is held, preventing the unrecoverable nil-metadata state.

Versions

  1. Kubernetes
  2. Operator 1.22.0
  3. Database: MongoDB
