Deadlock: Concurrent backup and restore operations cause both to hang indefinitely

### Report

### Summary

When a new backup is in progress and a restore (from a previously completed backup) is created concurrently, both operations get stuck permanently. The restore enters an infinite `Waiting for restore metadata` loop, and the in-progress backup is incorrectly demoted to `Waiting` state, losing tracking of the underlying PBM agent that is still running.

### Environment

- percona-server-mongodb-operator
- Affected controllers: `perconaservermongodbbackup`, `perconaservermongodbrestore`

### Steps to Reproduce

1. Have a completed backup `backup-1` (state: `Ready`).
2. Create a new backup `backup-2` — wait ~1 second until its PBM agent starts and acquires the PBM lock.
3. While `backup-2` is still running, create `restore-1` referencing `backup-1`.

### Observed Behavior

- `restore-1` gets stuck in state `Requested` and logs `Waiting for restore metadata` every 5 seconds, indefinitely.
- `backup-2` is demoted from `Running`/`Requested` to `Waiting` state, even though the PBM agent is still actively uploading data. When the PBM agent eventually finishes, the operator never learns about it.

### Expected Behavior

- `restore-1` should wait (state `Waiting`) until no active PBM backup lock is held, then proceed.
- `backup-2` should not have its operator state overridden while the underlying PBM operation is running.

### Root Cause Analysis

**Bug 1 — Restore sends `CmdRestore` while a backup PBM lock is held**

In `reconcileLogicalRestore` (logical.go:51), before dispatching `CmdRestore`, the controller only checks for an active PITR lock:

```go
isBlockedByPITR, err := pbmc.HasLocks(ctx, backup.IsPITRLock)
```

There is no check for an active `CmdBackup` lock. Because `backup-1` (the source of the restore) is already in `Ready` state, the restore passes the `bcp.Status.State == BackupStateReady` guard and dispatches `CmdRestore` immediately.

On the PBM agent side, `a.Restore()` runs synchronously and attempts to acquire a PBM lock. Since `backup-2` already holds `pbmLock(CmdBackup)`, the restore operation fails to acquire its lock and exits without writing any restore metadata to `pbmRestores`. The `CmdRestore` command is consumed and never retried.
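The failure mode described above can be sketched with a toy model. This is not PBM code: `opLock`, `agent`, `restoreMeta`, and `handleRestore` are hypothetical names used only to show how a consumed command plus a failed lock attempt leaves no metadata behind.

```go
package main

import "fmt"

// opLock is a hypothetical stand-in for PBM's cluster-wide operation lock:
// at most one command may hold it at a time.
type opLock struct{ holder string }

// tryAcquire takes the lock for cmd if it is free and reports success.
func (l *opLock) tryAcquire(cmd string) bool {
	if l.holder != "" {
		return false
	}
	l.holder = cmd
	return true
}

// agent is a toy model of the PBM agent. restoreMeta stands in for the
// pbmRestores collection; all names here are illustrative assumptions.
type agent struct {
	lock        *opLock
	restoreMeta map[string]bool
}

// handleRestore consumes one restore command. If the lock is busy it returns
// without writing metadata, and nothing re-queues the command.
func (a *agent) handleRestore(name string) {
	if !a.lock.tryAcquire("restore") {
		return
	}
	a.restoreMeta[name] = true
}

func main() {
	// backup-2 already holds the lock when the restore command arrives.
	a := &agent{lock: &opLock{holder: "backup"}, restoreMeta: map[string]bool{}}
	a.handleRestore("restore-1")
	fmt.Println("metadata written:", a.restoreMeta["restore-1"])
}
```

In this model the restore command is gone after one attempt, which matches the behavior the report describes: the operator has nothing to find on later reconciles.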

On every subsequent reconcile, `GetRestoreMeta(cr.Status.PBMname)` returns `nil`, so the controller takes this branch each time:

```go
if meta == nil || meta.Name == "" {
    log.Info("Waiting for restore metadata", ...)
    return status, nil  // state stays Requested, loops every 5s forever
}
```

No timeout or exit condition exists for the nil-metadata case — the restore stays in `Requested` state forever.
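For illustration only, an exit condition for the nil-metadata case could be as small as a bounded wait. The helper below is a hypothetical sketch, not operator code and not part of the proposed fix; the name `waitExpired` and the two-minute limit are assumptions.

```go
package main

import (
	"fmt"
	"time"
)

// waitExpired reports whether a restore has been waiting for metadata
// longer than limit. Hypothetical helper; the operator has no such check.
func waitExpired(elapsed, limit time.Duration) bool {
	return elapsed > limit
}

func main() {
	limit := 2 * time.Minute // assumed value, chosen only for the example
	for _, elapsed := range []time.Duration{5 * time.Second, 3 * time.Minute} {
		fmt.Printf("waited %v: give up = %v\n", elapsed, waitExpired(elapsed, limit))
	}
}
```

A reconciler using a check like this would move the restore to `Error` once the limit passes instead of looping in `Requested`.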

**Bug 2 — In-progress backup state is overridden to `Waiting` by a concurrent restore CR**

`HasActiveJobs` in the backup controller checks the K8s restore CR list first (backup.go:58-64):

```go
if r.Status.State != RestoreStateReady &&
   r.Status.State != RestoreStateError &&
   r.Status.State != RestoreStateWaiting {
    return true, nil  // restore-1 is Requested → triggers this
}
```

`RestoreStateRequested` is not in the exclusion list, so the mere presence of `restore-1` (stuck in `Requested`) causes `HasActiveJobs` to return `true` for `backup-2`. The backup reconciler then unconditionally overrides the backup state:

```go
if cjobs {
    status.State = psmdbv1.BackupStateWaiting
    return status, nil
}
```

This downgrades `backup-2` from `Running`/`Requested` to `Waiting` — permanently, since `restore-1` never exits its stuck `Requested` state. The PBM agent finishes the backup successfully in the background, but the operator never polls `GetBackupMeta` again to record that result.
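The effect of the exclusion list can be checked in isolation. The sketch below reimplements only the state comparison quoted above; the type `restoreState`, its string values, and the helper names are illustrative assumptions rather than the operator's definitions.

```go
package main

import "fmt"

// restoreState mirrors the idea of the operator's restore states. The string
// values are illustrative and not copied from the operator.
type restoreState string

const (
	stateWaiting   restoreState = "waiting"
	stateRequested restoreState = "requested"
	stateRunning   restoreState = "running"
	stateReady     restoreState = "ready"
	stateError     restoreState = "error"
)

// restoreCountsAsActive reproduces the exclusion check from the report:
// every state except Ready, Error, and Waiting is treated as an active job.
func restoreCountsAsActive(s restoreState) bool {
	return s != stateReady && s != stateError && s != stateWaiting
}

// hasActiveRestore reports whether any restore in the list blocks a backup.
func hasActiveRestore(states []restoreState) bool {
	for _, s := range states {
		if restoreCountsAsActive(s) {
			return true
		}
	}
	return false
}

func main() {
	all := []restoreState{stateWaiting, stateRequested, stateRunning, stateReady, stateError}
	for _, s := range all {
		fmt.Printf("%-9s active=%v\n", s, restoreCountsAsActive(s))
	}
}
```

A restore in `Requested` counts as active here, while one in `Waiting` does not, which is why the state a blocked restore ends up in matters for `backup-2`.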

### Impact

The two bugs reinforce each other into a permanent deadlock:

- `restore-1` is stuck because it sent `CmdRestore` while a backup lock was active.
- `backup-2` is stuck because `HasActiveJobs` keeps returning `true` due to `restore-1` being stuck.
- Neither operation ever transitions to `Ready` or `Error` without manual intervention.

### Proposed Fixes

**Fix — Block restore dispatch when a non-PITR PBM lock is active**

In `reconcileLogicalRestore`, before calling `runRestore`, add a check for any active backup/restore PBM lock:

```go
isBlocked, err := pbmc.HasLocks(ctx, backup.NotPITRLock)
if err != nil {
    return status, errors.Wrap(err, "checking pbm locks")
}
if isBlocked {
    log.Info("Waiting for active PBM operation to complete before starting restore.")
    status.State = psmdbv1.RestoreStateWaiting
    return status, nil
}
```

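This ensures `CmdRestore` is only dispatched when no conflicting PBM lock is held, preventing the unrecoverable nil-metadata state. It should also relieve Bug 2 for this scenario: a restore parked in `Waiting` is excluded by the restore-CR check in `HasActiveJobs`, so it no longer causes `backup-2` to be demoted.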
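The predicate-based lock check can be modeled in a few lines. This is a sketch under assumptions: `lock`, `hasLocks`, `isPITRLock`, `notPITRLock`, and `mustWaitForOperation` are simplified local stand-ins, not the operator's or PBM's actual types and signatures.

```go
package main

import "fmt"

// lock is a simplified stand-in for a PBM lock record; only the command
// type matters for this sketch.
type lock struct{ cmd string }

// isPITRLock and notPITRLock model the two predicates named in the report.
func isPITRLock(l lock) bool  { return l.cmd == "pitr" }
func notPITRLock(l lock) bool { return !isPITRLock(l) }

// hasLocks reports whether any held lock matches pred.
func hasLocks(held []lock, pred func(lock) bool) bool {
	for _, l := range held {
		if pred(l) {
			return true
		}
	}
	return false
}

// mustWaitForOperation covers only the new check: a restore waits while any
// non-PITR lock is held. PITR locks are left to the existing PITR check.
func mustWaitForOperation(held []lock) bool {
	return hasLocks(held, notPITRLock)
}

func main() {
	fmt.Println("backup running:", mustWaitForOperation([]lock{{cmd: "backup"}}))
	fmt.Println("no locks held: ", mustWaitForOperation(nil))
}
```

With a backup lock held the restore waits; once the lock list is empty the same check lets the restore proceed on a later reconcile.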


### Versions

1. Kubernetes
2. Operator 1.22.0
3. Database: MongoDB


### Anything else?

_No response_

Deadlock: Concurrent backup and restore operations cause both to hang indefinitely #2317
