K8SPSMDB-1643: prevent concurrent backup and restore operations #2318

Open
Vinh1507 wants to merge 1 commit into percona:main from
Vinh1507:K8SPSMDB-1643-fix-concurrent-backup-restore

Conversation

@Vinh1507

CHANGE DESCRIPTION

Problem:
When a restore CR is created while a backup is running (or just starting), the operator
blocks indefinitely on "Waiting for restore metadata".

Cause:
The original lock guard only checked for a PITR lock (IsPITRLock). A running backup
holds a different lock type, so the check always returned false during an active
backup and CmdRestore was sent unconditionally. PBM returns a ConcurrentOpError,
writes nothing to pbmRestores, and the operator waits forever.

Solution:
Three guards are applied in sequence in
pkg/controller/perconaservermongodbrestore/logical.go before CmdRestore is sent:

// Guard 1: PITR lock — unchanged from original logic
isBlockedByPITR, err := pbmc.HasLocks(ctx, backup.IsPITRLock)
if isBlockedByPITR {
    log.Info("Waiting for PITR to be disabled.")
    ...
}

// Guard 2: K8s lease — covers the ~1–2 s window at backup startup before pbmLock is acquired,
// and the window after pbmLock is released but before the operator finishes processing completion.
// k8s.IsLeaseActive is added to pkg/k8s/lease.go.
leaseActive, err := k8s.IsLeaseActive(ctx, r.client, naming.BackupLeaseName(cluster.Name), cluster.Namespace)
if leaseActive {
    log.Info("Waiting for active backup to complete before starting restore.")
    ...
}

// Guard 3: PBM lock — catches operations with no K8s lease: concurrent restore CRs or
// manual PBM backups started via CLI.
hasActiveLocks, err := pbmc.HasLocks(ctx)
if hasActiveLocks {
    log.Info("Waiting for active PBM operation to complete.")
    ...
}

CHECKLIST

Jira

  • Is the Jira ticket created and referenced properly?
  • Does the Jira ticket have the proper statuses for documentation (Needs Doc) and QA (Needs QA)?
  • Does the Jira ticket link to the proper milestone (Fix Version field)?

Tests

  • Is an E2E test/test case added for the new feature/change?
  • Are unit tests added where appropriate?
  • Are OpenShift compare files changed for E2E tests (compare/*-oc.yml)?

Config/Logging/Testability

  • Are all needed new/changed options added to default YAML files? — N/A, no new config options introduced.
  • Are all needed new/changed options added to the Helm Chart? — N/A, no new config options introduced.
  • Did we add proper logging messages for operator actions? — Yes: "Waiting for PITR to be disabled.", "Waiting for active backup to complete before starting restore.", "Waiting for active PBM operation to complete."
  • Did we ensure compatibility with the previous version or cluster upgrade process? — Yes, the change is purely additive (extra guards before sending CmdRestore). No API or schema changes.
  • Does the change support oldest and newest supported MongoDB version? — Yes, the fix operates at the operator/K8s level and does not interact with any MongoDB version-specific features.
  • Does the change support oldest and newest supported Kubernetes version? — Yes, uses coordination.k8s.io/v1 Lease which has been stable since Kubernetes 1.14.

@CLAassistant

CLAassistant commented Apr 24, 2026

CLA assistant check
All committers have signed the CLA.

@JNKPercona
Collaborator

Test Name Result Time
arbiter passed 00:11:21
balancer passed 00:19:16
cross-site-sharded passed 00:18:41
custom-replset-name passed 00:10:20
custom-tls passed 00:14:27
custom-users-roles passed 00:10:07
custom-users-roles-sharded passed 00:11:09
data-at-rest-encryption passed 00:13:09
data-sharded passed 00:24:06
demand-backup passed 00:15:30
demand-backup-eks-credentials-irsa passed 00:00:08
demand-backup-fs passed 00:23:54
demand-backup-if-unhealthy passed 00:08:27
demand-backup-incremental-aws passed 00:11:50
demand-backup-incremental-azure passed 00:12:25
demand-backup-incremental-gcp-native passed 00:12:13
demand-backup-incremental-gcp-s3 passed 00:10:44
demand-backup-incremental-minio passed 00:25:36
demand-backup-incremental-sharded-aws passed 00:18:49
demand-backup-incremental-sharded-azure passed 00:17:58
demand-backup-incremental-sharded-gcp-native passed 00:17:26
demand-backup-incremental-sharded-gcp-s3 passed 00:17:41
demand-backup-incremental-sharded-minio passed 00:27:42
demand-backup-logical-minio-native-tls passed 00:08:55
demand-backup-physical-parallel passed 00:08:16
demand-backup-physical-aws passed 00:12:00
demand-backup-physical-azure passed 00:12:02
demand-backup-physical-gcp-s3 passed 00:11:39
demand-backup-physical-gcp-native passed 00:11:56
demand-backup-physical-minio passed 00:20:46
demand-backup-physical-minio-native passed 00:26:08
demand-backup-physical-minio-native-tls passed 00:19:23
demand-backup-physical-sharded-parallel passed 00:11:27
demand-backup-physical-sharded-aws passed 00:18:41
demand-backup-physical-sharded-azure passed 00:17:42
demand-backup-physical-sharded-gcp-native passed 00:17:45
demand-backup-physical-sharded-minio passed 00:17:48
demand-backup-physical-sharded-minio-native passed 00:17:31
demand-backup-sharded passed 00:26:06
demand-backup-snapshot passed 00:37:33
demand-backup-snapshot-vault passed 00:18:22
disabled-auth passed 00:16:24
expose-sharded passed 00:33:40
finalizer passed 00:09:54
ignore-labels-annotations passed 00:07:56
init-deploy passed 00:13:10
ldap passed 00:09:04
ldap-tls passed 00:12:50
limits passed 00:06:13
liveness passed 00:09:20
mongod-major-upgrade passed 00:13:15
mongod-major-upgrade-sharded passed 00:21:15
monitoring-2-0 passed 00:24:58
monitoring-pmm3 passed 00:27:00
multi-cluster-service passed 00:14:06
multi-storage passed 00:19:18
non-voting-and-hidden failure 00:38:37
one-pod passed 00:08:22
operator-self-healing-chaos passed 00:12:18
pitr passed 00:32:49
pitr-physical passed 01:04:57
pitr-sharded passed 00:22:03
pitr-to-new-cluster passed 00:25:42
pitr-physical-backup-source passed 00:54:55
preinit-updates passed 00:05:22
pvc-auto-resize passed 00:13:22
pvc-resize passed 00:17:14
recover-no-primary passed 00:26:13
replset-overrides passed 00:18:18
replset-remapping passed 00:16:27
replset-remapping-sharded passed 00:17:53
rs-shard-migration passed 00:14:25
scaling passed 00:10:48
scheduled-backup passed 00:17:29
security-context passed 00:06:47
self-healing-chaos passed 00:15:00
service-per-pod passed 00:18:59
serviceless-external-nodes passed 00:07:28
smart-update passed 00:08:37
split-horizon passed 00:13:54
stable-resource-version passed 00:04:44
storage passed 00:07:39
tls-issue-cert-manager passed 00:30:09
unsafe-psa passed 00:08:08
upgrade passed 00:09:22
upgrade-consistency passed 00:06:26
upgrade-consistency-sharded-tls passed 00:56:56
upgrade-sharded passed 00:19:58
upgrade-partial-backup passed 00:15:56
users passed 00:17:11
users-vault passed 00:13:18
version-service passed 00:26:10
Summary Value
Tests Run 92/92
Job Duration 03:05:24
Total Test Time 26:49:59

commit: 8c4144e
image: perconalab/percona-server-mongodb-operator:PR-2318-8c4144eb

Member

@mayankshah1607 left a comment

Thanks for the quick fix. I believe we need to perform the same checks for physical restore as well.

return status, nil
}

hasActiveLocks, err := pbmc.HasLocks(ctx)
Member

Let's add a new backup.IsBackupLock predicate, like we do with the previous pbmc.HasLocks call.

Contributor

Yeah, otherwise restores would be blocked if the oplog slicer for PITR is running.
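
To make the concern concrete, here is a stdlib-only sketch of the predicate-filtering pattern the review suggests. The `LockHeader` struct, the `"pitr"`/`"backup"` type strings, and the `hasLocks`/`isBackupLock` names are illustrative stand-ins, not PBM's actual API; the point is that an unfiltered lock check matches the oplog slicer's lock, while a backup-only predicate does not.

```go
package main

import "fmt"

// LockHeader mimics the shape of a PBM lock record: each running operation
// (backup, restore, PITR oplog slicing) holds a lock tagged with its
// command type. Names here are illustrative.
type LockHeader struct {
	Type string // e.g. "backup", "restore", "pitr"
}

// hasLocks reports whether any lock satisfies all supplied predicates,
// mirroring the pbmc.HasLocks(ctx, predicates...) call pattern. With no
// predicates, every lock counts — which is exactly what would wrongly
// block a restore while only the PITR oplog slicer is running.
func hasLocks(locks []LockHeader, predicates ...func(LockHeader) bool) bool {
	for _, l := range locks {
		matched := true
		for _, p := range predicates {
			if !p(l) {
				matched = false
				break
			}
		}
		if matched {
			return true
		}
	}
	return false
}

// isBackupLock is the kind of predicate being proposed: only a lock held
// by an actual backup should delay the restore.
func isBackupLock(l LockHeader) bool { return l.Type == "backup" }

func main() {
	locks := []LockHeader{{Type: "pitr"}} // only the oplog slicer is running

	fmt.Println(hasLocks(locks))               // true: unfiltered check blocks the restore
	fmt.Println(hasLocks(locks, isBackupLock)) // false: filtered check lets it proceed
}
```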

Comment on lines +64 to +82
leaseActive, err := k8s.IsLeaseActive(ctx, r.client, naming.BackupLeaseName(cluster.Name), cluster.Namespace)
if err != nil {
    return status, errors.Wrap(err, "check backup lease")
}
if leaseActive {
    log.Info("Waiting for active backup to complete before starting restore.")
    status.State = psmdbv1.RestoreStateWaiting
    return status, nil
}

hasActiveLocks, err := pbmc.HasLocks(ctx)
if err != nil {
    return status, errors.Wrap(err, "checking pbm locks")
}
if hasActiveLocks {
    log.Info("Waiting for active PBM operation to complete.")
    status.State = psmdbv1.RestoreStateWaiting
    return status, nil
}
Contributor

As @mayankshah1607 said, this should be moved to the high-level restore Reconcile function so it applies to all kinds of restores.

return status, errors.Wrap(err, "check backup lease")
}
if leaseActive {
log.Info("Waiting for active backup to complete before starting restore.")
Contributor

I would put the lease name in this log message.


Labels: size/M (30-99 lines)
