K8SPSMDB-1643: prevent concurrent backup and restore operations #2318

Open
Vinh1507 wants to merge 1 commit into percona:main from
Vinh1507:K8SPSMDB-1643-fix-concurrent-backup-restore

Conversation

@Vinh1507

CHANGE DESCRIPTION

Problem:
When a restore CR is created while a backup is running (or just starting), the operator
blocks indefinitely on "Waiting for restore metadata".

Cause:
The original lock guard only checked for a PITR lock (IsPITRLock). A running backup
holds a different lock type, so the check always returned false during an active
backup and CmdRestore was sent unconditionally. PBM returns a ConcurrentOpError,
writes nothing to pbmRestores, and the operator waits forever.

Solution:
Three guards are applied in sequence in
pkg/controller/perconaservermongodbrestore/logical.go before CmdRestore is sent:

// Guard 1: PITR lock — unchanged from original logic
isBlockedByPITR, err := pbmc.HasLocks(ctx, backup.IsPITRLock)
if isBlockedByPITR {
    log.Info("Waiting for PITR to be disabled.")
    ...
}

// Guard 2: K8s lease — covers the ~1–2 s window at backup startup before pbmLock is acquired,
// and the window after pbmLock is released but before the operator finishes processing completion.
// k8s.IsLeaseActive is added to pkg/k8s/lease.go.
leaseActive, err := k8s.IsLeaseActive(ctx, r.client, naming.BackupLeaseName(cluster.Name), cluster.Namespace)
if leaseActive {
    log.Info("Waiting for active backup to complete before starting restore.")
    ...
}

// Guard 3: PBM lock — catches operations with no K8s lease: concurrent restore CRs or
// manual PBM backups started via CLI.
hasActiveLocks, err := pbmc.HasLocks(ctx)
if hasActiveLocks {
    log.Info("Waiting for active PBM operation to complete.")
    ...
}

CHECKLIST

Jira

  • Is the Jira ticket created and referenced properly?
  • Does the Jira ticket have the proper statuses for documentation (Needs Doc) and QA (Needs QA)?
  • Does the Jira ticket link to the proper milestone (Fix Version field)?

Tests

  • Is an E2E test/test case added for the new feature/change?
  • Are unit tests added where appropriate?
  • Are OpenShift compare files changed for E2E tests (compare/*-oc.yml)?

Config/Logging/Testability

  • Are all needed new/changed options added to default YAML files? — N/A, no new config options introduced.
  • Are all needed new/changed options added to the Helm Chart? — N/A, no new config options introduced.
  • Did we add proper logging messages for operator actions? — Yes: "Waiting for PITR to be disabled.", "Waiting for active backup to complete before starting restore.", "Waiting for active PBM operation to complete."
  • Did we ensure compatibility with the previous version or cluster upgrade process? — Yes, the change is purely additive (extra guards before sending CmdRestore). No API or schema changes.
  • Does the change support oldest and newest supported MongoDB version? — Yes, the fix operates at the operator/K8s level and does not interact with any MongoDB version-specific features.
  • Does the change support oldest and newest supported Kubernetes version? — Yes, uses coordination.k8s.io/v1 Lease which has been stable since Kubernetes 1.14.

@CLAassistant

CLAassistant commented Apr 24, 2026

CLA assistant check
All committers have signed the CLA.

@JNKPercona
Collaborator

Test Name Result Time
arbiter passed 00:11:21
balancer passed 00:19:16
cross-site-sharded passed 00:18:41
custom-replset-name passed 00:10:20
custom-tls passed 00:14:27
custom-users-roles passed 00:10:07
custom-users-roles-sharded passed 00:11:09
data-at-rest-encryption passed 00:13:09
data-sharded passed 00:24:06
demand-backup passed 00:15:30
demand-backup-eks-credentials-irsa passed 00:00:08
demand-backup-fs passed 00:23:54
demand-backup-if-unhealthy passed 00:08:27
demand-backup-incremental-aws passed 00:11:50
demand-backup-incremental-azure passed 00:12:25
demand-backup-incremental-gcp-native passed 00:12:13
demand-backup-incremental-gcp-s3 passed 00:10:44
demand-backup-incremental-minio passed 00:25:36
demand-backup-incremental-sharded-aws passed 00:18:49
demand-backup-incremental-sharded-azure passed 00:17:58
demand-backup-incremental-sharded-gcp-native passed 00:17:26
demand-backup-incremental-sharded-gcp-s3 passed 00:17:41
demand-backup-incremental-sharded-minio passed 00:27:42
demand-backup-logical-minio-native-tls passed 00:08:55
demand-backup-physical-parallel passed 00:08:16
demand-backup-physical-aws passed 00:12:00
demand-backup-physical-azure passed 00:12:02
demand-backup-physical-gcp-s3 passed 00:11:39
demand-backup-physical-gcp-native passed 00:11:56
demand-backup-physical-minio passed 00:20:46
demand-backup-physical-minio-native passed 00:26:08
demand-backup-physical-minio-native-tls passed 00:19:23
demand-backup-physical-sharded-parallel passed 00:11:27
demand-backup-physical-sharded-aws passed 00:18:41
demand-backup-physical-sharded-azure passed 00:17:42
demand-backup-physical-sharded-gcp-native passed 00:17:45
demand-backup-physical-sharded-minio passed 00:17:48
demand-backup-physical-sharded-minio-native passed 00:17:31
demand-backup-sharded passed 00:26:06
demand-backup-snapshot passed 00:37:33
demand-backup-snapshot-vault passed 00:18:22
disabled-auth passed 00:16:24
expose-sharded passed 00:33:40
finalizer passed 00:09:54
ignore-labels-annotations passed 00:07:56
init-deploy passed 00:13:10
ldap passed 00:09:04
ldap-tls passed 00:12:50
limits passed 00:06:13
liveness passed 00:09:20
mongod-major-upgrade passed 00:13:15
mongod-major-upgrade-sharded passed 00:21:15
monitoring-2-0 passed 00:24:58
monitoring-pmm3 passed 00:27:00
multi-cluster-service passed 00:14:06
multi-storage passed 00:19:18
non-voting-and-hidden failure 00:38:37
one-pod passed 00:08:22
operator-self-healing-chaos passed 00:12:18
pitr passed 00:32:49
pitr-physical passed 01:04:57
pitr-sharded passed 00:22:03
pitr-to-new-cluster passed 00:25:42
pitr-physical-backup-source passed 00:54:55
preinit-updates passed 00:05:22
pvc-auto-resize passed 00:13:22
pvc-resize passed 00:17:14
recover-no-primary passed 00:26:13
replset-overrides passed 00:18:18
replset-remapping passed 00:16:27
replset-remapping-sharded passed 00:17:53
rs-shard-migration passed 00:14:25
scaling passed 00:10:48
scheduled-backup passed 00:17:29
security-context passed 00:06:47
self-healing-chaos passed 00:15:00
service-per-pod passed 00:18:59
serviceless-external-nodes passed 00:07:28
smart-update passed 00:08:37
split-horizon passed 00:13:54
stable-resource-version passed 00:04:44
storage passed 00:07:39
tls-issue-cert-manager passed 00:30:09
unsafe-psa passed 00:08:08
upgrade passed 00:09:22
upgrade-consistency passed 00:06:26
upgrade-consistency-sharded-tls passed 00:56:56
upgrade-sharded passed 00:19:58
upgrade-partial-backup passed 00:15:56
users passed 00:17:11
users-vault passed 00:13:18
version-service passed 00:26:10
Summary Value
Tests Run 92/92
Job Duration 03:05:24
Total Test Time 26:49:59

commit: 8c4144e
image: perconalab/percona-server-mongodb-operator:PR-2318-8c4144eb

Member

@mayankshah1607 left a comment

Thanks for the quick fix. I believe we need to perform the same checks for physical restore as well.

return status, nil
}

hasActiveLocks, err := pbmc.HasLocks(ctx)
Member

Let's add a new backup.IsBackupLock predicate, like we do with the previous pbmc.HasLocks call.

Contributor

Yeah, otherwise restores would be blocked if the oplog slicer for PITR is running.
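
To make the concern concrete, here is a stdlib-only sketch of the predicate-filtering pattern the review suggests. The `LockHeader` struct, the `"pitr"`/`"backup"` type strings, and the `hasLocks`/`isBackupLock` names are illustrative stand-ins, not PBM's actual API; the point is that an unfiltered lock check matches the oplog slicer's lock, while a backup-only predicate does not.

```go
package main

import "fmt"

// LockHeader mimics the shape of a PBM lock record: each running operation
// (backup, restore, PITR oplog slicing) holds a lock tagged with its
// command type. Names here are illustrative.
type LockHeader struct {
	Type string // e.g. "backup", "restore", "pitr"
}

// hasLocks reports whether any lock satisfies all supplied predicates,
// mirroring the pbmc.HasLocks(ctx, predicates...) call pattern. With no
// predicates, every lock counts — which is exactly what would wrongly
// block a restore while only the PITR oplog slicer is running.
func hasLocks(locks []LockHeader, predicates ...func(LockHeader) bool) bool {
	for _, l := range locks {
		matched := true
		for _, p := range predicates {
			if !p(l) {
				matched = false
				break
			}
		}
		if matched {
			return true
		}
	}
	return false
}

// isBackupLock is the kind of predicate being proposed: only a lock held
// by an actual backup should delay the restore.
func isBackupLock(l LockHeader) bool { return l.Type == "backup" }

func main() {
	locks := []LockHeader{{Type: "pitr"}} // only the oplog slicer is running

	fmt.Println(hasLocks(locks))               // true: unfiltered check blocks the restore
	fmt.Println(hasLocks(locks, isBackupLock)) // false: filtered check lets it proceed
}
```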

Comment on lines +64 to +82
leaseActive, err := k8s.IsLeaseActive(ctx, r.client, naming.BackupLeaseName(cluster.Name), cluster.Namespace)
if err != nil {
    return status, errors.Wrap(err, "check backup lease")
}
if leaseActive {
    log.Info("Waiting for active backup to complete before starting restore.")
    status.State = psmdbv1.RestoreStateWaiting
    return status, nil
}

hasActiveLocks, err := pbmc.HasLocks(ctx)
if err != nil {
    return status, errors.Wrap(err, "checking pbm locks")
}
if hasActiveLocks {
    log.Info("Waiting for active PBM operation to complete.")
    status.State = psmdbv1.RestoreStateWaiting
    return status, nil
}
Contributor

As @mayankshah1607 said, this should be moved to the high-level restore Reconcile function so it applies to all kinds of restores.

return status, errors.Wrap(err, "check backup lease")
}
if leaseActive {
log.Info("Waiting for active backup to complete before starting restore.")
Contributor

I would put the lease name in this log message.


Labels: size/M (30-99 lines)
