Race condition in deleteListenerPod causes intermittent FailedMount errors #4450

@Okabe-Junya

Description

Controller Version

0.13.1 (affects any version that includes #4033)

Deployment Method

Helm

Checks

  • This isn't a question or user support case (For Q&A and community support, go to Discussions).
  • I've read the Changelog before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes

To Reproduce

1. Deploy ARC with an `AutoscalingRunnerSet`
2. Wait for listener pods to terminate and restart through normal lifecycle events (eviction, OOM kill, completion, etc.)
3. Observe intermittent `FailedMount` errors or `no such file or directory` failures on `/etc/gha-listener/config.json`

Note: This is timing-dependent and does not reproduce on every pod restart. Higher API server latency or reconciliation load increases the failure rate.

Describe the bug

#4033 introduced a race condition in `deleteListenerPod()`. When a listener pod terminates, the function deletes both the pod and the listener config secret in the same reconciliation:

```go
func (r *AutoscalingListenerReconciler) deleteListenerPod(ctx context.Context, autoscalingListener *v1alpha1.AutoscalingListener, listenerPod *corev1.Pod, log logr.Logger) error {
	if err := r.publishRunningListener(autoscalingListener, false); err != nil {
		log.Error(err, "Unable to publish runner listener down metric", "namespace", listenerPod.Namespace, "name", listenerPod.Name)
	}
	if listenerPod.DeletionTimestamp.IsZero() {
		log.Info("Deleting the listener pod", "namespace", listenerPod.Namespace, "name", listenerPod.Name)
		if err := r.Delete(ctx, listenerPod); err != nil && !kerrors.IsNotFound(err) {
			log.Error(err, "Unable to delete the listener pod", "namespace", listenerPod.Namespace, "name", listenerPod.Name)
			return err
		}
		// delete the listener config secret as well, so it gets recreated when the listener pod is recreated, with any new data if it exists
		var configSecret corev1.Secret
		err := r.Get(ctx, types.NamespacedName{Namespace: autoscalingListener.Namespace, Name: scaleSetListenerConfigName(autoscalingListener)}, &configSecret)
		switch {
		case err == nil && configSecret.DeletionTimestamp.IsZero():
			log.Info("Deleting the listener config secret")
			if err := r.Delete(ctx, &configSecret); err != nil {
				return fmt.Errorf("failed to delete listener config secret: %w", err)
			}
		case !kerrors.IsNotFound(err):
			return fmt.Errorf("failed to get the listener config secret: %w", err)
		}
	}
	return nil
}
```

```
Reconciliation 1 (pod terminated):
  ├─ r.Delete(listenerPod)       → async deletion starts
  └─ r.Delete(configSecret)      → async deletion starts

Reconciliation 2 (pod not found → createListenerPod):
  ├─ r.Get(configSecret)         → may still exist (deletion pending) or already gone
  ├─ If it exists: skip creation → but the secret is terminating
  ├─ r.Create(newPod)            → pod references the config secret as a volume
  └─ Kubelet mounts volume       → ⚡ secret may be gone by now → FailedMount
```

Because Kubernetes object deletion is asynchronous, there is a race window between the config secret's deletion and the new pod's volume-mount attempt. The outcome depends on API server latency, garbage-collector timing, and kubelet scheduling.
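The window is easy to see in a self-contained simulation. The in-memory `fakeStore` below stands in for the API server's two-phase deletion (deletion requested, then finalized later); `simulateRace` and all names here are illustrative, not ARC code:

```go
package main

import "fmt"

// object stands in for a Kubernetes Secret; deletion is two-phase,
// mirroring the API server: a deletion is first only requested, and the
// object disappears some time later.
type object struct {
	deletionPending bool
}

// fakeStore stands in for the API server.
type fakeStore struct {
	objects map[string]*object
}

func (s *fakeStore) del(name string) {
	if o, ok := s.objects[name]; ok {
		o.deletionPending = true // deletion requested, object still readable
	}
}

func (s *fakeStore) finalize(name string) {
	delete(s.objects, name) // deletion completes asynchronously
}

func (s *fakeStore) get(name string) (*object, bool) {
	o, ok := s.objects[name]
	return o, ok
}

// simulateRace replays the two reconciliations from the diagram above and
// returns what the kubelet observes when it mounts the volume.
func simulateRace() string {
	s := &fakeStore{objects: map[string]*object{"config-secret": {}}}

	// Reconciliation 1: deleteListenerPod deletes pod and secret (async).
	s.del("config-secret")

	// Reconciliation 2: the naive create path still sees the secret, so it
	// skips recreating it and creates a new pod that mounts it as a volume.
	if o, ok := s.get("config-secret"); ok && o.deletionPending {
		// secret exists but is terminating; creation is skipped anyway
	}

	// The pending deletion completes before the kubelet mounts the volume.
	s.finalize("config-secret")

	if _, ok := s.get("config-secret"); !ok {
		return "FailedMount: secret no longer exists"
	}
	return "mount OK"
}

func main() {
	fmt.Println(simulateRace())
}
```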

Describe the expected behavior

The config secret should persist across listener pod restarts. It should be deleted only when the `AutoscalingListener` resource itself is deleted (already handled by `cleanupResources()`).

If the config secret content needs to be refreshed (e.g., due to token rotation as in #4029), the controller should update the existing secret in place rather than delete it during pod termination.
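A minimal sketch of that update-in-place approach, again against an in-memory stand-in for the API server. `refreshConfig` and the `store` type are hypothetical names, not the controller's API; the real fix would update the existing `corev1.Secret` via the client rather than delete it:

```go
package main

import "fmt"

// secret stands in for corev1.Secret; only its data matters here.
type secret struct {
	data map[string]string
}

// store stands in for the API server.
type store struct {
	secrets map[string]*secret
}

// refreshConfig writes new listener config into the existing secret,
// creating it only if missing. Because the object is never deleted,
// there is no window in which a new pod's volume mount can fail.
func (s *store) refreshConfig(name string, newData map[string]string) {
	sec, ok := s.secrets[name]
	if !ok {
		sec = &secret{}
		s.secrets[name] = sec
	}
	sec.data = newData
}

func main() {
	s := &store{secrets: map[string]*secret{
		"config-secret": {data: map[string]string{"token": "old"}},
	}}

	// Listener pod terminated and the token was rotated (cf. #4029):
	// update the secret in place instead of deleting it.
	s.refreshConfig("config-secret", map[string]string{"token": "new"})

	// The recreated pod's volume mount always finds the secret.
	sec, ok := s.secrets["config-secret"]
	fmt.Println(ok, sec.data["token"])
}
```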

Additional Context

cf. mercari#13


Controller Logs

N/A

Runner Pod Logs

N/A

Labels

bug (Something isn't working) · gha-runner-scale-set (Related to the gha-runner-scale-set mode) · needs triage (Requires review from the maintainers)
