Checks
Controller Version
0.13.1 (affects any version that includes #4033)
Deployment Method
Helm
To Reproduce
1. Deploy ARC with `AutoscalingRunnerSet`
2. Wait for listener pods to terminate and restart due to normal lifecycle events (eviction, OOM kill, completion, etc.)
3. Observe intermittent `FailedMount` errors or `no such file or directory` on `/etc/gha-listener/config.json`
Note: This is timing-dependent and does not reproduce on every pod restart. Higher API server latency or reconciliation load increases the failure rate.
Describe the bug
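#4033 introduced a race condition in `deleteListenerPod()` (`actions-runner-controller/controllers/actions.github.com/autoscalinglistener_controller.go`, lines 277 to 303 in 74cfc38). When a listener pod terminates, the function deletes both the pod and the config secret in the same reconciliation: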

    func (r *AutoscalingListenerReconciler) deleteListenerPod(ctx context.Context, autoscalingListener *v1alpha1.AutoscalingListener, listenerPod *corev1.Pod, log logr.Logger) error {
        if err := r.publishRunningListener(autoscalingListener, false); err != nil {
            log.Error(err, "Unable to publish runner listener down metric", "namespace", listenerPod.Namespace, "name", listenerPod.Name)
        }

        if listenerPod.DeletionTimestamp.IsZero() {
            log.Info("Deleting the listener pod", "namespace", listenerPod.Namespace, "name", listenerPod.Name)
            if err := r.Delete(ctx, listenerPod); err != nil && !kerrors.IsNotFound(err) {
                log.Error(err, "Unable to delete the listener pod", "namespace", listenerPod.Namespace, "name", listenerPod.Name)
                return err
            }

            // delete the listener config secret as well, so it gets recreated when the listener pod is recreated, with any new data if it exists
            var configSecret corev1.Secret
            err := r.Get(ctx, types.NamespacedName{Namespace: autoscalingListener.Namespace, Name: scaleSetListenerConfigName(autoscalingListener)}, &configSecret)
            switch {
            case err == nil && configSecret.DeletionTimestamp.IsZero():
                log.Info("Deleting the listener config secret")
                if err := r.Delete(ctx, &configSecret); err != nil {
                    return fmt.Errorf("failed to delete listener config secret: %w", err)
                }
            case !kerrors.IsNotFound(err):
                return fmt.Errorf("failed to get the listener config secret: %w", err)
            }
        }
        return nil
    }

Reconciliation 1 (pod terminated):
├─ r.Delete(listenerPod) → async deletion starts
└─ r.Delete(configSecret) → async deletion starts
Reconciliation 2 (pod not found → createListenerPod):
├─ r.Get(configSecret) → may still exist (deletion pending) or already gone
├─ If exists: skip creation → but secret is being deleted
├─ r.Create(newPod) → pod references the config secret as a volume
└─ Kubelet mounts volume → ⚡ secret may be gone by now → FailedMount
Since Kubernetes object deletion is async, there is a race window between the config secret deletion and the new pod's volume mount attempt. The outcome depends on API server latency, garbage collector timing, and kubelet scheduling.
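The timeline above can be replayed deterministically with a toy model. This is only a sketch under stated assumptions: `fakeAPI`, `recreateListener`, and `simulate` are hypothetical names invented for illustration, and the "pending deletion" flag is a simplification of the window described above, not actual API server or kubelet behavior. It shows that the same two reconciliations succeed or fail depending solely on when the pending secret deletion completes.

```go
package main

import "fmt"

// fakeAPI is a hypothetical, in-memory stand-in for the API server.
// A secret whose deletion was requested stays readable until finalize() runs,
// which models the "deletion pending" window from the timeline.
type fakeAPI struct {
	secrets  map[string]bool // name -> currently readable
	deleting map[string]bool // name -> deletion requested, not yet completed
}

func newFakeAPI() *fakeAPI {
	return &fakeAPI{secrets: map[string]bool{}, deleting: map[string]bool{}}
}

func (a *fakeAPI) create(name string)      { a.secrets[name] = true }
func (a *fakeAPI) exists(name string) bool { return a.secrets[name] }

// requestDelete only marks the secret; it remains readable afterwards.
func (a *fakeAPI) requestDelete(name string) {
	if a.secrets[name] {
		a.deleting[name] = true
	}
}

// finalize completes every pending deletion.
func (a *fakeAPI) finalize() {
	for name := range a.deleting {
		delete(a.secrets, name)
		delete(a.deleting, name)
	}
}

// recreateListener mimics reconciliation 2: create the secret only if it is
// missing, then let the pod mount it. finalizeBeforeMount controls whether the
// pending deletion completes between the existence check and the mount.
func recreateListener(api *fakeAPI, secret string, finalizeBeforeMount bool) string {
	if !api.exists(secret) {
		api.create(secret)
	}
	if finalizeBeforeMount {
		api.finalize()
	}
	if !api.exists(secret) {
		return "FailedMount"
	}
	return "Mounted"
}

// simulate runs reconciliation 1 (request deletion) followed by
// reconciliation 2, with the deletion completing either before
// reconciliation 2 starts or after its existence check.
func simulate(deletionCompletesEarly bool) string {
	const secret = "listener-config"
	api := newFakeAPI()
	api.create(secret)
	api.requestDelete(secret)
	if deletionCompletesEarly {
		api.finalize()
		return recreateListener(api, secret, false)
	}
	return recreateListener(api, secret, true)
}

func main() {
	fmt.Println("deletion completes before reconciliation 2:", simulate(true))
	fmt.Println("deletion completes after the existence check:", simulate(false))
}
```

In this model the first ordering prints `Mounted` and the second prints `FailedMount`, matching the intermittent nature of the failure.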
Describe the expected behavior
The config secret should persist across listener pod restarts. It should only be deleted when the `AutoscalingListener` resource itself is deleted (already handled by `cleanupResources()`).
If the config secret content needs to be refreshed (e.g., due to token rotation as in #4029), the controller should update the existing secret in place rather than delete it during pod termination.
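A minimal sketch of the "update in place" idea, assuming a hypothetical in-memory `secretStore` and a hypothetical helper `upsertConfig` (neither exists in the controller). The point is only the shape of the logic: the secret is created when missing and overwritten when its content changes, but never deleted, so there is no window in which a new pod can reference a secret that is going away. In a real controller this might map onto something like controller-runtime's `controllerutil.CreateOrUpdate`, though that mapping is an assumption and not what the project currently does.

```go
package main

import (
	"bytes"
	"fmt"
)

// secretStore is a hypothetical in-memory stand-in for the secrets held by
// the API server, keyed by secret name.
type secretStore map[string][]byte

// upsertConfig creates the secret if it is missing and overwrites it if the
// desired content differs. It never deletes, so the name stays resolvable
// for any pod that mounts it.
func upsertConfig(store secretStore, name string, desired []byte) string {
	current, ok := store[name]
	switch {
	case !ok:
		store[name] = append([]byte(nil), desired...)
		return "created"
	case !bytes.Equal(current, desired):
		store[name] = append([]byte(nil), desired...)
		return "updated"
	default:
		return "unchanged"
	}
}

func main() {
	store := secretStore{}
	const name = "listener-config"

	// First reconciliation: the secret does not exist yet.
	fmt.Println(upsertConfig(store, name, []byte(`{"token":"a"}`)))
	// Pod restart with identical config: nothing to do.
	fmt.Println(upsertConfig(store, name, []byte(`{"token":"a"}`)))
	// Token rotation: content is refreshed without deleting the secret.
	fmt.Println(upsertConfig(store, name, []byte(`{"token":"b"}`)))
}
```

Run as written, this prints `created`, `unchanged`, and `updated` in that order.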
Additional Context
cf. mercari#13
Controller Logs
Runner Pod Logs