Add sidecar injection webhook proposal by tombentley · Pull Request #100 · kroxylicious/design

tombentley · 2026-04-17T04:01:39Z

Assisted-by: Claude Opus 4.6 noreply@anthropic.com

k-wall · 2026-04-23T08:29:56Z

Discussed at Community Call 23rd April. Tom is looking for feedback.

k-wall · 2026-04-24T10:53:52Z

+| `kroxylicious.io/config-generation` | The `metadata.generation` of the `KroxyliciousSidecarConfig` at injection time |
+| `kroxylicious.io/injection-timestamp` | ISO-8601 timestamp of when the sidecar was injected |
+
+Because the webhook only mutates pods at creation time, configuration changes to `KroxyliciousSidecarConfig` do not propagate to running pods. This matches how Istio and Linkerd handle sidecar injection. Users must restart pods to pick up new configuration.


Sounds very sensible

Not really. Because the proxy configuration is built from the KroxyliciousSidecarConfig at injection time it means that any subsequent changes to the KroxyliciousSidecarConfig are ignored until the next time the pod get created. That creates a lot of problems:

The user who has RBAC on the KroxyliciousSidecarConfig might not have RBAC on the Pod, Deployment etc. So it assumes some kind of communication between those users. Such communication might not be possible in practice.

Erroneous changes to the KroxyliciousSidecarConfig are only discovered at some unbounded time later.

The right way to solve this would be for the proxy to phone home to obtain a configuration from a control plane, and to receive updates to that configuration without the need for Pod restarts. But I really want to avoid boiling the ocean here by blocking this alpha sidecar feature on having such a control plane. I think the control plane piece is something we could add later (also as an alpha feature) and then maybe they would become stable features together later on.

k-wall · 2026-04-24T11:26:09Z

Thanks @tombentley, looks like a solid proposal. I've left a few comments to be considered.

tombentley

Thanks @k-wall @robobario, I think I've addresses your comments, though with a watering down of some of the initial ambition wrt trust boundaries. PTAL.

tombentley · 2026-04-27T03:52:53Z

+| `kroxylicious.io/config-generation` | The `metadata.generation` of the `KroxyliciousSidecarConfig` at injection time |
+| `kroxylicious.io/injection-timestamp` | ISO-8601 timestamp of when the sidecar was injected |
+
+Because the webhook only mutates pods at creation time, configuration changes to `KroxyliciousSidecarConfig` do not propagate to running pods. This matches how Istio and Linkerd handle sidecar injection. Users must restart pods to pick up new configuration.


Not really. Because the proxy configuration is built from the KroxyliciousSidecarConfig at injection time it means that any subsequent changes to the KroxyliciousSidecarConfig are ignored until the next time the pod get created. That creates a lot of problems:

The user who has RBAC on the KroxyliciousSidecarConfig might not have RBAC on the Pod, Deployment etc. So it assumes some kind of communication between those users. Such communication might not be possible in practice.

Erroneous changes to the KroxyliciousSidecarConfig are only discovered at some unbounded time later.

The right way to solve this would be for the proxy to phone home to obtain a configuration from a control plane, and to receive updates to that configuration without the need for Pod restarts. But I really want to avoid boiling the ocean here by blocking this alpha sidecar feature on having such a control plane. I think the control plane piece is something we could add later (also as an alpha feature) and then maybe they would become stable features together later on.

Add PRs kroxylicious#96, kroxylicious#98, kroxylicious#99, kroxylicious#100, kroxylicious#101, and kroxylicious#103 which were opened after the initial script was created. Note: PR kroxylicious#100 is already correctly named (100-sidecar-injection-webhook.md). Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

This change simplifies the proposal numbering system by using PR numbers as proposal identifiers, eliminating number collisions and removing the need for a separate allocation process. Changes: - Simplified proposals/README.md to focus on author workflow - Removed index tables (directory listing serves as the index) - Streamlined instructions for creating and renaming proposals - Updated proposal template with workflow instructions - Require PR number in title format: # <PR#> - <Title> - Moved workflow instructions into comment block - Added GitHub workflow to automatically check proposal numbering - Validates both filename and title format - Updates PR description when proposal files don't match PR number - Provides exact commands to fix naming issues - Removes warning once corrected - Handles both added and renamed files - Runs on all PRs (ready for mandatory status check) - Added notification script for existing open PRs - After merge, run notify-open-prs.sh to ask authors to rebase - Workflow will automatically guide them through renaming - Updated with all current open proposal PRs (kroxylicious#70, kroxylicious#82, kroxylicious#83, kroxylicious#85, kroxylicious#88, kroxylicious#93, kroxylicious#94, kroxylicious#96, kroxylicious#98, kroxylicious#99, kroxylicious#100, kroxylicious#101, kroxylicious#103) Proposals 001-019 retain their original numbers. Assisted-By: Claude Sonnet 4.5 <noreply@anthropic.com> Signed-off-by: Keith Wall <kwall@apache.org>

This change simplifies the proposal numbering system by using PR numbers as proposal identifiers, eliminating number collisions and removing the need for a separate allocation process. Changes: - Simplified proposals/README.md to focus on author workflow - Removed index tables (directory listing serves as the index) - Streamlined instructions for creating and renaming proposals - Updated proposal template with workflow instructions - Require PR number in title format: # <PR#> - <Title> - Moved workflow instructions into comment block - Added GitHub workflow to automatically check proposal numbering - Validates both filename and title format - Updates PR description when proposal files don't match PR number - Provides exact commands to fix naming issues - Removes warning once corrected - Handles both added and renamed files - Runs on all PRs (ready for mandatory status check) - Added notification script for existing open PRs - After merge, run notify-open-prs.sh to ask authors to rebase - Workflow will automatically guide them through renaming - Updated with all current open proposal PRs (#70, #82, #83, #85, #88, #93, #94, #96, #98, #99, #100, #101, #103) Proposals 001-019 retain their original numbers. Assisted-By: Claude Sonnet 4.5 <noreply@anthropic.com> Signed-off-by: Keith Wall <kwall@apache.org>

k-wall · 2026-04-28T12:46:16Z

Hi! We've updated the proposal numbering system to use PR numbers as proposal identifiers.

Action required: Please rebase your PR on main.

Once you rebase, you'll need to rename your proposal file and update the title:

git mv proposals/100-sidecar-injection-webhook.md proposals/100-injection-webhook.md
# Update title: remove any old number prefix and add PR number
sed -i.bak '0,/^# /{s/^# \([0-9]\{3\}\|xxx\|nnn\|000\) - /# 100 - /; t; s/^# /# 100 - /}' proposals/100-injection-webhook.md && rm proposals/100-injection-webhook.md.bak
git add proposals/100-injection-webhook.md
git commit -m "Rename proposal to use PR number"
git push

The GitHub workflow will automatically check your proposal file naming after you push and update this PR description if any corrections are still needed.

See proposals/README.md for the updated workflow.

k-wall · 2026-04-28T13:02:07Z

Correction: The previous notification had an incorrect filename. Here are the correct commands:

git mv proposals/100-sidecar-injection-webhook.md proposals/100-sidecar-injection-webhook.md
# Update title: remove any old number prefix and add PR number
sed -i.bak '0,/^# /{s/^# \([0-9]\{3\}\|xxx\|nnn\|000\) - /# 100 - /; t; s/^# /# 100 - /}' proposals/100-sidecar-injection-webhook.md && rm proposals/100-sidecar-injection-webhook.md.bak
git add proposals/100-sidecar-injection-webhook.md
git commit -m "Rename proposal to use PR number"
git push

Sorry for the confusion!

Assisted-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Tom Bentley <tbentley@redhat.com>

Document the KroxyliciousSidecarConfig status subresource (Ready condition, observedGeneration) to match the implementation. Update module names and shared dependencies to reflect actual repo structure. Assisted-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Tom Bentley <tbentley@redhat.com>

Document pod-level securityContext (runAsNonRoot, seccompProfile: RuntimeDefault). Fix probe port from 9190 to management port (9082). Add actual probe parameters from implementation. Assisted-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Tom Bentley <tbentley@redhat.com>

Status update failures should be retried, not silently swallowed. Remove pod-level securityContext from webhook; admins should use PodSecurity admission policies instead to avoid webhook ordering conflicts. Assisted-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Tom Bentley <tbentley@redhat.com>

Per review consensus, delegated plugin image selection is too high risk (arbitrary code execution in the proxy JVM). Admin-specified plugins via spec.plugins remain unaffected. Assisted-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Tom Bentley <tbentley@redhat.com>

Move targetBootstrapServers, bootstrapPort, nodeIdRange, and targetClusterTls under a virtualClusters list. Remove delegated annotations section. Rewrite target cluster selection as a virtual clusters section. Update trust model, port allocation table, and future delegation to reflect the simplified alpha scope. Assisted-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Tom Bentley <tbentley@redhat.com>

Add design for secretMounts field: motivation (secrets must not be in pod annotations), trust model progression (filesystem isolation now, network isolation later), and rationale for secretMounts over generic volume mounts. Update CRD example. Expand FEATURE_GATES rationale: the API server version does not reveal which feature gates are enabled, so explicit configuration avoids additional RBAC. Assisted-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Tom Bentley <tbentley@redhat.com>

Use sidecar.kroxylicious.io/injection: disabled for pod opt-out, matching the namespace label key. Assisted-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Tom Bentley <tbentley@redhat.com>

The current container-name check is weak against adversarial pods and does not detect config drift during reinvocation. Document comparing the proxy-config annotation against the expected output as a stronger alternative, noting that Jackson serialisation is deterministic. Assisted-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Tom Bentley <tbentley@redhat.com>

Update CRD group from kroxylicious.io to sidecar.kroxylicious.io to avoid conflicts with the operator. Add skip labelling section describing the injection-skipped label for operator observability. Assisted-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Tom Bentley <tbentley@redhat.com>

Move Config injection and Configuration drift detection before Status, since Status references the config-generation annotation those sections introduce. Fix "an internal errors" typo. Assisted-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Tom Bentley <tbentley@redhat.com>

Assisted-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Tom Bentley <tbentley@redhat.com>

Signed-off-by: Tom Bentley <tbentley@redhat.com>

tombentley · 2026-05-06T01:06:58Z

@robobario @SamBarker @k-wall I've tried to address your previous comments, and also bring this into live with the implementation PR. Please take another look.

robobario · 2026-05-06T02:23:03Z

+- MutatingWebhookConfiguration
+- cert-manager Certificate (optional)
+
+**TLS**: Kubernetes requires HTTPS for admission webhooks. The primary path uses cert-manager with a self-signed issuer. A manual alternative (admin provides cert/key Secret) is documented. The webhook watches cert files for rotation and reloads the SSLContext.


should we mention that initially we will only support PEMs with PKCS8 RSA/EC private keys, and whatever the cert format is we can parse.

SamBarker · 2026-05-07T04:18:55Z

+
+The mount path is derived automatically from the `name` field — the admin does not specify `mountPath` directly. This keeps the sidecar's filesystem layout under webhook control, consistent with the operator's approach of using fixed base paths for secret volumes.
+
+**Why `secretMounts` over generic `sidecarVolumes`/`sidecarVolumeMounts`?** It signals clear intent (admin-controlled secrets for the sidecar), limits the surface to Secrets rather than arbitrary volumes, and is easier to validate. The field can be generalised later without breaking the existing API.


secretMounts[].secretName is a cross-resource reference — without validating
that the referenced Secret exists in the pod's namespace, a
KroxyliciousSidecarConfig with a bad reference passes admission silently and
only fails when Kubernetes tries to mount the volume at scheduling time. This
validation needs to be part of the alpha implementation, not deferred. It is also
exactly the case where Ready=False would be most valuable — see separate
comment on § Status.

The operator has validation logic for LocalObjectReference fields that handles
this; the webhook should follow the same pattern.

secretMounts[].secretName is a cross-resource reference — without validating that the referenced Secret exists in the pod's namespace, a KroxyliciousSidecarConfig with a bad reference passes admission silently and only fails when Kubernetes tries to mount the volume at scheduling time.

True.

This validation needs to be part of the alpha implementation, not deferred.

I don't think that's necessary -- if we were to add this later it would not impact the API or be a breaking change. So I think 'need' is too strong.

It is also exactly the case where Ready=False would be most valuable — see separate comment on § Status.

The operator has validation logic for LocalObjectReference fields that handles this; the webhook should follow the same pattern.

While I agree the concern is valid, but let's consider the implications of doing this:

The webhook currently needs zero Secret access. Adding validation means get on Secrets in every namespace where a sidecar config might exist. A webhook service account that can read arbitrary Secrets is a high-value target, and turns a compromise of the webhook into a credential exfiltration vector. This directly undermines the "minimal RBAC" design the proposal currently has. So Expecting a kube admin to grant this unknown new webhook access to all the secrets in all the namespaces seems to be asking a heck of a lot.

While it provides a way to notify the person configuring the sidecar config at the time the sidecar config is created, there's nothing stopping the secret from being deleted afterwards. It's at the point when the pod's created that the secret needs to exist. But under the two actor design there's no guarantee that the side car admin is paying attention at the point the pod is created by the app owner.

The operator analogy is misleading. The operator has a controller that continuously reconciles, so it can re-check references and update status. The webhook fires once at pod admission. If it sets Ready=False because a Secret is missing, when does it set Ready=True? Only options: watch all Secrets (enormous scope, back to the RBAC problem) or only update on the next pod admission (leaving the status stale and misleading for potentially hours or days).

A pod referencing a nonexistent Secret via a volume mount won't schedule. kubectl describe pod shows exactly what's missing. This is a standard failure mode every Kubernetes operator knows how to debug. Adding a second layer of validation doesn't prevent the problem, it just moves the error message to a different place, while adding complexity and RBAC surface.

If you validate secretMounts, logical consistency demands validating trustAnchorSecretRef too, and any future secret references. This sets a precedent where the webhook must validate all cross-resource references, compounding the RBAC and complexity costs.

In progressive delivery workflows (Argo CD, Flux), the KroxyliciousSidecarConfig and the Secret may be in different apps or kustomizations and deploy in arbitrary order. Rejecting the config because the Secret doesn't exist yet breaks these workflows.

Thanks for the detailed response. The RBAC concern and the progressive delivery point are well taken — I accept that validating at config-creation time is genuinely problematic.

But I want to push back on the framing that kubectl describe pod is an adequate UX story here. This isn't injection being skipped — the sidecar gets injected, the pod fails to start, and the only signal is a volume mount error buried in pod events. The KroxyliciousSidecarConfig shows Ready=True. There's no trail back to the config.

This is also the case you named as the prime example of where Ready=False would be valuable ("e.g. referencing a non-existent Secret"). If that case can't be covered, the Ready=False story is narrower than it appeared.

I'm not insisting on fixing this in alpha — the reconciliation problem is real. But I think the proposal should explicitly acknowledge this as a known UX gap rather than leaving it implicit, and note that a future reconciliation controller is the right fix.

Sketching a future enhancement that addresses the UX gap without changing the webhook's minimal-RBAC design:

Operator sidecar reconciler (future)

The operator already has the RBAC, reconciliation loop, and CRD machinery that the webhook deliberately avoids. A future operator enhancement could add a reconciler that watches KroxyliciousSidecarConfig resources cluster-wide and provides what the webhook can't:

Referential integrity: watch referenced Secrets and TLS trust anchor Secrets. Set Ready=False (reason: MissingSecret, with the name) when a reference is broken; flip back to Ready=True when resolved. This handles both the initial-creation case and the "Secret deleted after the fact" case that a one-shot webhook check can't catch.

Drift surfacing: watch pods with sidecar.kroxylicious.io/config-generation annotations. Compare against the current KroxyliciousSidecarConfig generation and surface an UpToDate condition on the CRD status — "3 pods running stale config" — without requiring manual jq queries.

Optional auto-roll (opt-in): if KroxyliciousSidecarConfig carries a sidecar.kroxylicious.io/auto-roll: enabled annotation, the reconciler triggers a rolling restart of stale pods by patching the owning Deployment/StatefulSet pod template annotation (the standard kubectl rollout restart mechanism). Opt-in so teams that manage their own rollouts aren't surprised.

The webhook stays unchanged — inject, set Ready=True, done. The operator adds observability and optionally remediation on top. Neither component needs to compromise its design to support the other.

This is the right vehicle for the "future reconciler" the proposal already gestures at in § Configuration drift detection.

the sidecar gets injected, the pod fails to start, and the only signal is a volume mount error buried in pod events

But that is something that anyone familiar with kube will know how to deal with.

There's no trail back to the config.

That's not really true. If you describe the pod you'll see the secret volume, the volume mount, that the volume mount is in an injected sidecar. I think that's enough for a competant person or AI to figure out that they need to involve the kube administrator (if they don't directly know about the kroxylicious sidecar already). It's not exemplary UX, but it's enough.

If that case can't be covered, the Ready=False story is narrower than it appeared.

I get your disappointment. I'd also like to have a better UX here. But that just doesn't seem to fit within the role of a webhook, which is what this proposal is about.

But I think the proposal should explicitly acknowledge this as a known UX gap rather than leaving it implicit,

I'm happy to do that.

note that a future reconciliation controller is the right fix.

I don't think either of us really know the extent to which this is a real problem for people in practice. So I'd prefer to avoid making something that looks like a commitment to future work in this proposal.

Operator sidecar reconciler (future)

We've been following the path that the admission webhook is strictly orthogonal to the current operator. That's because :

we think there will be people who are interested in one but not the other, because

they're fundamentally different deployment styles for using a proxy, with different pros and cons.

So I think deciding to extend the scope of the existing operator to include the sidecar CR is premature.

Sure, it's one option. Another option is to have a dedicated operator for sidecar configs. Another option is to do nothing. How we weigh these options depends on feedback we get from users.

SamBarker · 2026-05-07T04:26:24Z

+### Virtual clusters
+
+The `virtualClusters` list defines the target Kafka clusters that the sidecar proxy will serve. Each entry specifies a name, target bootstrap address, localhost listening port, node ID range, and optional TLS configuration.
+


Really like the forward-looking list structure here. Active/passive DR is a
compelling use case — the sidecar exposes both live and DR clusters, and failover
is a config change plus pod restart. Could be a great showcase for the project.

Rename injection-skipped label values to reference CRD kind directly, rewrite idempotency section to state requirements not implementation, add Ready=False for invalid configs, rename FEATURE_GATES to K8S_FEATURE_GATES, and add explicit seccompProfile: RuntimeDefault. Assisted-By: Claude Opus 4.6 <noreply@anthropic.com>

Update proposal to reflect implementation: multi-replica deployment with PDB, "one or more" configs per namespace, replace app-level FAILURE_POLICY with K8s failurePolicy for errors and new UNINJECTED_POD_POLICY for config edge cases. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

SamBarker · 2026-05-07T22:12:26Z

+| `kroxylicious-app` | Already merged | `classpath-plugins/` directory scanning |
+| `kroxylicious-operator` | No | |
+| `kroxylicious-runtime` | No | Used as a dependency, not modified |
+| `kroxylicious-filters` | No | |


Two permissions are missing from this list:

get (or list/watch via informer) on Secrets — needed to validate secretMounts[].secretName references at config-observation time.

update on KroxyliciousSidecarConfig/status — needed to write the Ready condition described in § Status.

tombentley force-pushed the web-hook branch from a666c11 to 50aba24 Compare April 17, 2026 04:03

robobario reviewed Apr 19, 2026

View reviewed changes

Comment thread proposals/100-sidecar-injection-webhook.md Outdated

Comment thread proposals/100-sidecar-injection-webhook.md

Comment thread proposals/100-sidecar-injection-webhook.md Outdated

Comment thread proposals/100-sidecar-injection-webhook.md Outdated

tombentley marked this pull request as ready for review April 23, 2026 08:30

tombentley requested a review from a team as a code owner April 23, 2026 08:30

tombentley moved this to Want to Do in Release 0.21.0 Apr 24, 2026

tombentley added this to Release 0.21.0 Apr 24, 2026

tombentley moved this from Want to Do to Must Do in Release 0.21.0 Apr 24, 2026

tombentley moved this from Must Do to Blocked in Release 0.21.0 Apr 24, 2026

tombentley mentioned this pull request Apr 24, 2026

Add sidecar injection webhook module (Phase 1) kroxylicious/kroxylicious#3790

Merged

tombentley self-assigned this Apr 24, 2026