Envoy AI Gateway v0.5.0
Multi-gateway configuration, prompt caching cost savings, fine-grained MCP authorization, OpenAI Responses API, and Google Search grounding for Gemini.
Envoy AI Gateway v0.5.0 makes multi-gateway deployments easier with the new GatewayConfig CRD, cuts costs with prompt caching for AWS Bedrock and GCP Claude, and unlocks fine-grained access control with CEL-based MCP authorization. Developers gain OpenAI Responses API support, Google Search grounding for Gemini, and the ability to mutate request bodies per-route. Under the hood, the switch to sonic JSON processing reduces latency across all requests.
✨ New Features
Gateway Configuration
- New `GatewayConfig` CRD — Gateway-scoped configuration via a new custom resource. Reference it from a Gateway via the `aigateway.envoyproxy.io/gateway-config` annotation to configure the external processor container (env vars, resource requirements, container settings). Multiple Gateways can share the same `GatewayConfig`.
- Configurable endpoint prefixes — New `prefix` field on `VersionedAPISchema` for backends with non-standard OpenAI-compatible prefixes (e.g., Gemini's `/v1beta/openai`, Cohere's `/compatibility/v1`).
OpenAI API Support
- OpenAI Responses API (`/v1/responses`) — Full support with streaming and non-streaming modes, function calling, MCP tools, reasoning, multi-turn conversations, multimodal capabilities, token usage tracking, and OpenInference tracing.
Provider Caching Enhancements
- Prompt caching for AWS Bedrock Claude — Reuse cached system prompts with Bedrock Anthropic models. Cache point markers are handled automatically with separate tracking for cache creation and cache hit tokens.
- Prompt caching for GCP Vertex AI Claude — Same cost-saving prompt caching for Claude models on GCP Vertex AI for system prompts and few-shot examples.
MCP Gateway Enhancements
- Fine-grained authorization with CEL, JWT claims, and external auth — Write expressive CEL rules using request attributes (HTTP method, headers, JWT claims, tool names, call arguments), enforce access based on JWT claim values, or delegate to external gRPC/HTTP authorization services.
- Real-time tool list synchronization — MCP clients automatically receive `notifications/tools/list_changed` when MCPRoutes update, refreshing available tools without reconnection.
- Stdio server proxy in standalone mode — Run command-line MCP tools (e.g., `npx`-based servers) without code changes via the `aigw` CLI HTTP proxy.
- Improved OAuth metadata discovery — Well-known endpoints now serve at the MCPRoute path prefix for correct authorization discovery across multiple routes.
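To illustrate the authorization model, here is a hedged sketch of a CEL-based rule block using the `authorization`, `defaultAction`, `rules`, and `extAuth` names listed in the API Updates section. The rule structure, the `jwt.claims` and `mcp.tool` attribute paths, and the rule names shown are assumptions for illustration — consult the v0.5 MCPRoute security policy reference for the exact schema.

```yaml
# Hypothetical fragment of an MCP security policy; field shapes are assumptions.
authorization:
  defaultAction: Deny
  rules:
    # Allow admins (per JWT claim) to call read-only tools
    - name: allow-admin-read-tools
      action: Allow
      expression: >-
        has(jwt.claims.groups) && "admin" in jwt.claims.groups
        && mcp.tool.name.startsWith("read_")
  # Optionally delegate remaining decisions to an external gRPC/HTTP authorizer
  extAuth:
    grpc:
      backendRef:
        name: my-authz-service
        port: 9191
```

The deny-by-default posture with narrow allow rules mirrors common policy-engine practice; a single CEL expression can combine JWT claims, tool names, and call arguments in one rule.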
Inference Extension
- Security policies for inference pools — Apply `BackendSecurityPolicy` to `InferencePool` resources for consistent authentication across dynamically-selected inference endpoints.
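A minimal sketch of what targeting an `InferencePool` might look like, based on the `targetRefs` expansion described in the API Updates section. The policy name, secret name, and auth type here are placeholders, and the exact `BackendSecurityPolicy` spec layout should be verified against the v0.5 API reference.

```yaml
# Sketch only — names are placeholders; verify field layout against the docs.
apiVersion: aigateway.envoyproxy.io/v1alpha1
kind: BackendSecurityPolicy
metadata:
  name: inference-pool-auth
spec:
  targetRefs:
    # New in v0.5: InferencePool alongside AIServiceBackend
    - group: inference.networking.x-k8s.io
      kind: InferencePool
      name: my-inference-pool
  type: APIKey
  apiKey:
    secretRef:
      name: my-api-key-secret
```

Attaching the policy to the pool rather than to individual backends means every dynamically-selected endpoint in the pool authenticates the same way.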
Gemini Provider Enhancements
- Google Search grounding — Give Gemini models access to real-time web information via the `google_search` tool type, with domain filtering, blocking confidence thresholds, and time range restrictions.
- Consistent thinking configuration across providers — The same `thinking` configuration works for both Anthropic and Gemini models, enabling provider-agnostic reasoning features.
- Gemini 3 reasoning and image quality controls — `thinking_level` (reasoning depth) and `media_resolution` (image quality vs. speed), with graceful degradation on older Gemini versions.
- Visibility into model reasoning — Thought summaries are extracted and surfaced from Gemini responses when thinking is enabled.
- Enterprise web search integration — `enterprise_search` tool type for grounding responses in organization-specific search infrastructure and data sources.
Traffic Management
- Route-level body mutation — Inject or remove JSON fields in request bodies per-backend using `bodyMutation` with `set` and `remove` operations. Route-level settings override backend defaults.
- AWS Bedrock service tier control — Choose between standard, flex, priority, and reserved tiers for latency-sensitive or cost-optimized workloads, with automatic fallback handling.
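As a rough sketch, a route rule's backend reference might carry a `bodyMutation` like the following, using the `set` (field/value pairs) and `remove` (field names) shapes described in the API Updates section. The match condition, backend name, and exact field spellings are assumptions for illustration.

```yaml
# Hypothetical AIGatewayRoute rule fragment; field shapes are assumptions.
rules:
  - matches:
      - headers:
          - name: x-ai-eg-model
            value: gpt-4o
    backendRefs:
      - name: openai-backend
        bodyMutation:
          set:
            # Inject a JSON field into every request body for this backend
            - field: service_tier
              value: '"flex"'
          remove:
            # Strip a field the upstream provider should not receive
            - user
```

Because route-level settings override backend defaults, a shared backend can still receive different body shapes depending on which route matched.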
Observability Enhancements
- Per-provider cost attribution — New `gen_ai.provider.name` metric attribute for filtering dashboards and alerts by provider.
- Full tracing for Anthropic Messages API — OpenInference-compliant tracing for the native `/messages` endpoint, compatible with Arize Phoenix and OpenTelemetry platforms.
- Cohere Rerank visibility — Full OpenTelemetry support for Cohere's v2 rerank endpoint, capturing query, documents, and relevance scores.
Performance and Operations
- Faster request processing with sonic JSON — Migrated to bytedance/sonic for JSON encoding/decoding with measurable latency improvements and lower CPU usage.
- Faster cross-namespace reference validation — Optimized ReferenceGrant indexing reduces controller reconciliation time.
- Improved MCP proxy throughput — HTTP connection reuse across MCP proxy requests eliminates per-request connection overhead.
🔗 API Updates
- New `GatewayConfig` CRD — Gateway-level configuration with `extProc.kubernetes` for container settings. Reference via the `aigateway.envoyproxy.io/gateway-config` annotation.
- `VersionedAPISchema.prefix` — New `prefix` field replaces overloading `version` for endpoint path customization.
- `AIGatewayRouteRuleBackendRef.bodyMutation` — New field with `set` (field/value pairs) and `remove` (field names) for request body manipulation.
- `LLMRequestCostType.CacheCreationInputToken` — New cost type for tokens written to cache, separate from `CachedInputToken`.
- `MCPRouteSecurityPolicy` authorization fields — New `authorization` block with `defaultAction`, a `rules` array (CEL, JWT scopes/claims, tools targeting), and `extAuth` for external authorization.
- `BackendSecurityPolicy.targetRefs` expansion — Now accepts `InferencePool` (`inference.networking.x-k8s.io`) in addition to `AIServiceBackend`.
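To show how the new cost type might be used alongside the existing ones, here is a hedged sketch of an `llmRequestCosts` block that tracks cache writes and cache hits separately. The `metadataKey` names are arbitrary, and the placement of this block within the route spec is an assumption — check the v0.5 API reference for the authoritative location.

```yaml
# Sketch: distinguishing cache-write tokens from cache-hit tokens
# for cost attribution; metadata key names are placeholders.
llmRequestCosts:
  - metadataKey: cache_creation_tokens
    type: CacheCreationInputToken   # tokens written to the prompt cache
  - metadataKey: cached_tokens
    type: CachedInputToken          # tokens served from the cache
  - metadataKey: input_tokens
    type: InputToken
```

Tracking the two cache counters separately matters for billing, since providers typically price cache writes above cache reads.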
Deprecations
- `AIGatewayFilterConfigExternalProcessor.resources` — Deprecated. Use `GatewayConfig` instead. Will be removed in v0.6.
- `version` field as prefix for OpenAI schema — Deprecated. Use the new `prefix` field. Legacy behavior will be removed in v0.6.
🐛 Bug Fixes
- AWS Bedrock Claude streaming reliability — Streaming responses from Bedrock Claude models now complete correctly without truncation.
- Gemini streaming token counts — Token usage in Gemini streaming responses now matches OpenAI format.
- Multi-chunk Gemini tool calls — Tool calls spanning multiple streaming chunks now have correct indices.
- GCP Claude reasoning content — Reasoning/thinking content correctly passes through for Claude on GCP Vertex AI.
- Zero-weight backend references — Backend references with zero weight no longer cause routing errors.
- Umbrella chart image pull secrets — Helm deployments within umbrella charts correctly inherit `global.imagePullSecrets`.
- GCP global region backends — Vertex AI backends with the global region now work correctly.
- Accurate per-token latency metrics — Fixed integer truncation in the `time_per_output_token` calculation.
- Anthropic token counting — Improved accuracy of input and output token counts for Anthropic models.
📖 Upgrade Guidance
Migrating to GatewayConfig
If you're using `AIGatewayFilterConfigExternalProcessor.resources`, migrate to the new `GatewayConfig` CRD:
- Create a `GatewayConfig` resource:

```yaml
apiVersion: aigateway.envoyproxy.io/v1alpha1
kind: GatewayConfig
metadata:
  name: my-gateway-config
  namespace: default
spec:
  extProc:
    kubernetes:
      resources:
        requests:
          cpu: "100m"
          memory: "128Mi"
        limits:
          cpu: "500m"
          memory: "512Mi"
      env:
        - name: OTEL_EXPORTER_OTLP_ENDPOINT
          value: "http://otel-collector:4317"
```

- Reference it from your Gateway:

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: ai-gateway
  annotations:
    aigateway.envoyproxy.io/gateway-config: my-gateway-config
```

Migrating Endpoint Prefix Configuration
Before:

```yaml
schema:
  name: OpenAI
  version: "/v1beta/openai"  # Deprecated
```

After:

```yaml
schema:
  name: OpenAI
  prefix: "/v1beta/openai"
```

📦 Dependencies
| Dependency | Version |
|---|---|
| Go | 1.25.6 |
| Envoy Gateway | v1.6 |
| Envoy Proxy | v1.36.4 |
| Gateway API | v1.4.0 |
| Gateway API Inference Extension | v1.0.2 |
🙏 Acknowledgements
Special thanks to the growing community of adopters including Bloomberg, LY Corporation, Alan by Comma Soft, and NRP for their production insights, everyone who reported bugs, submitted PRs, and participated in design discussions, and the Envoy Gateway team for continued collaboration.
🔮 What's Next
- Additional provider integrations (AWS Bedrock InvokeModel, Gemini embeddings, Azure/AKS workload identity)
- Batch inference APIs for high-volume workloads
- Advanced caching strategies with prompt cache key and retention controls
- Upstream provider quota policies
- Sensitive data redaction for request and response bodies