feat(server,sandbox): move SSH connect and exec onto supervisor session relay#861

Draft
pimlock wants to merge 2 commits into main from feat/supervisor-session-relay

@pimlock pimlock commented Apr 16, 2026

Summary

Introduces a persistent supervisor-to-gateway session (ConnectSupervisor) and migrates /connect/ssh and ExecSandbox onto relay channels, removing the requirement for direct gateway-to-sandbox network connectivity.

  • gRPC control plane: session lifecycle (hello/heartbeat) + relay lifecycle (RelayOpen/RelayOpenResult/RelayClose)
  • HTTP data plane: per-relay reverse HTTP CONNECT tunnels from supervisor to gateway — raw bytes, no protobuf framing
  • Supervisor: dumb byte bridge with zero SSH/NSSH1 protocol awareness

Removes ResolveSandboxEndpoint from the proto, gateway, and K8s driver.

Closes OS-86. Design: RFC 0002, Plan.

TODO

  • Switch SSH daemon to Unix socket listener — removes the need for exposed port 2222 and the NSSH1 handshake entirely (filesystem permissions become the access control boundary)
  • Remove NSSH1 preface/verification from gateway and SSH daemon once Unix socket is in place
  • Harden relay timeout and failure scenarios (supervisor crash mid-relay, gateway restart, concurrent relay limit)
  • Require the supervisor session to be established before the sandbox reports Ready (currently the first connect returns 502 if the session isn't up yet)
  • Add integration test: full relay round-trip (mock supervisor + gateway + relay endpoint)
  • OCSF telemetry for session connect/disconnect, relay open/close, relay bridge errors
  • Clean up session registry on sandbox delete (cleanup_sandbox in compute runtime)
  • Update internal architecture docs (architecture/sandbox-connect.md, architecture/gateway.md, architecture/sandbox.md)
  • Update user-facing docs under docs/ for the new connectivity model

Test plan

  • sandbox exec works through relay (verified locally on nemoclaw cluster)
  • sandbox connect works through relay (verified locally)
  • Unit tests pass (209 server, 96 core, 18 K8s driver: 323 total)
  • All integration tests pass (multiplex, TLS, WebSocket, auth)
  • SFTP/scp through relay
  • SSH port forwarding through relay
  • Concurrent SSH sessions on one supervisor session
  • Gateway restart mid-session
  • Supervisor restart mid-relay

…on relay

Introduce a persistent supervisor-to-gateway session (ConnectSupervisor
bidirectional gRPC RPC) and migrate /connect/ssh and ExecSandbox onto
relay channels coordinated through it.

Architecture:
- gRPC control plane: carries session lifecycle (hello, heartbeat) and
  relay lifecycle (RelayOpen, RelayOpenResult, RelayClose)
- HTTP data plane: for each relay, the supervisor opens a reverse HTTP
  CONNECT to /relay/{channel_id} on the gateway; the gateway bridges
  the client stream with the supervisor stream
- The supervisor is a dumb byte bridge with no SSH/NSSH1 awareness;
  the gateway sends the NSSH1 preface through the relay

Key changes:
- Add ConnectSupervisor RPC and session/relay proto messages
- Add gateway session registry (SupervisorSessionRegistry) with
  pending-relay map for channel correlation
- Add /relay/{channel_id} HTTP CONNECT endpoint
- Rewire /connect/ssh: session lookup + RelayOpen instead of direct
  TCP dial to sandbox:2222
- Rewire ExecSandbox: relay-based proxy instead of direct sandbox dial
- Add supervisor session client with reconnect and relay bridge
- Remove ResolveSandboxEndpoint from proto, gateway, and K8s driver
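
For reviewers, the channel correlation done by the pending-relay map can be sketched roughly as follows. This is an illustration under assumptions, not the `SupervisorSessionRegistry` implementation: the class `PendingRelays`, its methods, and `wait_for_supervisor` are hypothetical names, and the channel ID format is invented. The idea is that the gateway registers a channel ID before sending RelayOpen, then waits until the supervisor's reverse CONNECT to `/relay/{channel_id}` claims that entry.

```python
import asyncio
import uuid


class PendingRelays:
    """Hypothetical stand-in for the pending-relay map in the session registry."""

    def __init__(self) -> None:
        self._pending: dict[str, asyncio.Future] = {}

    def open(self) -> tuple[str, asyncio.Future]:
        """Register a new channel; the ID would travel in RelayOpen."""
        channel_id = uuid.uuid4().hex  # assumed ID format
        fut = asyncio.get_running_loop().create_future()
        self._pending[channel_id] = fut
        return channel_id, fut

    def claim(self, channel_id: str, supervisor_stream: object) -> bool:
        """Called when the supervisor connects to /relay/{channel_id}."""
        fut = self._pending.pop(channel_id, None)
        if fut is None or fut.done():
            return False  # unknown, already claimed, or timed out
        fut.set_result(supervisor_stream)
        return True

    def cancel(self, channel_id: str) -> None:
        """Drop a channel that will never be claimed."""
        fut = self._pending.pop(channel_id, None)
        if fut is not None and not fut.done():
            fut.cancel()


async def wait_for_supervisor(relays: PendingRelays, channel_id: str,
                              fut: asyncio.Future, timeout: float) -> object:
    """Wait for the supervisor side of the relay, cleaning up on timeout."""
    try:
        return await asyncio.wait_for(fut, timeout)
    except asyncio.TimeoutError:
        relays.cancel(channel_id)
        raise
```

Each channel ID can be claimed at most once, and an entry that times out is removed, so a late or replayed CONNECT for the same ID is rejected rather than bridged.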

Closes OS-86

copy-pr-bot bot commented Apr 16, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@pimlock pimlock added the test:e2e Requires end-to-end coverage label Apr 16, 2026
When a sandbox first reports Ready, the supervisor session may not have
completed its gRPC handshake yet. Instead of failing immediately with
502 / "supervisor session not connected", the relay open now retries
with exponential backoff (100ms → 2s) for up to 15 seconds.

This fixes the race between K8s marking the pod Ready and the
supervisor establishing its ConnectSupervisor session.
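
The retry policy described above (exponential backoff from 100ms up to a 2s cap, bounded by a 15 second deadline) could look roughly like the sketch below. The doubling factor is an assumption, since the commit message only gives the start, the cap, and the deadline; `backoff_delays`, `retry_until`, and the `attempt` callback are hypothetical names, with `attempt` standing in for the session lookup and returning `None` while the supervisor session is not yet connected.

```python
import time


def backoff_delays(initial: float = 0.1, cap: float = 2.0, factor: float = 2.0):
    """Yield 0.1, 0.2, 0.4, 0.8, 1.6, 2.0, 2.0, ... (factor is an assumption)."""
    delay = initial
    while True:
        yield delay
        delay = min(delay * factor, cap)


def retry_until(attempt, deadline_s: float = 15.0,
                sleep=time.sleep, clock=time.monotonic):
    """Call attempt() until it returns a non-None value or the deadline passes."""
    start = clock()
    for delay in backoff_delays():
        result = attempt()
        if result is not None:
            return result
        remaining = deadline_s - (clock() - start)
        if remaining <= 0:
            raise TimeoutError("supervisor session not connected")
        # Never sleep past the deadline.
        sleep(min(delay, remaining))
```

The `sleep` and `clock` parameters are injectable only so the policy can be exercised in tests without real waiting.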