# VM GPU passthrough: implementation plan

> Design: [vm-gpu-passthrough.md](vm-gpu-passthrough.md)

## Phase 0 -- Specification and failing test (current)

- [x] Design doc.
- [x] Phase 0.5 VMM decision (cloud-hypervisor selected).
- [ ] **`gpu_passthrough` module** integrated into `crates/openshell-vm/src/`:
  - `probe_host_nvidia_vfio_readiness()` -- Linux sysfs scan; non-Linux returns `UnsupportedPlatform` (sketched below).
  - `nvidia_gpu_available_for_vm_passthrough()` -- hard-coded `false` until end-to-end passthrough works.
  - **Note:** `gpu_passthrough.rs` and `gpu_passthrough_implementation.rs` exist as untracked files at the repo root but are not wired into the crate module tree (`lib.rs` does not `mod gpu_passthrough;`). Move them into `crates/openshell-vm/src/`, add `pub mod gpu_passthrough;`, and ensure `cargo test -p openshell-vm` compiles them.
- [ ] **Failing integration test** `tests/gpu_passthrough_implementation.rs` -- documents the target and fails until implementation is finished.

**Running the red test:** `cargo test -p openshell-vm --test gpu_passthrough_implementation`

**Note:** `mise run test` uses `cargo test --workspace --exclude openshell-vm`, so default CI stays green.
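
A minimal sketch of the sysfs scan, for orientation -- only `UnsupportedPlatform` and `VfioBoundReady` are named above; the other variants and field names are illustrative, and the clean-IOMMU-group check from the Test evolution section is left out:

```rust
use std::fs;

/// Illustrative readiness states; only `UnsupportedPlatform` and
/// `VfioBoundReady` are named in this plan, the rest are assumptions.
pub enum VfioReadiness {
    UnsupportedPlatform,
    NoNvidiaGpu,
    BoundToOtherDriver(String),
    VfioBoundReady { pci_addr: String, iommu_group: String },
}

pub fn probe_host_nvidia_vfio_readiness() -> VfioReadiness {
    if !cfg!(target_os = "linux") {
        return VfioReadiness::UnsupportedPlatform;
    }
    let Ok(devices) = fs::read_dir("/sys/bus/pci/devices") else {
        return VfioReadiness::NoNvidiaGpu;
    };
    for entry in devices.flatten() {
        let dev = entry.path();
        // NVIDIA's PCI vendor ID is 0x10de; first NVIDIA device wins in this sketch.
        let vendor = fs::read_to_string(dev.join("vendor")).unwrap_or_default();
        if vendor.trim() != "0x10de" {
            continue;
        }
        let pci_addr = entry.file_name().to_string_lossy().into_owned();
        // `driver` and `iommu_group` are symlinks; their basenames say which
        // driver the GPU is bound to and which IOMMU group it belongs to.
        let driver = fs::read_link(dev.join("driver"))
            .ok()
            .and_then(|p| p.file_name().map(|n| n.to_string_lossy().into_owned()));
        let group = fs::read_link(dev.join("iommu_group"))
            .ok()
            .and_then(|p| p.file_name().map(|n| n.to_string_lossy().into_owned()));
        return match (driver.as_deref(), group) {
            (Some("vfio-pci"), Some(iommu_group)) => {
                VfioReadiness::VfioBoundReady { pci_addr, iommu_group }
            }
            (Some(other), _) => VfioReadiness::BoundToOtherDriver(other.to_owned()),
            _ => VfioReadiness::NoNvidiaGpu,
        };
    }
    VfioReadiness::NoNvidiaGpu
}
```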

---

## Phase 1 -- VMM backend abstraction and cloud-hypervisor integration

### 1a. Backend trait and libkrun extraction

Refactor only -- no behavior changes. Existing tests must still pass.

- [ ] Create `src/backend.rs` with the `VmBackend` trait:

```rust
pub trait VmBackend {
    fn launch(&self, config: &VmLaunchConfig) -> Result<i32, VmError>;
}

pub struct VmLaunchConfig {
    pub base: VmConfig,
    pub vfio_device: Option<String>,
}
```

- [ ] Create `src/backend/libkrun.rs` -- move into `LibkrunBackend`:
  - `VmContext` struct and all methods (current `lib.rs` lines 584-811)
  - gvproxy setup block inside `NetBackend::Gvproxy` (lines 1337-1466)
  - fork + waitpid + signal forwarding (lines 1525-1710)
  - bootstrap block (lines 1648-1663)
- [ ] Extract shared gvproxy startup into a helper used by both backends.
- [ ] Update `launch()` to dispatch:

```rust
pub fn launch(config: &VmLaunchConfig) -> Result<i32, VmError> {
    // ... existing pre-launch checks ...

    if config.vfio_device.is_some() {
        #[cfg(not(target_os = "linux"))]
        return Err(VmError::HostSetup(
            "GPU passthrough requires Linux with KVM and IOMMU".into(),
        ));

        #[cfg(target_os = "linux")]
        {
            let backend = CloudHypervisorBackend::new()?;
            return backend.launch(config);
        }
    }

    LibkrunBackend.launch(config)
}
```

- [ ] `ffi.rs` stays as-is -- only used by `LibkrunBackend`.
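
For orientation, how the two backends might sit under the trait; constructors and internals are deferred to 1b and 1c, and nothing here is settled API:

```rust
use std::path::PathBuf;

/// Existing libkrun path, moved out of `lib.rs` unchanged.
pub struct LibkrunBackend;

/// New backend; holds the per-VM API socket path (the child process handle
/// gets added once the 1b lifecycle code exists).
pub struct CloudHypervisorBackend {
    api_socket: PathBuf,
}

impl VmBackend for LibkrunBackend {
    fn launch(&self, _config: &VmLaunchConfig) -> Result<i32, VmError> {
        // fork + krun_* FFI + waitpid + signal forwarding, exactly as today
        todo!()
    }
}

impl VmBackend for CloudHypervisorBackend {
    fn launch(&self, _config: &VmLaunchConfig) -> Result<i32, VmError> {
        // spawn cloud-hypervisor, drive its REST API, wait, clean up (see 1b)
        todo!()
    }
}
```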

### 1b. cloud-hypervisor backend

- [ ] Create `src/backend/cloud_hypervisor.rs` implementing `VmBackend`.
- [ ] REST API client -- HTTP/1.1 over Unix socket, ~5 endpoints:

```
PUT /api/v1/vm.create    -- configure VM
PUT /api/v1/vm.boot      -- start VM
PUT /api/v1/vm.shutdown  -- graceful stop
GET /api/v1/vm.info      -- status check
PUT /api/v1/vm.delete    -- cleanup
```

Use `hyper` over a Unix socket (already in the dependency tree) or raw HTTP. Avoid adding a `cloud-hypervisor-client` crate for ~5 calls.
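
If the raw-HTTP route is chosen, the whole client fits in one helper over `std::os::unix::net::UnixStream`; a blocking sketch (no status-line parsing -- callers check for a 2xx in the returned string):

```rust
use std::io::{Read, Write};
use std::os::unix::net::UnixStream;
use std::path::Path;

/// Issue a single HTTP/1.1 request against cloud-hypervisor's API socket,
/// e.g. `api_put(sock, "/api/v1/vm.boot", "")`. Returns the raw response.
fn api_put(socket: &Path, endpoint: &str, body: &str) -> std::io::Result<String> {
    let mut stream = UnixStream::connect(socket)?;
    // HTTP/1.1 requires a Host header even over a Unix socket.
    let request = format!(
        "PUT {endpoint} HTTP/1.1\r\nHost: localhost\r\nAccept: application/json\r\n\
         Content-Type: application/json\r\nContent-Length: {}\r\nConnection: close\r\n\r\n{body}",
        body.len()
    );
    stream.write_all(request.as_bytes())?;
    let mut response = String::new();
    stream.read_to_string(&mut response)?;
    Ok(response)
}
```

Keeping it this thin means swapping in `hyper` later only touches this one function.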

- [ ] VM create payload mapping from `VmConfig`:

```json
{
  "cpus": { "boot_vcpus": 4 },
  "memory": { "size": 8589934592 },
  "payload": {
    "kernel": "/path/to/vmlinux",
    "cmdline": "console=hvc0 root=virtiofs:rootfs rw init=/srv/openshell-vm-init.sh"
  },
  "fs": [
    { "tag": "rootfs", "socket": "/path/to/virtiofsd.sock", "num_queues": 1, "queue_size": 1024 }
  ],
  "disks": [
    { "path": "/path/to/state.raw", "readonly": false }
  ],
  "net": [
    { "socket": "/path/to/gvproxy-qemu.sock", "mac": "5a:94:ef:e4:0c:ee" }
  ],
  "vsock": {
    "cid": 3,
    "socket": "/path/to/vsock.sock"
  },
  "devices": [
    { "path": "/sys/bus/pci/devices/0000:41:00.0/" }
  ],
  "serial": { "mode": "File", "file": "/path/to/console.log" },
  "console": { "mode": "Off" }
}
```
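
The payload can be assembled with `serde_json::json!` straight from the config rather than dedicated request structs; a sketch, assuming `serde_json` is available in the workspace and using illustrative `VmConfig` field names (`vcpus`, `memory_bytes`, ...):

```rust
use serde_json::json;

/// Build the vm.create body. Only part of the payload is shown; the
/// `cfg.base.*` accessors are placeholders for whatever `VmConfig` exposes.
fn vm_create_body(cfg: &VmLaunchConfig) -> serde_json::Value {
    let mut body = json!({
        "cpus": { "boot_vcpus": cfg.base.vcpus },
        "memory": { "size": cfg.base.memory_bytes },
        "payload": {
            "kernel": cfg.base.kernel_path,
            "cmdline": cfg.base.kernel_cmdline
        },
        "console": { "mode": "Off" }
    });
    // VFIO assignment is the only GPU-specific piece of the payload.
    if let Some(bdf) = &cfg.vfio_device {
        body["devices"] = json!([{ "path": format!("/sys/bus/pci/devices/{bdf}/") }]);
    }
    body
}
```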

- [ ] Process lifecycle:
  1. Start `cloud-hypervisor --api-socket /tmp/ovm-chv-{id}.sock` as a subprocess
  2. Wait for the API socket to appear (exponential backoff, same pattern as gvproxy -- sketched below)
  3. `PUT vm.create` with the config payload
  4. `PUT vm.boot`
  5. Parent waits on the subprocess
  6. Signal forwarding: SIGINT/SIGTERM -> `PUT vm.shutdown` + subprocess SIGTERM
  7. Cleanup: remove the API socket
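
Step 2's wait can reuse the gvproxy pattern; a sketch with capped exponential backoff, assuming the existing `VmError::HostSetup` variant:

```rust
use std::path::Path;
use std::time::{Duration, Instant};

/// Poll for the API socket with capped exponential backoff, mirroring the
/// existing gvproxy wait. Errors out if the VMM never creates it.
fn wait_for_api_socket(path: &Path, timeout: Duration) -> Result<(), VmError> {
    let start = Instant::now();
    let mut delay = Duration::from_millis(10);
    while !path.exists() {
        if start.elapsed() > timeout {
            return Err(VmError::HostSetup(
                format!(
                    "cloud-hypervisor API socket {} did not appear within {timeout:?}",
                    path.display()
                )
                .into(),
            ));
        }
        std::thread::sleep(delay);
        delay = (delay * 2).min(Duration::from_millis(500));
    }
    Ok(())
}
```

With the socket up, steps 3-4 reduce to two calls through the `api_put` sketch above: `vm.create` with the JSON body, then `vm.boot` with an empty body.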

### 1c. Kernel extraction and build pipeline

- [ ] Modify `build-libkrun.sh`: after building libkrunfw, copy `vmlinux` from the kernel build tree to `target/libkrun-build/vmlinux` before cleanup.
- [ ] Add to `openshell.kconfig` (harmless for non-GPU boots):

```
CONFIG_PCI=y
CONFIG_PCI_MSI=y
CONFIG_DRM=y
CONFIG_MODULES=y
CONFIG_MODULE_UNLOAD=y
```

- [ ] Add to `pins.env`:

```bash
CLOUD_HYPERVISOR_VERSION="${CLOUD_HYPERVISOR_VERSION:-v42.0}"
VIRTIOFSD_VERSION="${VIRTIOFSD_VERSION:-v1.13.0}"
```

- [ ] Create `build-cloud-hypervisor.sh` (or a download step): fetch the pre-built static binary from the cloud-hypervisor GitHub releases for the target architecture.
- [ ] Update `package-vm-runtime.sh`: include `cloud-hypervisor`, `vmlinux`, and `virtiofsd` in the runtime tarball for Linux builds.
- [ ] `validate_runtime_dir()` in `lib.rs` must **not** require GPU binaries. Only `CloudHypervisorBackend::new()` validates their presence.
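
A sketch of the `CloudHypervisorBackend::new()` gate from the last bullet; `vm_runtime_dir()` is a hypothetical stand-in for however the crate already locates the runtime bundle:

```rust
impl CloudHypervisorBackend {
    /// GPU binaries are checked here, not in `validate_runtime_dir()`,
    /// so non-GPU launches never require them.
    pub fn new() -> Result<Self, VmError> {
        let runtime_dir = vm_runtime_dir()?; // hypothetical existing helper
        for required in ["cloud-hypervisor", "vmlinux", "virtiofsd"] {
            let candidate = runtime_dir.join(required);
            if !candidate.is_file() {
                return Err(VmError::HostSetup(
                    format!(
                        "GPU passthrough needs {} in the VM runtime bundle",
                        candidate.display()
                    )
                    .into(),
                ));
            }
        }
        // Placeholder socket name; the real path should embed the VM id
        // from step 1 of the 1b lifecycle.
        let api_socket =
            std::env::temp_dir().join(format!("ovm-chv-{}.sock", std::process::id()));
        Ok(Self { api_socket })
    }
}
```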

### 1d. vsock exec agent compatibility

libkrun uses per-port vsock bridging (`krun_add_vsock_port2`): each guest vsock port maps to a host Unix socket. cloud-hypervisor uses standard vhost-vsock with a single socket and CID-based addressing.

- [ ] Update `exec.rs` to support both connection modes:
  - **libkrun**: connect to `vm_exec_socket_path()` (existing)
  - **cloud-hypervisor**: connect via `AF_VSOCK` (CID 3, port 10777) or bridge with `socat`
- [ ] Test exec agent communication (cat, env) over both backends.
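
One way to express the two modes in `exec.rs` is a small transport enum. The cloud-hypervisor arm below assumes the Firecracker-style hybrid handshake on the host-side vsock socket (`CONNECT <port>` / `OK <port>`); if host `AF_VSOCK` is used instead, only that arm changes:

```rust
use std::io::{Read, Write};
use std::os::unix::net::UnixStream;
use std::path::PathBuf;

/// How the host reaches the in-guest exec agent.
pub enum ExecTransport {
    /// libkrun: per-port bridge socket created via `krun_add_vsock_port2`.
    LibkrunUnixSocket(PathBuf),
    /// cloud-hypervisor: host-side vsock socket plus the guest port (10777).
    CloudHypervisorVsock { host_socket: PathBuf, guest_port: u32 },
}

impl ExecTransport {
    pub fn connect(&self) -> std::io::Result<UnixStream> {
        match self {
            ExecTransport::LibkrunUnixSocket(path) => UnixStream::connect(path),
            ExecTransport::CloudHypervisorVsock { host_socket, guest_port } => {
                let mut stream = UnixStream::connect(host_socket)?;
                stream.write_all(format!("CONNECT {guest_port}\n").as_bytes())?;
                // Read the handshake reply one byte at a time so no agent
                // payload bytes end up stranded in a buffered reader.
                let mut reply = Vec::new();
                let mut byte = [0u8; 1];
                loop {
                    stream.read_exact(&mut byte)?;
                    if byte[0] == b'\n' {
                        break;
                    }
                    reply.push(byte[0]);
                }
                if !reply.starts_with(b"OK ") {
                    return Err(std::io::Error::new(
                        std::io::ErrorKind::Other,
                        format!("vsock handshake failed: {}", String::from_utf8_lossy(&reply)),
                    ));
                }
                Ok(stream)
            }
        }
    }
}
```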

### 1e. Plumb `--gpu` flag

- [ ] Add fields to `VmConfig`:

```rust
pub vfio_device: Option<String>,
pub gpu_enabled: bool,
```

- [ ] When `gpu_enabled` is set, add `GPU_ENABLED=true` to the guest environment.
- [ ] Wire `--gpu` / `--gpu <pci-addr>` from the CLI to `VmConfig`.

---

## Phase 1.5 -- Guest rootfs: NVIDIA driver and toolkit

- [ ] **NVIDIA driver in rootfs.** Options:
  - **Separate GPU rootfs artifact**: build `rootfs-gpu.tar.zst` alongside `rootfs.tar.zst`. The launcher selects the GPU variant when `--gpu` is passed.
  - **Bake into rootfs**: use the `nvcr.io/nvidia/base/ubuntu` base image from `pins.env`. Heavier (~2-3 GB) but self-contained.
  - **Runtime injection via virtio-fs**: stage driver packages on the host, mount them into the guest. Lighter but more complex.
- [ ] **Driver version compatibility**: document the minimum driver version and GPU compute capability.
- [ ] **NVIDIA container toolkit**: install `nvidia-container-toolkit` so `nvidia-container-runtime` is available to containerd/k3s.
- [ ] **Smoke test**: `nvidia-smi` runs inside the guest after the rootfs build.
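
If the separate-artifact option wins, the launcher-side selection stays tiny; the artifact names below come from the first bullet, everything else is an assumption:

```rust
use std::path::{Path, PathBuf};

/// Pick the rootfs artifact for this launch (separate-GPU-rootfs option).
fn rootfs_artifact(runtime_dir: &Path, gpu_enabled: bool) -> PathBuf {
    let name = if gpu_enabled { "rootfs-gpu.tar.zst" } else { "rootfs.tar.zst" };
    runtime_dir.join(name)
}
```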

---

## Phase 2 -- Guest appliance parity

- [ ] **Init script changes** (`openshell-vm-init.sh`): when `GPU_ENABLED=true`:
  - Load the NVIDIA kernel modules (`nvidia`, `nvidia_uvm`, `nvidia_modeset`)
  - Run `nvidia-smi` -- fail fast if the device is not visible
  - Copy `gpu-manifests/*.yaml` into the k3s auto-deploy directory (mirrors `cluster-entrypoint.sh` ~line 384)
  - Verify `nvidia-container-runtime` is registered with containerd
- [ ] **End-to-end validation**: a sandbox pod requesting `nvidia.com/gpu: 1` gets scheduled and can run `nvidia-smi` inside the pod.

---

## Phase 3 -- CLI / UX

- [ ] Mirror `openshell gateway start --gpu` semantics for the VM backend.
- [ ] Support `--gpu <pci-addr>` for multi-GPU hosts.
- [ ] Document host preparation (IOMMU, `vfio-pci`, unbinding `nvidia`).
- [ ] Document single-GPU caveats (host display loss, headless operation).

---

## Phase 4 -- CI

- [ ] GPU E2E job: an optional runner with `OPENSHELL_VM_GPU_E2E=1` and a VFIO-bound GPU. Tighten `nvidia_gpu_available_for_vm_passthrough()` to require `VfioBoundReady` plus a guest smoke test.
- [ ] Non-GPU cloud-hypervisor CI test: boot and exec-agent check without VFIO. Catches backend regressions without GPU hardware.

---

## Test evolution

Today `nvidia_gpu_available_for_vm_passthrough()` returns `false`. When complete, it should compose:

1. `probe_host_nvidia_vfio_readiness()` returns `VfioBoundReady` (clean IOMMU group)
2. cloud-hypervisor binary present in the runtime bundle
3. `/dev/vfio/vfio` and `/dev/vfio/{group}` accessible
4. Guest rootfs includes the NVIDIA driver and toolkit
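
A sketch of the composed gate, reusing the probe sketch from Phase 0; the two `*_ok` helpers are hypothetical stand-ins for checks 2 and 4:

```rust
pub fn nvidia_gpu_available_for_vm_passthrough() -> bool {
    // 1. Host GPU bound to vfio-pci (clean IOMMU group check omitted here).
    let VfioReadiness::VfioBoundReady { iommu_group, .. } = probe_host_nvidia_vfio_readiness()
    else {
        return false;
    };
    // 2. cloud-hypervisor shipped in the runtime bundle (hypothetical helper).
    let runtime_ok = runtime_bundle_has_cloud_hypervisor();
    // 3. VFIO character devices reachable for this IOMMU group.
    let vfio_ok = std::path::Path::new("/dev/vfio/vfio").exists()
        && std::path::Path::new(&format!("/dev/vfio/{iommu_group}")).exists();
    // 4. Guest rootfs carries the NVIDIA driver and toolkit (hypothetical helper).
    let rootfs_ok = gpu_rootfs_present();
    runtime_ok && vfio_ok && rootfs_ok
}
```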

Options for the final gate:
- `true` only when the CI env var is set and hardware is verified
- Replace the boolean with a full integration check
- Remove `#[ignore]` and run only on GPU runners

Pick one in the final PR so the `mise run test` policy stays intentional.

---

## File change index

| File | Change |
|---|---|
| `crates/openshell-vm/src/lib.rs` | Extract `launch()` internals into backend dispatch; add `vfio_device` / `gpu_enabled` to `VmConfig` |
| `crates/openshell-vm/src/backend.rs` (new) | `VmBackend` trait, `VmLaunchConfig` |
| `crates/openshell-vm/src/backend/libkrun.rs` (new) | `LibkrunBackend` -- moved from `lib.rs` (mechanical refactor) |
| `crates/openshell-vm/src/backend/cloud_hypervisor.rs` (new) | `CloudHypervisorBackend` -- REST API client, process lifecycle, VFIO assignment |
| `crates/openshell-vm/src/ffi.rs` | No changes (used only by `LibkrunBackend`) |
| `crates/openshell-vm/src/exec.rs` | Support both libkrun Unix socket and vhost-vsock connection modes |
| `crates/openshell-vm/src/gpu_passthrough.rs` (move from repo root) | `probe_host_nvidia_vfio_readiness()` with IOMMU group check |
| `crates/openshell-vm/runtime/kernel/openshell.kconfig` | Add `CONFIG_PCI`, `CONFIG_PCI_MSI`, `CONFIG_DRM`, `CONFIG_MODULES`, `CONFIG_MODULE_UNLOAD` |
| `crates/openshell-vm/pins.env` | Add `CLOUD_HYPERVISOR_VERSION`, `VIRTIOFSD_VERSION` |
| `crates/openshell-vm/scripts/openshell-vm-init.sh` | GPU-gated block: module loading, `nvidia-smi` check, manifest copy |
| `tasks/scripts/vm/build-libkrun.sh` | Preserve `vmlinux` in `target/libkrun-build/` |
| `tasks/scripts/vm/build-cloud-hypervisor.sh` (new) | Download or build the cloud-hypervisor static binary |
| `tasks/scripts/vm/package-vm-runtime.sh` | Include `cloud-hypervisor`, `vmlinux`, `virtiofsd` for Linux builds |