
Commit 5fb51ac

GPU support design and implementation plan
Signed-off-by: Vincent Caux-Brisebois <vcauxbrisebo@nvidia.com>
# VM GPU passthrough: implementation plan

> Design: [vm-gpu-passthrough.md](vm-gpu-passthrough.md)

## Phase 0 -- Specification and failing test (current)

- [x] Design doc.
- [x] Phase 0.5 VMM decision (cloud-hypervisor selected).
- [ ] **`gpu_passthrough` module** integrated into `crates/openshell-vm/src/`:
  - `probe_host_nvidia_vfio_readiness()` -- Linux sysfs scan (see the sketch after this list); non-Linux returns `UnsupportedPlatform`.
  - `nvidia_gpu_available_for_vm_passthrough()` -- hard-coded `false` until end-to-end passthrough works.
  - **Note:** `gpu_passthrough.rs` and `gpu_passthrough_implementation.rs` exist as untracked files at the repo root but are not wired into the crate module tree (`lib.rs` does not declare `mod gpu_passthrough;`). Move them into `crates/openshell-vm/src/`, add `pub mod gpu_passthrough;`, and ensure `cargo test -p openshell-vm` compiles them.
- [ ] **Failing integration test** `tests/gpu_passthrough_implementation.rs` -- documents the target behavior and fails until the implementation is finished.
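A minimal sketch of the sysfs scan, assuming a simplified result enum (`VfioBoundReady` is named elsewhere in this plan; the other variants and the exact shape are illustrative, and the IOMMU-group cleanliness check is omitted for brevity):

```rust
use std::{fs, io};

/// Illustrative readiness states; only `VfioBoundReady` is named in this plan.
#[derive(Debug, PartialEq)]
pub enum VfioReadiness {
    UnsupportedPlatform,
    NoNvidiaGpu,
    BoundToOtherDriver(String),
    VfioBoundReady,
}

pub fn probe_host_nvidia_vfio_readiness() -> io::Result<VfioReadiness> {
    if !cfg!(target_os = "linux") {
        return Ok(VfioReadiness::UnsupportedPlatform);
    }
    for entry in fs::read_dir("/sys/bus/pci/devices")? {
        let dev = entry?.path();
        // `vendor` holds the PCI vendor ID; NVIDIA is 0x10de.
        let vendor = fs::read_to_string(dev.join("vendor")).unwrap_or_default();
        if vendor.trim() != "0x10de" {
            continue;
        }
        // `driver` is a symlink to the bound driver, e.g. .../drivers/vfio-pci.
        let Ok(link) = fs::read_link(dev.join("driver")) else {
            continue; // NVIDIA device present but unbound; keep scanning
        };
        let driver = link
            .file_name()
            .and_then(|n| n.to_str())
            .unwrap_or_default()
            .to_string();
        if driver == "vfio-pci" {
            return Ok(VfioReadiness::VfioBoundReady);
        }
        return Ok(VfioReadiness::BoundToOtherDriver(driver));
    }
    Ok(VfioReadiness::NoNvidiaGpu)
}
```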

**Running the red test:** `cargo test -p openshell-vm --test gpu_passthrough_implementation`

**Note:** `mise run test` uses `cargo test --workspace --exclude openshell-vm`, so default CI stays green.

---
## Phase 1 -- VMM backend abstraction and cloud-hypervisor integration

### 1a. Backend trait and libkrun extraction

Refactor only -- no behavior changes. Existing tests must still pass.

- [ ] Create `src/backend.rs` with the `VmBackend` trait:

  ```rust
  pub trait VmBackend {
      fn launch(&self, config: &VmLaunchConfig) -> Result<i32, VmError>;
  }

  pub struct VmLaunchConfig {
      pub base: VmConfig,
      pub vfio_device: Option<String>,
  }
  ```
- [ ] Create `src/backend/libkrun.rs` -- move into `LibkrunBackend`:
  - `VmContext` struct and all methods (current `lib.rs` lines 584-811)
  - gvproxy setup block inside `NetBackend::Gvproxy` (lines 1337-1466)
  - fork + waitpid + signal forwarding (lines 1525-1710)
  - bootstrap block (lines 1648-1663)
- [ ] Extract shared gvproxy startup into a helper used by both backends.
- [ ] Update `launch()` to dispatch:

  ```rust
  pub fn launch(config: &VmLaunchConfig) -> Result<i32, VmError> {
      // ... existing pre-launch checks ...

      if config.vfio_device.is_some() {
          #[cfg(not(target_os = "linux"))]
          return Err(VmError::HostSetup(
              "GPU passthrough requires Linux with KVM and IOMMU".into(),
          ));

          #[cfg(target_os = "linux")]
          {
              let backend = CloudHypervisorBackend::new()?;
              return backend.launch(config);
          }
      }

      LibkrunBackend.launch(config)
  }
  ```
68+
69+
- [ ] `ffi.rs` stays as-is -- only used by `LibkrunBackend`.
70+
71+
### 1b. cloud-hypervisor backend
72+
73+
- [ ] Create `src/backend/cloud_hypervisor.rs` implementing `VmBackend`.
74+
- [ ] REST API client -- HTTP/1.1 over Unix socket, ~5 endpoints:
75+
76+
```
77+
PUT /api/v1/vm.create -- configure VM
78+
PUT /api/v1/vm.boot -- start VM
79+
PUT /api/v1/vm.shutdown -- graceful stop
80+
GET /api/v1/vm.info -- status check
81+
PUT /api/v1/vm.delete -- cleanup
82+
```
83+
84+
Use `hyper` over Unix socket (already in dependency tree) or raw HTTP. Avoid adding `cloud-hypervisor-client` crate for ~5 calls.
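For a sense of how small the raw-HTTP option is, a minimal sketch over `std::os::unix::net::UnixStream` (the helper name and socket path are placeholders; real code would parse the status line instead of returning the response verbatim):

```rust
use std::io::{Read, Write};
use std::os::unix::net::UnixStream;

/// Send one HTTP/1.1 request over the cloud-hypervisor API socket and
/// return the raw response text.
fn api_put(socket: &str, endpoint: &str, body: &str) -> std::io::Result<String> {
    let mut stream = UnixStream::connect(socket)?;
    // The Host value is irrelevant over a Unix socket, but HTTP/1.1
    // requires the header to be present.
    let request = format!(
        "PUT {endpoint} HTTP/1.1\r\nHost: localhost\r\n\
         Content-Type: application/json\r\nContent-Length: {}\r\n\
         Connection: close\r\n\r\n{body}",
        body.len()
    );
    stream.write_all(request.as_bytes())?;
    let mut response = String::new();
    stream.read_to_string(&mut response)?;
    Ok(response)
}

// e.g. api_put("/tmp/ovm-chv-1.sock", "/api/v1/vm.boot", "")?;
```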

- [ ] VM create payload mapping from `VmConfig`:

  ```json
  {
    "cpus": { "boot_vcpus": 4 },
    "memory": { "size": 8589934592 },
    "payload": {
      "kernel": "/path/to/vmlinux",
      "cmdline": "console=hvc0 root=virtiofs:rootfs rw init=/srv/openshell-vm-init.sh"
    },
    "fs": [
      { "tag": "rootfs", "socket": "/path/to/virtiofsd.sock", "num_queues": 1, "queue_size": 1024 }
    ],
    "disks": [
      { "path": "/path/to/state.raw", "readonly": false }
    ],
    "net": [
      { "socket": "/path/to/gvproxy-qemu.sock", "mac": "5a:94:ef:e4:0c:ee" }
    ],
    "vsock": {
      "cid": 3,
      "socket": "/path/to/vsock.sock"
    },
    "devices": [
      { "path": "/sys/bus/pci/devices/0000:41:00.0/" }
    ],
    "serial": { "mode": "File", "file": "/path/to/console.log" },
    "console": { "mode": "Off" }
  }
  ```

- [ ] Process lifecycle (the socket wait in step 2 is sketched after this list):
  1. Start `cloud-hypervisor --api-socket /tmp/ovm-chv-{id}.sock` as a subprocess
  2. Wait for the API socket to appear (exponential backoff, same pattern as gvproxy)
  3. `PUT vm.create` with the config payload
  4. `PUT vm.boot`
  5. Parent waits on the subprocess
  6. Signal forwarding: SIGINT/SIGTERM -> `PUT vm.shutdown` + subprocess SIGTERM
  7. Cleanup: remove the API socket
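A minimal sketch of the socket wait in step 2 (the retry count and initial delay are illustrative; match them to the existing gvproxy wait):

```rust
use std::{path::Path, thread, time::Duration};

/// Poll until the VMM's API socket exists, doubling the delay each try
/// (10 ms, 20 ms, ... roughly 10 s in total before giving up).
fn wait_for_api_socket(path: &Path) -> Result<(), String> {
    let mut delay = Duration::from_millis(10);
    for _ in 0..10 {
        if path.exists() {
            return Ok(());
        }
        thread::sleep(delay);
        delay = delay.saturating_mul(2);
    }
    Err(format!("API socket {} never appeared", path.display()))
}
```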

### 1c. Kernel extraction and build pipeline

- [ ] Modify `build-libkrun.sh`: after building libkrunfw, copy `vmlinux` from the kernel build tree to `target/libkrun-build/vmlinux` before cleanup.
- [ ] Add to `openshell.kconfig` (harmless for non-GPU boots):

  ```
  CONFIG_PCI=y
  CONFIG_PCI_MSI=y
  CONFIG_DRM=y
  CONFIG_MODULES=y
  CONFIG_MODULE_UNLOAD=y
  ```

- [ ] Add to `pins.env`:

  ```bash
  CLOUD_HYPERVISOR_VERSION="${CLOUD_HYPERVISOR_VERSION:-v42.0}"
  VIRTIOFSD_VERSION="${VIRTIOFSD_VERSION:-v1.13.0}"
  ```

- [ ] Create `build-cloud-hypervisor.sh` (or a download step): fetch the pre-built static binary from the cloud-hypervisor GitHub releases for the target architecture.
- [ ] Update `package-vm-runtime.sh`: include `cloud-hypervisor`, `vmlinux`, and `virtiofsd` in the runtime tarball for Linux builds.
- [ ] `validate_runtime_dir()` in `lib.rs` must **not** require the GPU binaries. Only `CloudHypervisorBackend::new()` validates their presence (see the sketch after this list).
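A minimal sketch of that constructor-time validation, assuming a hypothetical `runtime_dir()` helper and the binary names packaged above (`VmError::HostSetup` is used as in the dispatch sample in 1a):

```rust
use std::path::PathBuf;

pub struct CloudHypervisorBackend {
    ch_binary: PathBuf,
    kernel: PathBuf,
    virtiofsd: PathBuf,
}

impl CloudHypervisorBackend {
    /// Fail early, with a specific error, if any GPU-path binary is
    /// missing from the runtime bundle. `runtime_dir()` is hypothetical.
    pub fn new() -> Result<Self, VmError> {
        let dir = runtime_dir();
        let require = |name: &str| -> Result<PathBuf, VmError> {
            let path = dir.join(name);
            if path.exists() {
                Ok(path)
            } else {
                Err(VmError::HostSetup(format!(
                    "GPU passthrough needs {} in the runtime bundle",
                    path.display()
                )))
            }
        };
        Ok(Self {
            ch_binary: require("cloud-hypervisor")?,
            kernel: require("vmlinux")?,
            virtiofsd: require("virtiofsd")?,
        })
    }
}
```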

### 1d. vsock exec agent compatibility

libkrun uses per-port vsock bridging (`krun_add_vsock_port2`): each guest vsock port maps to a host Unix socket. cloud-hypervisor uses standard vhost-vsock with a single socket and CID-based addressing.

- [ ] Update `exec.rs` to support both connection modes:
  - **libkrun**: connect to `vm_exec_socket_path()` (existing)
  - **cloud-hypervisor**: connect via `AF_VSOCK` (CID 3, port 10777), or bridge with `socat` (see the sketch after this list)
- [ ] Test exec agent communication (`cat`, `env`) over both backends.
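A minimal sketch of the `AF_VSOCK` path using the `libc` crate (Linux-only; CID 3 and port 10777 are the values from the bullet above, and production code might prefer the `vsock` crate's stream types over a raw fd):

```rust
use std::io;
use std::os::fd::{FromRawFd, OwnedFd};

/// Connect to the guest exec agent over AF_VSOCK (Linux only).
#[cfg(target_os = "linux")]
fn connect_vsock(cid: u32, port: u32) -> io::Result<OwnedFd> {
    // SAFETY: plain syscalls; we own the returned fd on success.
    unsafe {
        let fd = libc::socket(libc::AF_VSOCK, libc::SOCK_STREAM, 0);
        if fd < 0 {
            return Err(io::Error::last_os_error());
        }
        let mut addr: libc::sockaddr_vm = std::mem::zeroed();
        addr.svm_family = libc::AF_VSOCK as libc::sa_family_t;
        addr.svm_cid = cid;   // guest CID, 3 in the plan above
        addr.svm_port = port; // exec agent port, 10777 in the plan above
        let rc = libc::connect(
            fd,
            &addr as *const _ as *const libc::sockaddr,
            std::mem::size_of::<libc::sockaddr_vm>() as libc::socklen_t,
        );
        if rc != 0 {
            let err = io::Error::last_os_error();
            libc::close(fd);
            return Err(err);
        }
        Ok(OwnedFd::from_raw_fd(fd))
    }
}
```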

### 1e. Plumb `--gpu` flag

- [ ] Add fields to `VmConfig`:

  ```rust
  pub vfio_device: Option<String>,
  pub gpu_enabled: bool,
  ```

- [ ] When `gpu_enabled` is set, add `GPU_ENABLED=true` to the guest environment.
- [ ] Wire `--gpu` / `--gpu <pci-addr>` from the CLI to `VmConfig` (see the sketch after this list).
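The plan does not name the argument parser; if the CLI uses `clap`, a flag that takes an optional PCI address could look like this (the `"auto"` sentinel is an assumption):

```rust
use clap::Parser;

/// Hypothetical CLI surface for the VM launch subcommand.
#[derive(Parser)]
struct LaunchArgs {
    /// Enable GPU passthrough; optionally pin a specific PCI address,
    /// e.g. `--gpu 0000:41:00.0`. Bare `--gpu` auto-selects a device.
    #[arg(long, num_args = 0..=1, default_missing_value = "auto")]
    gpu: Option<String>,
}

fn apply(args: &LaunchArgs, config: &mut VmConfig) {
    config.gpu_enabled = args.gpu.is_some();
    // "auto" means probe for a VFIO-ready NVIDIA device; a specific
    // address is passed through to the VFIO device list verbatim.
    config.vfio_device = args.gpu.clone().filter(|v| v != "auto");
}
```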

---

## Phase 1.5 -- Guest rootfs: NVIDIA driver and toolkit

- [ ] **NVIDIA driver in rootfs.** Options:
  - **Separate GPU rootfs artifact**: build `rootfs-gpu.tar.zst` alongside `rootfs.tar.zst`. The launcher selects the GPU variant when `--gpu` is passed.
  - **Bake into rootfs**: use the `nvcr.io/nvidia/base/ubuntu` base image from `pins.env`. Heavier (~2-3 GB) but self-contained.
  - **Runtime injection via virtio-fs**: stage driver packages on the host and mount them into the guest. Lighter but more complex.
- [ ] **Driver version compatibility**: document the minimum driver version and GPU compute capability.
- [ ] **NVIDIA container toolkit**: install `nvidia-container-toolkit` so `nvidia-container-runtime` is available to containerd/k3s.
- [ ] **Smoke test**: `nvidia-smi` runs inside the guest after the rootfs build.

---
## Phase 2 -- Guest appliance parity

- [ ] **Init script changes** (`openshell-vm-init.sh`): when `GPU_ENABLED=true`:
  - Load the NVIDIA kernel modules (`nvidia`, `nvidia_uvm`, `nvidia_modeset`)
  - Run `nvidia-smi` -- fail fast if the device is not visible
  - Copy `gpu-manifests/*.yaml` into the k3s auto-deploy directory (mirrors `cluster-entrypoint.sh` ~line 384)
  - Verify `nvidia-container-runtime` is registered with containerd
- [ ] **End-to-end validation**: a sandbox pod requesting `nvidia.com/gpu: 1` gets scheduled and can run `nvidia-smi` inside the pod.

---
## Phase 3 -- CLI / UX

- [ ] Mirror `openshell gateway start --gpu` semantics for the VM backend.
- [ ] Support `--gpu <pci-addr>` for multi-GPU hosts.
- [ ] Document host preparation (IOMMU, `vfio-pci`, unbinding `nvidia`).
- [ ] Document single-GPU caveats (host display loss, headless operation).

---

## Phase 4 -- CI

- [ ] GPU E2E job: an optional runner with `OPENSHELL_VM_GPU_E2E=1` and a VFIO-bound GPU. Tighten `nvidia_gpu_available_for_vm_passthrough()` to require `VfioBoundReady` plus the guest smoke test.
- [ ] Non-GPU cloud-hypervisor CI test: boot and exec agent check without VFIO. Catches backend regressions without GPU hardware.

---
## Test evolution

Today `nvidia_gpu_available_for_vm_passthrough()` returns `false`. When complete, it should compose (a minimal sketch follows this list):

1. `probe_host_nvidia_vfio_readiness()` returns `VfioBoundReady` (clean IOMMU group)
2. cloud-hypervisor binary present in the runtime bundle
3. `/dev/vfio/vfio` and `/dev/vfio/{group}` accessible
4. Guest rootfs includes the NVIDIA driver and toolkit
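A sketch of the composed gate, reusing the hypothetical `VfioReadiness` and `runtime_dir()` helpers from the earlier sketches (the per-group `/dev/vfio/{group}` check and the rootfs check are noted but stubbed):

```rust
use std::path::Path;

pub fn nvidia_gpu_available_for_vm_passthrough() -> bool {
    // 1. Host probe must report a cleanly VFIO-bound NVIDIA device.
    let ready = matches!(
        probe_host_nvidia_vfio_readiness(),
        Ok(VfioReadiness::VfioBoundReady)
    );
    // 2. The cloud-hypervisor binary must ship in the runtime bundle.
    let vmm_present = runtime_dir().join("cloud-hypervisor").exists();
    // 3. The VFIO container device must be accessible; the per-group
    //    check needs the group id discovered by the probe, omitted here.
    let vfio_ok = Path::new("/dev/vfio/vfio").exists();
    // 4. Rootfs driver/toolkit presence is validated by the guest smoke
    //    test in CI rather than probed from the host.
    ready && vmm_present && vfio_ok
}
```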

Options for the final gate:

- `true` only when the CI env var is set and the hardware is verified
- Replace the boolean with a full integration check
- Remove `#[ignore]` and run only on GPU runners

Pick one in the final PR so the `mise run test` policy stays intentional.

---

## File change index

| File | Change |
|---|---|
| `crates/openshell-vm/src/lib.rs` | Extract `launch()` internals into backend dispatch; add `vfio_device` / `gpu_enabled` to `VmConfig` |
| `crates/openshell-vm/src/backend.rs` (new) | `VmBackend` trait, `VmLaunchConfig` |
| `crates/openshell-vm/src/backend/libkrun.rs` (new) | `LibkrunBackend` -- moved from `lib.rs` (mechanical refactor) |
| `crates/openshell-vm/src/backend/cloud_hypervisor.rs` (new) | `CloudHypervisorBackend` -- REST API client, process lifecycle, VFIO assignment |
| `crates/openshell-vm/src/ffi.rs` | No changes (used only by `LibkrunBackend`) |
| `crates/openshell-vm/src/exec.rs` | Support both libkrun Unix socket and vhost-vsock connection modes |
| `crates/openshell-vm/src/gpu_passthrough.rs` (move from repo root) | `probe_host_nvidia_vfio_readiness()` with IOMMU group check |
| `crates/openshell-vm/runtime/kernel/openshell.kconfig` | Add `CONFIG_PCI`, `CONFIG_PCI_MSI`, `CONFIG_DRM`, `CONFIG_MODULES`, `CONFIG_MODULE_UNLOAD` |
| `crates/openshell-vm/pins.env` | Add `CLOUD_HYPERVISOR_VERSION`, `VIRTIOFSD_VERSION` |
| `crates/openshell-vm/scripts/openshell-vm-init.sh` | GPU-gated block: module loading, `nvidia-smi` check, manifest copy |
| `tasks/scripts/vm/build-libkrun.sh` | Preserve `vmlinux` in `target/libkrun-build/` |
| `tasks/scripts/vm/build-cloud-hypervisor.sh` (new) | Download or build the cloud-hypervisor static binary |
| `tasks/scripts/vm/package-vm-runtime.sh` | Include `cloud-hypervisor`, `vmlinux`, `virtiofsd` for Linux builds |
