|
| 1 | +--- |
| 2 | +title: "Wasmtime 35 Brings AArch64 Support in Winch" |
| 3 | +author: "Saúl Cabrera" |
| 4 | +date: "2025-07-16" |
| 5 | +github_name: "saulecabrera" |
| 6 | +excerpt_separator: <!--end_excerpt--> |
| 7 | +--- |
| 8 | + |
| 9 | +[Wasmtime](https://wasmtime.dev/) is a fast, secure, |
| 10 | +standards-compliant, and lightweight WebAssembly (Wasm) runtime. |
| 11 | + |
| 12 | +As of Wasmtime 35, Winch [supports AArch64 for Core |
| 13 | +Wasm](https://docs.wasmtime.dev/stability-tiers.html#aarch64) |
| 14 | +proposals, along with additional Wasm proposals like the [Component |
| 15 | +Model](https://component-model.bytecodealliance.org/) and [Custom Page |
| 16 | +Sizes](https://github.com/WebAssembly/custom-page-sizes/blob/main/proposals/custom-page-sizes/Overview.md). |
| 17 | +<!--end_excerpt--> |
| 18 | + |
| 19 | +Embedders can |
| 20 | +[configure](https://docs.wasmtime.dev/api/wasmtime/struct.Config.html#method.strategy) |
| 21 | +Wasmtime to use either [Cranelift](https://cranelift.dev/) or |
| 22 | +[Winch](https://github.com/bytecodealliance/wasmtime/tree/main/winch) |
| 23 | +as the Wasm compiler depending on the use-case: Cranelift is an |
| 24 | +optimizing compiler aiming to generate fast code. Winch is a |
| 25 | +'baseline' compiler, aiming for fast compilation and low-latency |
| 26 | +startup. |
| 27 | + |
| 28 | +This blog post will cover the main changes needed to accommodate |
| 29 | +support for AArch64 in Winch. |
| 30 | + |
| 31 | +## Quick Tour of Winch's Architecture |
| 32 | + |
| 33 | +To achieve its low-latency goal, Winch focuses on converting Wasm code |
| 34 | +to assembly code for the target Instruction Set Architecture (ISA) as |
| 35 | +quickly as possible. Unlike Cranelift, Winch's architecture |
| 36 | +intentionally avoids using an intermediate representation or complex |
| 37 | +register allocation algorithms in its compilation process. For this |
| 38 | +reason, baseline compilers are also referred to as single-pass |
| 39 | +compilers. |
| 40 | + |
| 41 | +Winch's architecture can be largely divided into two parts: |
| 42 | +ISA-agnostic components and ISA-specific components. |
| 43 | + |
| 44 | +<img src="/articles/img/2025-07-16-winch-aarch64/compilation-process.png" alt="Winch's Architecture" /> |
| 45 | + |
| 46 | +Adding support for AArch64 to Winch involved adding a new |
| 47 | +implementation of the `MacroAssembler` trait, which is ultimately in |
| 48 | +charge of emitting AArch64 assembly. Winch's ISA-agnostic components |
| 49 | +remained unchanged and are shared with the existing x86_64 |
| 50 | +implementation. |
| 51 | + |
| 52 | +Winch's code generation context implements |
| 53 | +[`wasmparser`](https://crates.io/crates/wasmparser)'s |
| 54 | +[`VisitOperator`](https://docs.rs/wasmparser/0.235.0/wasmparser/trait.VisitOperator.html) |
| 55 | +trait, which requires defining handlers for each Wasm opcode: |
| 56 | + |
| 57 | +```rust |
| 58 | +fn visit_i32_const(&mut self, value: i32) -> Self::Output { |
| 59 | + // Code generation starts here. |
| 60 | +} |
| 61 | +``` |
| 62 | + |
| 63 | +When an opcode handler is invoked, the Code Generation Context |
| 64 | +prepares all the necessary values and registers, followed by the |
| 65 | +machine code emission of the sequence of instructions to represent the |
| 66 | +Wasm instruction in the target ISA. |
| 67 | + |
| 68 | +Last but not least, the register allocation algorithm uses a simple |
| 69 | +round-robin approach over the available ISA registers. When a |
| 70 | +requested register is unavailable, all the current live values at the |
| 71 | +current program point are saved to memory (known as value spilling), |
| 72 | +thereby freeing the requested register for immediate use. |
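
The allocation strategy described above can be sketched as a small model. This is an illustrative toy, not Winch's actual allocator: the `RoundRobinAlloc` type and its methods are hypothetical names, registers are plain indices, and "spilling" is only counted rather than emitted as stores.

```rust
/// Hypothetical model of a round-robin register allocator that spills
/// every live value when a request cannot be satisfied.
pub struct RoundRobinAlloc {
    in_use: Vec<bool>, // in_use[i] is true while register i holds a live value
    next: usize,       // cursor: where the next scan starts
    spills: usize,     // how many values have been spilled so far
}

impl RoundRobinAlloc {
    pub fn new(reg_count: usize) -> Self {
        assert!(reg_count > 0);
        Self { in_use: vec![false; reg_count], next: 0, spills: 0 }
    }

    /// Request any register, scanning from the cursor.
    pub fn any(&mut self) -> usize {
        let n = self.in_use.len();
        for step in 0..n {
            let reg = (self.next + step) % n;
            if !self.in_use[reg] {
                return self.claim(reg);
            }
        }
        // Nothing is free: spill all live values, then take the cursor register.
        self.spill_all();
        let reg = self.next;
        self.claim(reg)
    }

    /// Request one specific register, spilling if it is occupied.
    pub fn specific(&mut self, reg: usize) -> usize {
        if self.in_use[reg] {
            self.spill_all();
        }
        self.claim(reg)
    }

    pub fn spill_count(&self) -> usize {
        self.spills
    }

    fn claim(&mut self, reg: usize) -> usize {
        self.in_use[reg] = true;
        self.next = (reg + 1) % self.in_use.len();
        reg
    }

    fn spill_all(&mut self) {
        // A real compiler would emit a store to the stack per live value here.
        self.spills += self.in_use.iter().filter(|live| **live).count();
        self.in_use.iter_mut().for_each(|live| *live = false);
    }
}
```

With two registers, a third call to `any()` finds nothing free, spills both live values, and hands out register 0 again. The simplicity is the point: there is no liveness analysis, so allocation decisions stay cheap.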
| 73 | + |
| 74 | +## Emitting AArch64 Assembly |
| 75 | + |
| 76 | +### Shadow Stack Pointer (SSP) |
| 77 | + |
| 78 | +AArch64 defines very specific restrictions regarding the usage |
| 79 | +of the stack pointer register (SP). Concretely, SP must be 16-byte |
| 80 | +aligned whenever it is used to address stack memory. Given that |
| 81 | +Winch's register allocation algorithm requires value spilling at |
| 82 | +arbitrary program points, it can be challenging to maintain such |
| 83 | +alignment. |
| 84 | + |
| 85 | +AArch64 requires SP to be 16-byte aligned only when it is used to |
| 86 | +address stack memory; SP may be unaligned otherwise. The requirement |
| 87 | +also does not prevent using other registers for stack memory |
| 88 | +addressing, nor does it require those registers to be 16-byte |
| 89 | +aligned. To avoid opting for less efficient approaches like |
| 90 | +overallocating memory to ensure alignment each time a value is saved, |
| 91 | +Winch's architecture employs a _shadow stack pointer_ approach. |
| 92 | + |
| 93 | +Winch's shadow stack pointer approach defines `x28` as the base register |
| 94 | +for stack memory addressing, enabling: |
| 95 | + |
| 96 | +- 8-byte stack slots for live value spilling. |
| 97 | +- 8-byte aligned stack memory loads. |
| 98 | + |
| 99 | +### Signal handlers |
| 100 | + |
| 101 | +Wasmtime can be |
| 102 | +[configured](https://docs.wasmtime.dev/api/wasmtime/struct.Config.html#method.signals_based_traps) |
| 103 | +to leverage signals-based traps to detect exceptional situations in |
| 104 | +Wasm programs, e.g., an out-of-bounds memory access. Traps are |
| 105 | +synchronous exceptions, and when they are raised, they are caught and |
| 106 | +handled by code defined in Wasmtime's runtime. These handlers are Rust |
| 107 | +functions compiled to the target ISA, following the native calling |
| 108 | +convention, which implies that whenever there is a transition from |
| 109 | +Winch generated code to a signal handler, SP must be 16-byte |
| 110 | +aligned. Note that even though Wasmtime can be configured to avoid |
| 111 | +signals-based traps, Winch does not support that option yet. |
| 112 | + |
| 113 | +Given that traps can happen at arbitrary program points, Winch's |
| 114 | +approach to ensure 16-byte alignment for SP is two-fold: |
| 115 | + |
| 116 | +* Emit a series of instructions that will |
| 117 | + correctly align SP before each potentially-trapping Wasm instruction. |
| 118 | + Note that this could result in overallocation of stack memory if SP is |
| 119 | + not 16-byte aligned. |
| 120 | +* Exclusively use SSP as the canonical stack pointer value, copying |
| 121 | + the value of SSP to SP after each allocation/deallocation. This |
| 122 | + maintains the SP >= SSP invariant, which ensures that SP always |
| 123 | + reflects an overapproximation of the consumed stack space and it |
| 124 | + allows the generated code to save an extra move instruction, if |
| 125 | + overallocation due to alignment happens, as described in the |
| 126 | + previous point. |
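
The relationship between SSP and SP described in the two points above can be modeled with a little arithmetic. The sketch below is an assumption-laden illustration rather than Winch's code: `StackModel` and its methods are hypothetical, and both pointers are tracked as bytes of stack consumed (not as addresses), which is the reading under which SP >= SSP holds.

```rust
/// Size of one spill slot addressed through the shadow stack pointer.
pub const SLOT_BYTES: u32 = 8;
/// Alignment required of the real SP when it addresses memory.
pub const SP_ALIGN: u32 = 16;

/// Round `bytes` up to the next multiple of `align` (a power of two).
pub fn align_up(bytes: u32, align: u32) -> u32 {
    debug_assert!(align.is_power_of_two());
    (bytes + align - 1) & !(align - 1)
}

/// Hypothetical model: both fields count bytes of stack consumed so far.
pub struct StackModel {
    pub ssp: u32, // canonical value, grows in 8-byte steps
    pub sp: u32,  // real stack pointer, never smaller than `ssp`
}

impl StackModel {
    pub fn new() -> Self {
        Self { ssp: 0, sp: 0 }
    }

    /// Allocate one 8-byte slot, then copy SSP to SP.
    pub fn push_slot(&mut self) {
        self.ssp += SLOT_BYTES;
        self.sp = self.ssp;
    }

    /// Free one 8-byte slot, then copy SSP to SP.
    pub fn pop_slot(&mut self) {
        self.ssp -= SLOT_BYTES;
        self.sp = self.ssp;
    }

    /// Before a potentially-trapping instruction: make SP 16-byte aligned.
    /// This may overallocate by 8 bytes; SSP is left untouched.
    pub fn align_for_trap(&mut self) {
        self.sp = align_up(self.ssp, SP_ALIGN);
    }
}
```

After a single `push_slot`, SSP and SP both read 8; `align_for_trap` then moves SP to 16 while SSP stays at 8, so the overallocation is visible only in SP and disappears at the next allocation or deallocation.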
| 127 | + |
| 128 | +It's worth noting that the approach mentioned above doesn't take into |
| 129 | +account asynchronous exceptions, also known as interrupts. Further |
| 130 | +testing and development are needed to ensure that Winch-generated |
| 131 | +code for AArch64 can correctly handle interrupts, e.g., |
| 132 | +`SIGALRM`. |
| 133 | + |
| 134 | +### Immediate Value Handling |
| 135 | + |
| 136 | +To minimize register pressure and reduce the need for spilling values, |
| 137 | +Winch’s instruction selection prioritizes emitting instructions that |
| 138 | +support immediate operands whenever possible, such as `mov x0, |
| 139 | +#imm`. However, due to the fixed-width instruction encoding in AArch64 |
| 140 | +(which always uses 32-bit instructions), encoding large immediate |
| 141 | +values directly within a single instruction can sometimes be |
| 142 | +impossible. In such cases, the immediate is first loaded into an |
| 143 | +auxiliary register—often a "scratch" or temporary register—and then |
| 144 | +used in subsequent instructions that require register operands. |
| 145 | + |
| 146 | +Scratch registers offer the advantage that they are not tracked by the |
| 147 | +register allocator, reducing the possibility of register allocator |
| 148 | +induced spills. However, they should be used sparingly and only for |
| 149 | +short-lived operations. |
| 150 | + |
| 151 | +AArch64’s fixed 32-bit instruction encoding imposes stricter limits on |
| 152 | +the size of immediate values that can be encoded directly, unlike |
| 153 | +other ISAs supported by Winch, such as x86_64, which support |
| 154 | +variable-length instructions and can encode larger immediates more |
| 155 | +easily. |
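
To make the encoding limit concrete, consider the AArch64 `add` (immediate) form, which accepts a 12-bit unsigned value, optionally shifted left by 12 bits. The sketch below shows the kind of check an instruction selector might perform; `AddOperand` and `select_add_operand` are hypothetical names for illustration and are not taken from Winch.

```rust
/// How an addend can be supplied to an AArch64 `add` instruction.
#[derive(Debug, PartialEq, Eq)]
pub enum AddOperand {
    /// Encodable in the instruction: a 12-bit value, optionally shifted by 12.
    Imm12 { bits: u16, shift12: bool },
    /// Not encodable: the value must first be materialized in a register.
    NeedsRegister,
}

/// Decide whether `value` fits the `add` immediate encoding.
pub fn select_add_operand(value: u64) -> AddOperand {
    if value < (1 << 12) {
        AddOperand::Imm12 { bits: value as u16, shift12: false }
    } else if value & 0xfff == 0 && value < (1 << 24) {
        AddOperand::Imm12 { bits: (value >> 12) as u16, shift12: true }
    } else {
        AddOperand::NeedsRegister
    }
}
```

For example, 4095 and 4096 are both encodable (the latter as `1` shifted by 12), while 4097 is not and takes the scratch-register path.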
| 156 | + |
| 157 | +Before supporting AArch64, Winch’s ISA-agnostic component assumed a |
| 158 | +single scratch register per ISA. While this worked well for x86_64, |
| 159 | +where most instructions can encode a broad range of immediates |
| 160 | +directly, it proved problematic for AArch64, specifically for |
| 161 | +instruction sequences that need an immediate operand while the |
| 162 | +scratch register is already in use. |
| 163 | + |
| 164 | +Consider the following snippet from Winch’s ISA-agnostic code for |
| 165 | +computing a Wasm table element address: |
| 166 | + |
| 167 | +```rust |
| 168 | +// 1. Load index into the scratch register. |
| 169 | +masm.mov(scratch.writable(), index.into(), bound_size)?; |
| 170 | +// 2. Multiply with an immediate element size. |
| 171 | +masm.mul( |
| 172 | + scratch.writable(), |
| 173 | + scratch.inner(), |
| 174 | + RegImm::i32(table_data.element_size.bytes() as i32), |
| 175 | + table_data.element_size, |
| 176 | +)?; |
| 177 | +masm.load_ptr( |
| 178 | + masm.address_at_reg(base, table_data.offset)?, |
| 179 | + writable!(base), |
| 180 | +)?; |
| 181 | +masm.mov(writable!(tmp), base.into(), ptr_size)?; |
| 182 | +masm.add(writable!(base), base, scratch.inner().into(), ptr_size) |
| 183 | +``` |
| 184 | + |
| 185 | +In step 1, the code clobbers the designated scratch register. More |
| 186 | +critically, if the immediate passed to `Masm::mul` cannot be encoded |
| 187 | +directly in the AArch64 mul instruction, the `Masm::mul` implementation |
| 188 | +will load the immediate into a register—clobbering the scratch |
| 189 | +register again—and emit a register-based multiplication instruction. |
| 190 | + |
| 191 | +One way to address this limitation is to avoid using a scratch |
| 192 | +register for the index altogether and instead request a register from |
| 193 | +the register allocator. This approach, however, increases register |
| 194 | +pressure and potentially raises memory traffic, particularly in |
| 195 | +architectures like x86_64. |
| 196 | + |
| 197 | +Winch's preferred solution is to introduce an explicit scratch register |
| 198 | +allocator that provides a small pool of scratch registers (e.g., x16 |
| 199 | +and x17 in AArch64). By managing scratch registers explicitly, Winch |
| 200 | +can safely allocate and use them without risking accidental |
| 201 | +clobbering, especially when generating code for architectures with |
| 202 | +stricter immediate encoding constraints. |
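
One way to picture an explicit scratch register allocator is a small pool whose registers are lent out for the duration of a closure and returned automatically afterwards. The following is a minimal sketch under that assumption; `ScratchPool`, `with_scratch`, and `available` are hypothetical names, and register names are plain strings rather than real register types.

```rust
/// Hypothetical pool of scratch registers that are never handed to the
/// regular register allocator.
pub struct ScratchPool {
    free: Vec<&'static str>,
}

impl ScratchPool {
    /// A two-register pool, mirroring the x16/x17 example in the text.
    pub fn aarch64() -> Self {
        // `pop` takes from the end, so x16 is handed out first.
        Self { free: vec!["x17", "x16"] }
    }

    /// Lend one scratch register to `f`, returning it to the pool afterwards.
    /// Yields `None` when the pool is exhausted instead of silently reusing
    /// a register that is still live.
    pub fn with_scratch<R>(
        &mut self,
        f: impl FnOnce(&mut Self, &'static str) -> R,
    ) -> Option<R> {
        let reg = self.free.pop()?;
        let result = f(&mut *self, reg);
        self.free.push(reg);
        Some(result)
    }

    pub fn available(&self) -> usize {
        self.free.len()
    }
}
```

Nested calls receive distinct registers, so a helper that needs a temporary for a large immediate cannot clobber a scratch register its caller is still using; a third nested request fails visibly rather than aliasing.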
| 203 | + |
| 204 | +## What's Next |
| 205 | + |
| 206 | +Though it wasn't a radical change, completing AArch64 support in |
| 207 | +Winch marks a new stage for the compiler's architecture, laying a |
| 208 | +more robust and solid foundation for future ISA additions. |
| 209 | + |
| 210 | +Contributions are welcome! If you're interested in contributing, you can: |
| 211 | + |
| 212 | +* Start by reading [Wasmtime's contributing documentation](https://docs.wasmtime.dev/contributing.html) |
| 213 | +* Check out [Winch's project board](https://github.com/orgs/bytecodealliance/projects/12/views/4) |
| 214 | + |
| 215 | +## That's a wrap |
| 216 | + |
| 217 | +Thanks to everyone who [contributed](https://github.com/bytecodealliance/wasmtime/issues/8321) |
| 218 | +to the completion of the AArch64 backend! |
| 219 | +Thanks also to [Nick Fitzgerald](https://github.com/fitzgen) and |
| 220 | +[Chris Fallin](https://github.com/cfallin) for their feedback on early |
| 221 | +drafts of this article. |