Commit 09bc754

Add blog post detailing support for AArch64 in Winch (#132)
* Add blog post detailing support for AArch64 in Winch * Update _posts/2025-07-16-winch-aarch64-support.md --------- Co-authored-by: Oscar Spencer <oscar.spen@gmail.com>
1 parent 90b038c commit 09bc754

2 files changed

+221
-0
lines changed

@@ -0,0 +1,221 @@
1+
---
2+
title: "Wasmtime 35 Brings AArch64 Support in Winch"
3+
author: "Saúl Cabrera"
4+
date: "2025-07-16"
5+
github_name: "saulecabrera"
6+
excerpt_separator: <!--end_excerpt-->
7+
---
8+
9+
[Wasmtime](https://wasmtime.dev/) is a fast, secure,
10+
standards-compliant, and lightweight WebAssembly (Wasm) runtime.
11+
12+
As of Wasmtime 35, Winch [supports AArch64 for Core
13+
Wasm](https://docs.wasmtime.dev/stability-tiers.html#aarch64)
14+
proposals, along with additional Wasm proposals like the [Component
15+
Model](https://component-model.bytecodealliance.org/) and [Custom Page
16+
Sizes](https://github.com/WebAssembly/custom-page-sizes/blob/main/proposals/custom-page-sizes/Overview.md).
17+
<!--end_excerpt-->
18+
19+
Embedders can
20+
[configure](https://docs.wasmtime.dev/api/wasmtime/struct.Config.html#method.strategy)
21+
Wasmtime to use either [Cranelift](https://cranelift.dev/) or
22+
[Winch](https://github.com/bytecodealliance/wasmtime/tree/main/winch)
23+
as the Wasm compiler depending on the use-case: Cranelift is an
24+
optimizing compiler aiming to generate fast code. Winch is a
25+
'baseline' compiler, aiming for fast compilation and low-latency
26+
startup.
27+
28+
This blog post will cover the main changes needed to accommodate
29+
support for AArch64 in Winch.
30+
31+
## Quick Tour of Winch's Architecture
32+
33+
To achieve its low-latency goal, Winch focuses on converting Wasm code
34+
to assembly code for the target Instruction Set Architecture (ISA) as
35+
quickly as possible. Unlike Cranelift, Winch's architecture
36+
intentionally avoids using an intermediate representation or complex
37+
register allocation algorithms in its compilation process. For this
38+
reason, baseline compilers are also referred to as single-pass
39+
compilers.
40+
41+
Winch's architecture can be largely divided into two parts:
42+
ISA-agnostic and ISA-specific.
43+
44+
<img src="/articles/img/2025-07-16-winch-aarch64/compilation-process.png" alt="Winch's Architecture" />
45+
46+
Adding support for AArch64 to Winch involved adding a new
47+
implementation of the `MacroAssembler` trait, which is ultimately in
48+
charge of emitting AArch64 assembly. Winch's ISA-agnostic components
49+
remained unchanged and are shared with the existing x86_64
50+
implementation.
51+
52+
Winch's code generation context implements
53+
[`wasmparser`](https://crates.io/crates/wasmparser)'s
54+
[`VisitOperator`](https://docs.rs/wasmparser/0.235.0/wasmparser/trait.VisitOperator.html)
55+
trait, which requires defining handlers for each Wasm opcode:
56+
57+
```rust
58+
fn visit_i32_const(&mut self, value: i32) -> Self::Output {
59+
    // Code generation starts here.
60+
}
61+
```
62+
63+
When an opcode handler is invoked, the Code Generation Context
64+
prepares all the necessary values and registers, and then emits the
65+
sequence of machine instructions that represents the
66+
Wasm instruction in the target ISA.
67+
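The split between the ISA-agnostic and ISA-specific parts can be pictured with a small, std-only sketch. Everything in it (`ToyMasm`, `ToyX64`, `ToyAarch64`, `ToyCodeGen`, `compile_const`) is a hypothetical stand-in invented for illustration; Winch's real `MacroAssembler` trait is much larger and emits machine code, not text.

```rust
/// Hypothetical, heavily simplified stand-in for an ISA-specific assembler.
trait ToyMasm {
    /// Emit "load this 32-bit constant into the given register".
    fn mov_imm32(&mut self, dst: &str, imm: i32);
    /// Return the textual assembly emitted so far.
    fn finish(self) -> Vec<String>;
}

#[derive(Default)]
struct ToyX64 {
    out: Vec<String>,
}

#[derive(Default)]
struct ToyAarch64 {
    out: Vec<String>,
}

impl ToyMasm for ToyX64 {
    fn mov_imm32(&mut self, dst: &str, imm: i32) {
        self.out.push(format!("mov {dst}, {imm}"));
    }
    fn finish(self) -> Vec<String> {
        self.out
    }
}

impl ToyMasm for ToyAarch64 {
    fn mov_imm32(&mut self, dst: &str, imm: i32) {
        self.out.push(format!("mov {dst}, #{imm}"));
    }
    fn finish(self) -> Vec<String> {
        self.out
    }
}

/// ISA-agnostic part: one handler per Wasm opcode, generic over the assembler.
struct ToyCodeGen<M: ToyMasm> {
    masm: M,
}

impl<M: ToyMasm> ToyCodeGen<M> {
    /// Handler for `i32.const`; the destination register is passed in
    /// directly here, whereas a real compiler would ask its allocator.
    fn visit_i32_const(&mut self, value: i32, dst: &str) {
        self.masm.mov_imm32(dst, value);
    }
}

/// Compile a single `i32.const` for whichever ISA `M` implements.
fn compile_const<M: ToyMasm + Default>(value: i32, dst: &str) -> Vec<String> {
    let mut codegen = ToyCodeGen { masm: M::default() };
    codegen.visit_i32_const(value, dst);
    codegen.masm.finish()
}
```

In this sketch, adding a new ISA means adding one more `ToyMasm` implementation; `ToyCodeGen` is left untouched, which mirrors the way the AArch64 work is described above.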
68+
Last but not least, the register allocator algorithm uses a simple
69+
round-robin approach over the available ISA registers. When a
70+
requested register is unavailable, all the live values at the
71+
current program point are saved to memory (known as value spilling),
72+
thereby freeing the requested register for immediate use.
73+
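A minimal sketch of the spill-everything strategy, assuming a toy allocator that hands out any free register in order (Winch can also request a specific register, which this sketch does not model). `ToyRegAlloc`, `Loc`, and the register numbers are hypothetical names for illustration only.

```rust
use std::collections::VecDeque;

/// Where a live Wasm value currently resides.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Loc {
    Reg(u8),
    /// Byte offset of an 8-byte spill slot.
    Stack(u32),
}

struct ToyRegAlloc {
    /// Registers that are currently available, in hand-out order.
    free: VecDeque<u8>,
    /// Location of every live value, oldest first.
    live: Vec<Loc>,
    /// Offset of the next unused spill slot.
    next_slot: u32,
}

impl ToyRegAlloc {
    fn new(regs: &[u8]) -> Self {
        ToyRegAlloc {
            free: regs.iter().copied().collect(),
            live: Vec::new(),
            next_slot: 0,
        }
    }

    /// Move every register-resident live value to a fresh stack slot,
    /// returning its register to the free queue.
    fn spill_all(&mut self) {
        for loc in self.live.iter_mut() {
            if let Loc::Reg(reg) = *loc {
                *loc = Loc::Stack(self.next_slot);
                self.next_slot += 8;
                self.free.push_back(reg);
            }
        }
    }

    /// Place a new live value in a register, spilling first if none is free.
    fn push_value(&mut self) -> Loc {
        if self.free.is_empty() {
            self.spill_all();
        }
        let reg = self.free.pop_front().expect("no allocatable registers");
        let loc = Loc::Reg(reg);
        self.live.push(loc);
        loc
    }
}
```

With two registers, the third `push_value` call finds the queue empty, spills both earlier values to 8-byte slots, and reuses the first register. There is no liveness analysis or cost model; that simplicity is the trade-off a baseline compiler makes for compilation speed.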
74+
## Emitting AArch64 Assembly
75+
76+
### Shadow Stack Pointer (SSP)
77+
78+
AArch64 defines very specific restrictions with regard to the usage
79+
of the stack pointer register (SP). Concretely, SP must be 16-byte
80+
aligned whenever it is used to address stack memory. Given that
81+
Winch's register allocation algorithm requires value spilling at
82+
arbitrary program points, it can be challenging to maintain such
83+
alignment.
84+
85+
AArch64 requires SP to be 16-byte aligned only when it is used to
86+
address stack memory; otherwise it can be unaligned. The requirement
87+
doesn't prevent using other registers for stack memory addressing,
88+
nor does it require that these other registers be 16-byte aligned.
89+
To avoid opting for less efficient approaches like
90+
overallocating memory to ensure alignment each time a value is saved,
91+
Winch's architecture employs a _shadow stack pointer_ approach.
92+
93+
Winch's shadow stack pointer approach defines `x28` as the base register
94+
for stack memory addressing, enabling:
95+
96+
- 8-byte stack slots for live value spilling.
97+
- 8-byte aligned stack memory loads.
98+
99+
### Signal handlers
100+
101+
Wasmtime can be
102+
[configured](https://docs.wasmtime.dev/api/wasmtime/struct.Config.html#method.signals_based_traps)
103+
to leverage signals-based traps to detect exceptional situations in
104+
Wasm programs, e.g., an out-of-bounds memory access. Traps are
105+
synchronous exceptions, and when they are raised, they are caught and
106+
handled by code defined in Wasmtime's runtime. These handlers are Rust
107+
functions compiled to the target ISA, following the native calling
108+
convention, which implies that whenever there is a transition from
109+
Winch-generated code to a signal handler, SP must be 16-byte
110+
aligned. Note that even though Wasmtime can be configured to avoid
111+
signals-based traps, Winch does not support that option yet.
112+
113+
Given that traps can happen at arbitrary program points, Winch's
114+
approach to ensure 16-byte alignment for SP is two-fold:
115+
116+
* Emit a series of instructions that will
117+
correctly align SP before each potentially-trapping Wasm instruction.
118+
Note that this could result in overallocation of stack memory if SP is
119+
not 16-byte aligned.
120+
* Exclusively use SSP as the canonical stack pointer value, copying
121+
the value of SSP to SP after each allocation/deallocation. This
122+
maintains the SP >= SSP invariant, which ensures that SP always
123+
reflects an overapproximation of the consumed stack space and it
124+
allows the generated code to save an extra move instruction, if
125+
overallocation due to alignment happens, as described in the
126+
previous point.
127+
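The two-fold approach can be sketched with a toy model. As a simplifying assumption, both pointers are tracked here as bytes of consumed stack space, not as addresses, so that "SP >= SSP" reads directly as "SP over-approximates the consumed stack". `ToyStack`, `align_up_16`, and the method names are hypothetical.

```rust
/// Round a byte count up to the next multiple of 16.
fn align_up_16(bytes: u32) -> u32 {
    (bytes + 15) & !15
}

/// Toy model of the two stack pointers, measured in consumed bytes
/// (an assumption for readability; real code manipulates addresses).
struct ToyStack {
    /// Shadow stack pointer: exact, moves in 8-byte steps.
    ssp: u32,
    /// Real stack pointer: must be 16-byte aligned when a trap can occur.
    sp: u32,
}

impl ToyStack {
    fn new() -> Self {
        ToyStack { ssp: 0, sp: 0 }
    }

    /// Reserve one 8-byte spill slot via SSP, then copy SSP to SP.
    fn push_slot(&mut self) {
        self.ssp += 8;
        self.sp = self.ssp;
    }

    /// Release one 8-byte spill slot via SSP, then copy SSP to SP.
    /// Panics in debug builds if no slot is reserved.
    fn pop_slot(&mut self) {
        self.ssp -= 8;
        self.sp = self.ssp;
    }

    /// Before a potentially-trapping instruction: align SP only.
    /// SSP is left untouched, so SP may over-allocate by 8 bytes.
    fn align_sp_for_trap(&mut self) {
        self.sp = align_up_16(self.ssp);
    }
}
```

After a single `push_slot`, SSP and SP both read 8; `align_sp_for_trap` then bumps SP to 16 while SSP stays at 8. The next `push_slot` sets both to 16, so the earlier over-allocation is absorbed and no separate "undo the alignment" step is needed in this model.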
128+
It's worth noting that the approach mentioned above doesn't take into
129+
account asynchronous exceptions, also known as interrupts. Further
130+
testing and development are needed to ensure that
131+
Winch-generated code for AArch64 can correctly handle interrupts, e.g.,
132+
`SIGALRM`.
133+
134+
### Immediate Value Handling
135+
136+
To minimize register pressure and reduce the need for spilling values,
137+
Winch’s instruction selection prioritizes emitting instructions that
138+
support immediate operands whenever possible, such as `mov x0,
139+
#imm`. However, due to the fixed-width instruction encoding in AArch64
140+
(which always uses 32-bit instructions), encoding large immediate
141+
values directly within a single instruction can sometimes be
142+
impossible. In such cases, the immediate is first loaded into an
143+
auxiliary register—often a "scratch" or temporary register—and then
144+
used in subsequent instructions that require register operands.
145+
146+
Scratch registers offer the advantage that they are not tracked by the
147+
register allocator, reducing the possibility of register allocator
148+
induced spills. However, they should be used sparingly and only for
149+
short-lived operations.
150+
151+
AArch64’s fixed 32-bit instruction encoding imposes stricter limits on
152+
the size of immediate values that can be encoded directly, unlike
153+
other ISAs supported by Winch, such as x86_64, which support
154+
variable-length instructions and can encode larger immediates more
155+
easily.
156+
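To make the encoding limits concrete, here is a small sketch of two checks an assembler might perform. The function names are hypothetical and this is not Winch's code; the encodings themselves are the standard A64 ones: `add`/`sub` accept a 12-bit unsigned immediate, optionally shifted left by 12 bits, and a single `movz` can materialize one 16-bit chunk placed at bit 0, 16, 32, or 48.

```rust
/// True if `value` fits the A64 ADD/SUB immediate form:
/// a 12-bit unsigned value, optionally shifted left by 12 bits.
fn fits_add_sub_imm(value: u64) -> bool {
    let unshifted = value < (1u64 << 12);
    let shifted = (value & 0xfff) == 0 && value < (1u64 << 24);
    unshifted || shifted
}

/// True if a single MOVZ can materialize `value`, i.e. all set bits
/// fall inside one 16-bit chunk at bit position 0, 16, 32, or 48.
fn fits_single_movz(value: u64) -> bool {
    (0..4).any(|chunk| (value & !(0xffffu64 << (16 * chunk))) == 0)
}
```

When a check like `fits_add_sub_imm` fails, the assembler has to fall back to loading the constant into a register first, which is exactly where the scratch register comes into play.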
157+
Before supporting AArch64, Winch’s ISA-agnostic component assumed a
158+
single scratch register per ISA. While this worked well for x86_64,
159+
where most instructions can encode a broad range of immediates
160+
directly, it proved problematic for AArch64, specifically for
161+
instruction sequences in which the scratch register had already been
162+
acquired before an instruction with an immediate operand was emitted.
163+
164+
Consider the following snippet from Winch’s ISA-agnostic code for
165+
computing a Wasm table element address:
166+
167+
```rust
168+
// 1. Load index into the scratch register.
169+
masm.mov(scratch.writable(), index.into(), bound_size)?;
170+
// 2. Multiply with an immediate element size.
171+
masm.mul(
172+
    scratch.writable(),
173+
    scratch.inner(),
174+
    RegImm::i32(table_data.element_size.bytes() as i32),
175+
    table_data.element_size,
176+
)?;
177+
masm.load_ptr(
178+
    masm.address_at_reg(base, table_data.offset)?,
179+
    writable!(base),
180+
)?;
181+
masm.mov(writable!(tmp), base.into(), ptr_size)?;
182+
masm.add(writable!(base), base, scratch.inner().into(), ptr_size)
183+
```
184+
185+
In step 1, the code clobbers the designated scratch register. More
186+
critically, if the immediate passed to `Masm::mul` cannot be encoded
187+
directly in the AArch64 `mul` instruction, the `Masm::mul` implementation
188+
will load the immediate into a register—clobbering the scratch
189+
register again—and emit a register-based multiplication instruction.
190+
191+
One way to address this limitation is to avoid using a scratch
192+
register for the index altogether and instead request a register from
193+
the register allocator. This approach, however, increases register
194+
pressure and potentially raises memory traffic, particularly in
195+
architectures like x86_64.
196+
197+
Winch's preferred solution is to introduce an explicit scratch register
198+
allocator that provides a small pool of scratch registers (e.g., x16
199+
and x17 in AArch64). By managing scratch registers explicitly, Winch
200+
can safely allocate and use them without risking accidental
201+
clobbering, especially when generating code for architectures with
202+
stricter immediate encoding constraints.
203+
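A minimal sketch of what such a pool could look like, assuming a closure-scoped API. `ToyScratchPool`, `with_scratch`, and `index_times_size` are hypothetical names, registers are plain strings, and the emitted "assembly" is illustrative text only; Winch's actual scratch register allocator differs.

```rust
/// Hypothetical pool of scratch registers, kept separate from the
/// main register allocator.
struct ToyScratchPool {
    free: Vec<&'static str>,
}

impl ToyScratchPool {
    /// Pool with the two registers named in the text above.
    fn aarch64() -> Self {
        // `pop` takes from the end, so x16 is handed out first.
        ToyScratchPool { free: vec!["x17", "x16"] }
    }

    /// Run `f` with a scratch register that no one else holds. The
    /// register goes back to the pool once `f` returns.
    fn with_scratch<T>(&mut self, f: impl FnOnce(&mut Self, &'static str) -> T) -> T {
        let reg = self.free.pop().expect("scratch register pool exhausted");
        let result = f(self, reg);
        self.free.push(reg);
        result
    }
}

/// Mirrors the table-address scenario: the index lives in one scratch
/// register while the multiplication needs a second one for its immediate.
fn index_times_size(pool: &mut ToyScratchPool) -> Vec<String> {
    let mut asm = Vec::new();
    pool.with_scratch(|pool, index| {
        asm.push(format!("mov {index}, x0"));
        // Assume the immediate cannot be encoded in the instruction, so it
        // is loaded into a *different* scratch register first.
        pool.with_scratch(|_, imm| {
            asm.push(format!("mov {imm}, #16"));
            asm.push(format!("mul {index}, {index}, {imm}"));
        });
    });
    asm
}
```

Because the index register is still checked out when the nested request is made, the pool cannot hand it out again. The clobbering described earlier becomes a pool-exhaustion panic in the worst case, where before it would have been silently wrong code.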
204+
## What's Next
205+
206+
Though it wasn't a radical change, the completion of AArch64 support in
207+
Winch marks a new stage for the compiler's architecture, laying a
208+
more robust and solid foundation for future ISA additions.
209+
210+
Contributions are welcome! If you're interested in contributing, you can:
211+
212+
* Start by reading [Wasmtime's contributing documentation](https://docs.wasmtime.dev/contributing.html)
213+
* Check out [Winch's project board](https://github.com/orgs/bytecodealliance/projects/12/views/4)
214+
215+
## That's a wrap
216+
217+
Thanks to everyone who [contributed](https://github.com/bytecodealliance/wasmtime/issues/8321)
218+
to the completion of the AArch64 backend!
219+
Thanks also to [Nick Fitzgerald](https://github.com/fitzgen) and
220+
[Chris Fallin](https://github.com/cfallin) for their feedback on early
221+
drafts of this article.