Fix a couple of issues that prevent wasmtime for compiling/running on arm64_32 (Apple Watch)#13259

Open
matthargett wants to merge 3 commits into bytecodealliance:main from rebeckerspecialties:unwinder-arm64_32-asm-format

Conversation

@matthargett matthargett commented May 4, 2026

Two-commit series enabling wasmtime to build for arm64_32-apple-watchos
(Apple Watch Series 4+ ILP32 ABI). Verified end-to-end on Apple Watch SE 2
(S8 SoC, watchOS 11) and iPhone XS (A12, iOS 18) running an 11-workload
Pulley benchmark, with WAMR fast-interp as a side-by-side comparison
runtime.

Commit 1 — unwinder: type aarch64 register-bearing locals as u64

crates/unwinder/src/arch/aarch64.rs has inline-asm operands that take
register-width values. They were typed usize, which works on the usual
aarch64-* LP64 targets where usize is u64 and the operand class is
unambiguously the 64-bit GPR view. On arm64_32-apple-watchos (ILP32
ABI: 64-bit registers, 32-bit pointers) usize is u32, which makes
the same operands ambiguous between the w<N> (32-bit lane) and x<N>
(64-bit GPR) views — exactly what rustc's asm_sub_register lint flags.
Relying on the ISA-side zero-extend that aarch64 happens to perform on
mov w<N>, ... would also be relying on a property the language doesn't
promise: the Rust Reference is explicit that the upper bits of a
register holding a sub-register-width input are undefined (see
https://doc.rust-lang.org/reference/inline-assembly.html#r-asm.register-operands.smaller-value).

Rather than leak u64 into the public surface (the Unwind trait, the
shared arch/mod.rs dispatch, and the per-arch backends in x86.rs /
riscv64.rs / s390x.rs), keep the public function signatures usize
— that's the existing convention shared with the other backends, and
the u64-vs-pointer-width split is unique to aarch64-on-ILP32. Inside
aarch64.rs only, type any register-bearing local that participates in
inline asm as u64, and cast at the boundaries:

  • u64::try_from(v).unwrap() widens usize → u64 (infallible on
    every supported Rust target; the .unwrap() documents that any
    failure would be a target-property issue rather than a runtime one).
  • as usize narrows u64 → usize at the return. It truncates on
    arm64_32 by design (the saved PC/SP there is a 32-bit host
    pointer that fits exactly in the low 32 bits) and is the identity
    on aarch64 LP64.
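
The boundary casts described above can be sketched as follows. This is a minimal illustration, not the code in the PR: `to_reg`, `from_reg`, and `add_in_register` are hypothetical names, and the `add` instruction stands in for whatever the real inline asm does. The point is that every operand handed to `asm!` is a `u64`, so the operand always selects the 64-bit `x<N>` view, while the public signature stays `usize`.

```rust
use std::convert::TryFrom;

/// Hypothetical helper: widen a pointer-width value to a 64-bit register value.
fn to_reg(v: usize) -> u64 {
    // Cannot fail on targets where usize is at most 64 bits wide.
    u64::try_from(v).unwrap()
}

/// Hypothetical helper: narrow a register value back to pointer width.
/// Truncates where usize is 32 bits, identity where it is 64 bits.
fn from_reg(v: u64) -> usize {
    v as usize
}

/// Hypothetical example: usize in the signature, u64 for every asm operand.
pub fn add_in_register(base: usize, offset: usize) -> usize {
    let base: u64 = to_reg(base);
    let offset: u64 = to_reg(offset);
    let sum: u64;

    #[cfg(target_arch = "aarch64")]
    unsafe {
        // u64 operands format as x<N>, never w<N>.
        std::arch::asm!(
            "add {sum}, {base}, {offset}",
            sum = out(reg) sum,
            base = in(reg) base,
            offset = in(reg) offset,
            options(pure, nomem, nostack, preserves_flags),
        );
    }

    // Portable fallback so the sketch also builds on non-aarch64 hosts.
    #[cfg(not(target_arch = "aarch64"))]
    {
        sum = base.wrapping_add(offset);
    }

    from_reg(sum)
}
```

Keeping the conversions in two small helpers makes the widening and narrowing points easy to audit; whether the actual patch factors them out this way is not stated in the description.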

Also switch the saved-LR load from *(fp as *mut usize).offset(1) to
*(fp as *mut u64).offset(1). AAPCS64 reserves two 64-bit slots for
the frame record on every aarch64 ABI variant, including arm64_32,
so an 8-byte stride is correct regardless of pointer width. With
*mut usize on arm64_32, .offset(1) would advance by only 4 bytes
and read the upper half of the saved-FP slot. This is a latent
correctness fix; today the unwinder isn't exercised on arm64_32
(which runs Pulley, not Cranelift-compiled native code), but the
corrected form is the right one to land alongside the type change.
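
To make the stride problem concrete, here is a small host-independent sketch. It models the 16-byte frame record as a little-endian byte buffer and reads "slot 1" with a 4-byte stride (what `*mut usize` would do where `usize` is 32 bits) and with an 8-byte stride (what `*mut u64` does). The function names and the sample FP/LR values are made up for illustration.

```rust
/// Hypothetical model of a frame record: saved FP, then saved LR, 8 bytes each.
pub fn frame_record(saved_fp: u64, saved_lr: u64) -> [u8; 16] {
    let mut buf = [0u8; 16];
    buf[..8].copy_from_slice(&saved_fp.to_le_bytes());
    buf[8..].copy_from_slice(&saved_lr.to_le_bytes());
    buf
}

/// Reads element `index` using a 4-byte element size, as a pointer to a
/// 32-bit usize would.
pub fn read_with_4_byte_stride(record: &[u8; 16], index: usize) -> u32 {
    let start = index * 4;
    let mut bytes = [0u8; 4];
    bytes.copy_from_slice(&record[start..start + 4]);
    u32::from_le_bytes(bytes)
}

/// Reads element `index` using an 8-byte element size, as `*mut u64` does.
pub fn read_with_8_byte_stride(record: &[u8; 16], index: usize) -> u64 {
    let start = index * 8;
    let mut bytes = [0u8; 8];
    bytes.copy_from_slice(&record[start..start + 8]);
    u64::from_le_bytes(bytes)
}
```

With a saved FP of `0x1122_3344_5566_7788`, the 4-byte stride returns `0x1122_3344` for index 1, which is the upper half of the saved-FP slot, while the 8-byte stride returns the saved LR.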

Diff is one file (crates/unwinder/src/arch/aarch64.rs, +69 / -4).
No behaviour change on existing aarch64 LP64 targets; silences two
asm_sub_register warnings on a future arm64_32-apple-watchos build.

Commit 2 — Bump mach2 dep from 0.4.2 to 0.6

mach2 v0.4.2 emits compile_error!("mach requires macOS or iOS") on
any target where neither target_os = "macos" nor ios matches, plus
a matching narrow target_vendor gate on its libc build-dep. That
blocks Apple watchOS / tvOS / visionOS targets — wasmtime's runtime
feature pulls mach2 in unconditionally so the build fails with both
error: mach requires macOS or iOS and error[E0463]: can't find crate for libc.

The fix has been upstream in mach2 since 0.6.0 (commit 538ce75,
2025-08-16, "Add support for tvOS, watchOS and visionOS"): both gates
widen to cfg(target_vendor = "apple"). The mach2 module API wasmtime
imports (exc, exception_types, kern_return, mach_init,
mach_port, message, ndr, port, thread_act, thread_status)
is unchanged between 0.4.2 and 0.6.0; only internal libc/core::ffi
type-plumbing differs. Bumping the workspace dep is sufficient — no
changes in machports.rs.
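
The effect of widening the gate can be sketched with `cfg!`. This is not mach2's source; the two predicates below only mirror the conditions as the description above states them, and the function names are hypothetical.

```rust
/// Condition in the style attributed to mach2 0.4.2: macOS or iOS only.
pub fn narrow_gate_matches() -> bool {
    cfg!(any(target_os = "macos", target_os = "ios"))
}

/// Condition in the style attributed to mach2 0.6.0: any Apple-vendor target.
pub fn wide_gate_matches() -> bool {
    cfg!(target_vendor = "apple")
}

/// True on targets the narrow gate rejects but the wide gate accepts,
/// such as watchOS, tvOS, and visionOS.
pub fn newly_supported() -> bool {
    wide_gate_matches() && !narrow_gate_matches()
}
```

Every target that satisfies the narrow condition also satisfies the wide one, so the bump only adds targets and does not drop any.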

Verified by building wasmtime as a staticlib for
arm64_32-apple-watchos under
nightly-2026-01-25 + -Z build-std=std,panic_abort with
--features pulley,runtime,std,cranelift,anyhow. The dev-only path
(cranelift-jit -> region -> mach2 0.4.x) keeps an older mach2 in the
lockfile for cranelift-jit's own host tests; that path is not part of
any production embedder build and stays unchanged. cargo deny flags
the resulting two mach2 versions but region is already in
skip-tree, so no deny.toml change is needed; the right long-term
fix is for region to update. @alexcrichton is preparing a
cargo vet audit update for the new mach2 0.6.0 separately.

End-to-end verification

This 2-commit stack + the companion target-lexicon Arm64_32 patch
(submitted separately to bytecodealliance/target-lexicon) is enough to
build a Pulley-only static library for arm64_32-apple-watchos and link
it into a watchOS app. On real hardware:

Apple Watch SE 2 (S8 SoC, watchOS 11, arm64_32-apple-watchos)

| workload | Pulley | WAMR fast-interp | winner |
|---|---|---|---|
| fib(30) | 132.04 ms | 165.05 ms | Pulley +25% |
| fib_tail(100000) [return_call] | 0.566 ms | 0.886 ms | Pulley +57% |
| factorial(20) | <1 µs | <1 µs | tie |
| sieve(10000) | 0.762 ms | 0.938 ms | Pulley +23% |
| crc32(64 KiB) | 5.301 ms | 5.153 ms | WAMR +3% |
| matmul simd128 64×64 | 3.986 ms | 9.398 ms | Pulley +136% |
| matmul relaxed-simd FMA | 3.143 ms | err (not in WAMR) | Pulley |
| convolution 256×256 | 10.549 ms | 11.789 ms | Pulley +12% |
| audio DSP 1000×512 | 1471.56 ms | 1060.26 ms | WAMR +39% |
| bulk_memory (memory.copy/fill) | 31.564 ms | 15.644 ms | WAMR +102% |
| call_indirect (200 K dispatches) | 36.727 ms | 23.260 ms | WAMR +58% |

iPhone XS (A12, iOS 18, aarch64-apple-ios)

| workload | Pulley | WAMR fast-interp | winner |
|---|---|---|---|
| fib(30) | 41.147 ms | 49.371 ms | Pulley +20% |
| fib_tail(100000) [return_call] | 0.252 ms | 0.382 ms | Pulley +52% |
| sieve(10000) | 0.340 ms | 0.269 ms | WAMR +26% |
| matmul simd128 64×64 | 1.697 ms | 3.228 ms | Pulley +90% |
| matmul relaxed-simd FMA | 1.339 ms | err (not in WAMR) | Pulley |
| audio DSP 1000×512 | 536.97 ms | 418.73 ms | WAMR +28% |
| bulk_memory (memory.copy/fill) | 10.824 ms | 4.962 ms | WAMR +118% |
| call_indirect (200 K dispatches) | 17.621 ms | 8.861 ms | WAMR +99% |

All results match the host-Rust reference function byte-for-byte
across both runtimes.
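
The description does not say how the winner percentages were derived, but the reported figures are consistent with (slower time / faster time - 1) × 100, rounded to the nearest whole percent. A small sketch under that assumption, with a hypothetical function name:

```rust
/// Returns the faster runtime and its margin in whole percent, assuming the
/// margin is (slower / faster - 1) * 100 rounded to the nearest integer.
pub fn winner(pulley_ms: f64, wamr_ms: f64) -> (&'static str, i64) {
    if pulley_ms == wamr_ms {
        return ("tie", 0);
    }
    let (name, faster, slower) = if pulley_ms < wamr_ms {
        ("Pulley", pulley_ms, wamr_ms)
    } else {
        ("WAMR", wamr_ms, pulley_ms)
    };
    let margin = ((slower / faster - 1.0) * 100.0).round() as i64;
    (name, margin)
}
```

For example, the Apple Watch fib(30) row (132.04 ms against 165.05 ms) yields Pulley at 25%, and the bulk_memory row (31.564 ms against 15.644 ms) yields WAMR at 102%.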

@matthargett matthargett requested review from a team as code owners May 4, 2026 03:51
@matthargett matthargett requested review from dicej and removed request for a team May 4, 2026 03:51
@matthargett matthargett changed the title from "Unwinder arm64 32 asm format" to "Fix a couple of issues that precent wasmtime for compiling/running on arm64_32 (Apple Watch0" May 4, 2026
@matthargett matthargett changed the title from "Fix a couple of issues that precent wasmtime for compiling/running on arm64_32 (Apple Watch0" to "Fix a couple of issues that precent wasmtime for compiling/running on arm64_32 (Apple Watch)" May 4, 2026
@matthargett matthargett changed the title from "Fix a couple of issues that precent wasmtime for compiling/running on arm64_32 (Apple Watch)" to "Fix a couple of issues that prevent wasmtime for compiling/running on arm64_32 (Apple Watch)" May 4, 2026
@alexcrichton alexcrichton requested review from alexcrichton and removed request for dicej May 4, 2026 16:26
@alexcrichton
Member

Wanted to say again thanks for the porting work here and even the benchmark work as well, it's much appreciated!

@matthargett matthargett force-pushed the unwinder-arm64_32-asm-format branch from 7ab1b5f to 48096e4 Compare May 4, 2026 20:20
`crates/unwinder/src/arch/aarch64.rs` has inline-asm operands that take
register-width values. They were typed `usize`, which works on the usual
`aarch64-*` LP64 targets where `usize` is `u64` and the operand class is
unambiguously the 64-bit GPR view. On `arm64_32-apple-watchos` (ILP32
ABI: 64-bit registers, 32-bit pointers) `usize` is `u32`, which makes
the same operands ambiguous between the `w<N>` (32-bit lane) and `x<N>`
(64-bit GPR) views — exactly what rustc's `asm_sub_register` lint flags.
Relying on the ISA-side zero-extend that aarch64 happens to perform on
`mov w<N>, ...` would also be relying on a property the language
doesn't promise: the Rust Reference is explicit that the upper bits of
a register holding a sub-register-width input are *undefined*[0].

Rather than leak `u64` into the public surface (the `Unwind` trait, the
shared `arch/mod.rs` dispatch, and the per-arch backends in `x86.rs`,
`riscv64.rs`, `s390x.rs`), keep the public function signatures `usize`
— that's the existing convention shared with the other backends, and
the `u64`-vs-pointer-width split is unique to aarch64-on-ILP32. Inside
this module, type any register-bearing local that participates in
inline asm as `u64`, and cast at the boundaries:

  - `u64::try_from(v).unwrap()` widens `usize` → `u64` (infallible on
    every supported Rust target; the `.unwrap()` documents that any
    failure would be a target-property issue).
  - `as usize` narrows `u64` → `usize` at the return — truncates on
    `arm64_32` by design (the saved PC/SP there is a 32-bit host
    pointer that fits exactly in the low 32 bits) and is the identity
    on aarch64 LP64.

Also switch the saved-LR load from `*(fp as *mut usize).offset(1)` to
`*(fp as *mut u64).offset(1)`. AAPCS64 reserves two 64-bit slots for
the frame record on every aarch64 ABI variant — including `arm64_32` —
so an 8-byte stride is correct regardless of pointer width. With
`*mut usize` on `arm64_32`, `.offset(1)` would advance by only 4 bytes
and read the upper half of the saved-FP slot. This is a latent
correctness fix; today the unwinder isn't exercised on `arm64_32`
(which runs Pulley, not Cranelift-compiled native code), but the
corrected form is the right one to land alongside the type change.

No behaviour change on existing aarch64 LP64 targets. Silences two
`asm_sub_register` warnings on a future `arm64_32-apple-watchos` build
of this crate.

[0]: https://doc.rust-lang.org/reference/inline-assembly.html#r-asm.register-operands.smaller-value
mach2 v0.4.2 emits `compile_error!("mach requires macOS or iOS")` on any
target where neither `target_os = "macos"` nor `target_os = "ios"` matches.
That blocks every Apple non-iOS-non-macOS platform — most pressingly
arm64_32-apple-watchos for embedders shipping wasmtime on Apple Watch.

The fix has been upstream in mach2 since 0.6.0 (commit `538ce75`,
2025-08-16, "Add support for tvOS, watchOS and visionOS"), which widens
the cfg gate from `cfg(any(macos, ios))` to `cfg(target_vendor = "apple")`
on both the `compile_error!` and the `libc` build-dep, with no public-API
changes in the modules wasmtime imports
(`exc`, `exception_types`, `kern_return`, `mach_init`, `mach_port`,
 `message`, `ndr`, `port`, `thread_act`, `thread_status`).

Verified by building wasmtime as a `staticlib` for `arm64_32-apple-watchos`
under `nightly-2026-01-25 + -Z build-std=std,panic_abort` with
`--features pulley,runtime,std,cranelift,anyhow` — no other changes
needed in `crates/wasmtime/src/runtime/vm/sys/unix/machports.rs`.

The dev-only path (`cranelift-jit -> region -> mach2 0.4.x`) keeps an
older mach2 in the lockfile for cranelift-jit's own host tests; that
path is not part of any production embedder build and stays unchanged.

Closes the watchOS port story without needing a separate mach2 release.
@matthargett matthargett force-pushed the unwinder-arm64_32-asm-format branch from 48096e4 to f6d629b Compare May 4, 2026 20:42
@alexcrichton
Member

For the vets I typically push directly to a PR, which by default works most of the time, but I think the origin of this fork, the rebeckerspecialties organization, doesn't allow that. In lieu of that @matthargett could you cherry-pick alexcrichton@4c193dd into this PR and then I can approve-and-merge?

@matthargett
Author

Done — cherry-picked your 4c193dda87f7c4c29e055fc3af39e88fec4b5a39 (Add vets for mach2) onto the head of this PR's branch. New tip is 3c0c73fbde.

Verified locally:

  • cargo vet check succeeds: Vetting Succeeded (482 fully audited, 32 partially audited, 53 exempted).
  • The two pre-existing wildcard-expiry / unnecessary-import warnings are unrelated to this commit.

Ready for approve-and-merge whenever you have a moment. Thanks for the offer to push directly — the rebeckerspecialties org's branch protections do block third-party pushes, so the cherry-pick path is the cleanest workaround.
