### Describe the bug
Repeated small bounded bulk-memory operations appear to be much slower in Wasmtime than in Wasmer Cranelift in a minimal microbenchmark family.
I first found this in generated differential tests for memory.copy, then reduced and checked it with a smaller reproducer plus several controls. The slowdown is not limited to one seed, and it is still present after varying copy length and src/dst relation.
test_cases.zip
The clearest primary reproducer I found is:
- `primary_reproducer_memory_copy_len32.wat`

Useful supporting controls are:
- `supporting_control_memory_copy_len0.wat`
- `supporting_memory_fill_same_shape.wat`
- `supporting_memory_copy_len1.wat`
- `supporting_memory_copy_len64_safe.wat`
- `supporting_memory_copy_src_eq_dst_len32.wat`
- `supporting_memory_copy_src_plus1024_len32_safe.wat`
### Test Case
Primary reproducer loop body:

```wat
(local.get $i)
(i32.wrap_i64)
(i32.const 65504)
(i32.and)
(local.get $i)
(i32.wrap_i64)
(i32.const 1431655765)
(i32.xor)
(i32.const 65504)
(i32.and)
(i32.const 32)
(memory.copy)
```
The reduced reproducer uses:
- trip count: `2^28`
- one page of memory: `(memory 1)`
- both src/dst addresses constrained to a small low-memory window

The closest controls are:
- same shape, but `memory.copy` length changed to 0
- same shape, but `memory.copy` replaced with `memory.fill`
- same shape, but copy lengths swept across 1/4/8/16/32/64
- same shape, but src/dst relation changed to `src == dst` and `src = dst + 1024`
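For reference, the address shape of the hot loop can be modeled in a few lines of Python (this is my reading of the loop body above, not code shipped in the testcase); it checks that every iteration's dst/src pair stays in-bounds for the one-page memory, so no iteration can trap:

```python
PAGE = 65536      # one wasm page, matching (memory 1)
MASK = 65504      # 0xFFE0 from the loop body: 32-byte aligned, max value 65504
XOR = 1431655765  # 0x55555555 from the loop body
LEN = 32          # copy length in the primary reproducer

def addrs(i):
    """Model of the per-iteration dst/src computation in the hot loop."""
    dst = i & MASK
    src = (i ^ XOR) & MASK
    return dst, src

# Both 32-byte ranges always stay inside the single 64 KiB page.
assert all(
    dst + LEN <= PAGE and src + LEN <= PAGE
    for dst, src in map(addrs, range(1 << 16))
)
```

Since the mask keeps both addresses at or below 65504 and the copy length is 32, every access ends exactly at or before the page boundary.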
### Steps to Reproduce
- Build the primary testcase:
  `wat2wasm primary_reproducer_memory_copy_len32.wat -o primary_reproducer_memory_copy_len32.wasm`
- Warm up once:
  `wasmtime primary_reproducer_memory_copy_len32.wasm`
- Measure runtime:
  `perf stat -r 3 -e 'task-clock' wasmtime primary_reproducer_memory_copy_len32.wasm`
- For comparison, run the same flow on the supporting testcases listed above.
- If helpful, compare against Wasmer Cranelift with:
  `wasmer run primary_reproducer_memory_copy_len32.wasm`
  `perf stat -r 3 -e 'task-clock' wasmer run primary_reproducer_memory_copy_len32.wasm`
### Expected and Actual Results
#### Primary memory.copy reproducer and close controls

| testcase | shape | wasmer_cranelift (s) | wasmtime (s) | ratio |
|---|---|---|---|---|
| control_drop | target removed | 0.09570 | 0.08054 | 0.84x |
| memory.copy len=0 | xor-shaped src/dst, bounded window | 0.97108 | 2.61960 | 2.70x |
| memory.copy len=32 | xor-shaped src/dst, bounded window | 0.76792 | 2.68820 | 3.50x |
| memory.fill len=32 | same bounded address shape | 0.64743 | 2.25270 | 3.48x |
Observed pattern:
- the target-removed control is fast in both runtimes;
- Wasmtime is already much slower for `memory.copy len=0`;
- the slowdown remains for `memory.copy len=32`;
- a related bulk-memory instruction (`memory.fill`) shows a similar gap.

This makes the anomaly look more like bulk-memory helper/runtime-path cost than payload-movement cost.
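Given the `2^28` trip count of the reduced reproducer, the measured totals translate into rough per-iteration costs (a back-of-the-envelope calculation from the numbers above, using the `memory.copy len=32` row):

```python
TRIPS = 1 << 28  # trip count of the reduced reproducer

def per_iter_ns(total_seconds):
    """Average cost of one loop iteration in nanoseconds."""
    return total_seconds * 1e9 / TRIPS

wasmtime_ns = per_iter_ns(2.68820)  # ~10.0 ns per memory.copy
wasmer_ns = per_iter_ns(0.76792)    # ~2.9 ns per memory.copy
```

Roughly 10 ns per 32-byte copy in Wasmtime versus about 3 ns in Wasmer Cranelift is consistent with per-call overhead, rather than byte movement, dominating the loop.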
#### Length sweep for memory.copy

| testcase | wasmer_cranelift (s) | wasmtime (s) | ratio |
|---|---|---|---|
| len=0 | 0.97080 | 2.73150 | 2.81x |
| len=1 | 0.97112 | 2.99300 | 3.08x |
| len=4 | 0.89589 | 2.82370 | 3.15x |
| len=8 | 0.89769 | 2.81460 | 3.14x |
| len=16 | 0.91569 | 2.77790 | 3.03x |
| len=32 | 0.76524 | 2.65780 | 3.47x |
| len=64 (safe window) | 0.76253 | 2.68210 | 3.52x |
Observed pattern:
- from `len=0` through `len=64`, the slowdown ratio stays broadly stable;
- the main trigger does not seem to be the payload size itself.
#### Src/dst relation sweep for memory.copy len=32

| testcase | wasmer_cranelift (s) | wasmtime (s) | ratio |
|---|---|---|---|
| src == dst | 0.75430 | 2.61270 | 3.46x |
| src = dst + 1024 (safe) | 0.72937 | 2.57010 | 3.52x |
Observed pattern:
- the gap remains even when the copy is a self-copy or a fixed-offset in-bounds copy;
- this does not look specific to the original xor-shaped address relation;
- it also does not look primarily driven by overlap semantics.
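To support that last point: under the xor-shaped relation, `dst ^ src` is the constant `0x55555555 & 0xFFE0 == 21824`, and both addresses are 32-byte aligned, so the two 32-byte ranges can never overlap in the first place. A small sketch checking this exhaustively (my modeling of the loop body, not testcase code):

```python
MASK = 65504      # 0xFFE0 from the loop body
XOR = 1431655765  # 0x55555555
LEN = 32

def overlaps(i):
    """Do the dst and src ranges of iteration i overlap?"""
    dst = i & MASK
    src = (i ^ XOR) & MASK
    return max(dst, src) < min(dst, src) + LEN

# dst ^ src is always 21824, and distinct 32-byte-aligned addresses
# differ by at least 32 bytes, so no iteration ever overlaps.
assert not any(map(overlaps, range(1 << 16)))
```

So overlap handling is never exercised by the original shape, which matches the sweep showing the gap is unchanged for `src == dst` and `src = dst + 1024`.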
#### Family-level consistency

The original full-trip generated `memory_copy_*` seeds all showed wasmtime slower than wasmer_cranelift:

| testcase | wasmer_cranelift (s) | wasmtime (s) | ratio |
|---|---|---|---|
| memory_copy_1 | 12.1567 | 39.8686 | 3.28x |
| memory_copy_2 | 13.4606 | 36.9620 | 2.75x |
| memory_copy_3 | 19.7391 | 36.3320 | 1.84x |
| memory_copy_4 | 23.0015 | 36.1513 | 1.57x |
| memory_copy_5 | 9.9472 | 37.0502 | 3.72x |
Related `memory_fill_*` seeds also showed the same direction:

| testcase | wasmer_cranelift (s) | wasmtime (s) | ratio |
|---|---|---|---|
| memory_fill_1 | 9.8347 | 31.9666 | 3.25x |
| memory_fill_2 | 12.0900 | 32.1992 | 2.66x |
| memory_fill_3 | 12.9405 | 34.5910 | 2.67x |
### Versions and Environment
- wasmtime: 41.0.0 (4898322 2025-12-18)
- wasmer: 6.1.0
- WAMR: iwasm 2.4.4
- wasmedge: 0.16.1-18-gc457fe30
- wabt: 1.0.39
- llvm: 21.1.5
- Host OS: Ubuntu 22.04.5 LTS x64
- CPU: 12th Gen Intel® Core™ i7-12700 × 20
### Extra Info
For the primary reduced testcase, I also checked the Wasmtime CLIF output to make sure the benchmark loop is still alive.

I generated CLIF with:

```shell
wasmtime compile -C cache=n --emit-clif out_dir primary_reproducer_memory_copy_len32.wasm
```

In the generated CLIF for the hot loop, the operation is still lowered through a per-iteration builtin call equivalent to:

```
call fn0(vmctx, 0, dst, 0, src, len)
```

The emitted builtin `wasmtime_builtin_memory_copy` still performs a deeper indirect runtime call:

```
v11 = call_indirect sig0, v10(v0, v1, v2, v3, v4, v5)
```

So this does not look like dead-code elimination or a broken benchmark scaffold.
The strongest trigger condition I can currently support is:
- repeated small bounded bulk-memory operations;
- one-page memory with a hot low-memory window;
- slowdown present for both `memory.copy` and `memory.fill`;
- largely independent of copy length (`0..64` in this sweep) and src/dst relation.
I have not confirmed the internal root cause, so I’m only reporting the measured trigger pattern here.