### Describe the bug
Repeated small bounded bulk-memory operations appear to be much slower in Wasmtime than in Wasmer Cranelift in a minimal microbenchmark family.
I first found this in generated differential tests for memory.copy, then reduced and checked it with a smaller reproducer plus several controls. The slowdown is not limited to one seed, and it is still present after varying copy length and src/dst relation.
test_cases.zip
The clearest primary reproducer I found is:
- `primary_reproducer_memory_copy_len32.wat`

Useful supporting controls are:
- `supporting_control_memory_copy_len0.wat`
- `supporting_memory_fill_same_shape.wat`
- `supporting_memory_copy_len1.wat`
- `supporting_memory_copy_len64_safe.wat`
- `supporting_memory_copy_src_eq_dst_len32.wat`
- `supporting_memory_copy_src_plus1024_len32_safe.wat`
### Test Case
Primary reproducer loop body:

```wat
(local.get $i)
(i32.wrap_i64)
(i32.const 65504)
(i32.and)
(local.get $i)
(i32.wrap_i64)
(i32.const 1431655765)
(i32.xor)
(i32.const 65504)
(i32.and)
(i32.const 32)
(memory.copy)
```
The reduced reproducer uses:
- trip count: `2^28`
- one page of memory: `(memory 1)`
- both src/dst addresses constrained to a small low-memory window

The closest controls are:
- same shape, but `memory.copy` length changed to 0
- same shape, but `memory.copy` replaced with `memory.fill`
- same shape, but copy lengths swept across 1/4/8/16/32/64
- same shape, but src/dst relation changed to `src == dst` and `src = dst + 1024`
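For reference, the address shape of the hot loop can be modeled in a few lines of Python (this is my reading of the loop body above, not code shipped in the testcase); it checks that every iteration's dst/src pair stays in-bounds for the one-page memory, so no iteration can trap:

```python
PAGE = 65536      # one wasm page, matching (memory 1)
MASK = 65504      # 0xFFE0 from the loop body: 32-byte aligned, max value 65504
XOR = 1431655765  # 0x55555555 from the loop body
LEN = 32          # copy length in the primary reproducer

def addrs(i):
    """Model of the per-iteration dst/src computation in the hot loop."""
    dst = i & MASK
    src = (i ^ XOR) & MASK
    return dst, src

# Both 32-byte ranges always stay inside the single 64 KiB page.
assert all(
    dst + LEN <= PAGE and src + LEN <= PAGE
    for dst, src in map(addrs, range(1 << 16))
)
```

Since the mask keeps both addresses at or below 65504 and the copy length is 32, every access ends exactly at or before the page boundary.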
### Steps to Reproduce
- Build the primary testcase:
  `wat2wasm primary_reproducer_memory_copy_len32.wat -o primary_reproducer_memory_copy_len32.wasm`
- Warm up once:
  `wasmtime primary_reproducer_memory_copy_len32.wasm`
- Measure runtime:
  `perf stat -r 3 -e 'task-clock' wasmtime primary_reproducer_memory_copy_len32.wasm`
- For comparison, run the same flow on the supporting testcases listed above.
- If helpful, compare against Wasmer Cranelift with:
  `wasmer run primary_reproducer_memory_copy_len32.wasm`
  `perf stat -r 3 -e 'task-clock' wasmer run primary_reproducer_memory_copy_len32.wasm`
### Expected and Actual Results
#### Primary memory.copy reproducer and close controls

| testcase | shape | wasmer_cranelift (s) | wasmtime (s) | ratio |
|---|---|---|---|---|
| control_drop | target removed | 0.09570 | 0.08054 | 0.84x |
| memory.copy len=0 | xor-shaped src/dst, bounded window | 0.97108 | 2.61960 | 2.70x |
| memory.copy len=32 | xor-shaped src/dst, bounded window | 0.76792 | 2.68820 | 3.50x |
| memory.fill len=32 | same bounded address shape | 0.64743 | 2.25270 | 3.48x |
Observed pattern:
- the target-removed control is fast in both runtimes;
- Wasmtime is already much slower for `memory.copy len=0`;
- the slowdown remains for `memory.copy len=32`;
- a related bulk-memory instruction (`memory.fill`) shows a similar gap.

This makes the anomaly look more like bulk-memory helper/runtime-path cost than payload-movement cost.
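Given the `2^28` trip count of the reduced reproducer, the measured totals translate into rough per-iteration costs (a back-of-the-envelope calculation from the numbers above, using the `memory.copy len=32` row):

```python
TRIPS = 1 << 28  # trip count of the reduced reproducer

def per_iter_ns(total_seconds):
    """Average cost of one loop iteration in nanoseconds."""
    return total_seconds * 1e9 / TRIPS

wasmtime_ns = per_iter_ns(2.68820)  # ~10.0 ns per memory.copy
wasmer_ns = per_iter_ns(0.76792)    # ~2.9 ns per memory.copy
```

Roughly 10 ns per 32-byte copy in Wasmtime versus about 3 ns in Wasmer Cranelift is consistent with per-call overhead, rather than byte movement, dominating the loop.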
#### Length sweep for memory.copy

| testcase | wasmer_cranelift (s) | wasmtime (s) | ratio |
|---|---|---|---|
| len=0 | 0.97080 | 2.73150 | 2.81x |
| len=1 | 0.97112 | 2.99300 | 3.08x |
| len=4 | 0.89589 | 2.82370 | 3.15x |
| len=8 | 0.89769 | 2.81460 | 3.14x |
| len=16 | 0.91569 | 2.77790 | 3.03x |
| len=32 | 0.76524 | 2.65780 | 3.47x |
| len=64 (safe window) | 0.76253 | 2.68210 | 3.52x |
Observed pattern:
- from `len=0` through `len=64`, the slowdown ratio stays broadly stable;
- the main trigger does not seem to be the payload size itself.
#### Src/dst relation sweep for memory.copy len=32

| testcase | wasmer_cranelift (s) | wasmtime (s) | ratio |
|---|---|---|---|
| src == dst | 0.75430 | 2.61270 | 3.46x |
| src = dst + 1024 (safe) | 0.72937 | 2.57010 | 3.52x |
Observed pattern:
- the gap remains even when the copy is a self-copy or a fixed-offset in-bounds copy;
- this does not look specific to the original xor-shaped address relation;
- it also does not look primarily driven by overlap semantics.
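To support that last point: under the xor-shaped relation, `dst ^ src` is the constant `0x55555555 & 0xFFE0 == 21824`, and both addresses are 32-byte aligned, so the two 32-byte ranges can never overlap in the first place. A small sketch checking this exhaustively (my modeling of the loop body, not testcase code):

```python
MASK = 65504      # 0xFFE0 from the loop body
XOR = 1431655765  # 0x55555555
LEN = 32

def overlaps(i):
    """Do the dst and src ranges of iteration i overlap?"""
    dst = i & MASK
    src = (i ^ XOR) & MASK
    return max(dst, src) < min(dst, src) + LEN

# dst ^ src is always 21824, and distinct 32-byte-aligned addresses
# differ by at least 32 bytes, so no iteration ever overlaps.
assert not any(map(overlaps, range(1 << 16)))
```

So overlap handling is never exercised by the original shape, which matches the sweep showing the gap is unchanged for `src == dst` and `src = dst + 1024`.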
#### Family-level consistency

The original full-trip generated `memory_copy_*` seeds all showed wasmtime slower than wasmer_cranelift:

| testcase | wasmer_cranelift (s) | wasmtime (s) | ratio |
|---|---|---|---|
| memory_copy_1 | 12.1567 | 39.8686 | 3.28x |
| memory_copy_2 | 13.4606 | 36.9620 | 2.75x |
| memory_copy_3 | 19.7391 | 36.3320 | 1.84x |
| memory_copy_4 | 23.0015 | 36.1513 | 1.57x |
| memory_copy_5 | 9.9472 | 37.0502 | 3.72x |
Related `memory_fill_*` seeds also showed the same direction:

| testcase | wasmer_cranelift (s) | wasmtime (s) | ratio |
|---|---|---|---|
| memory_fill_1 | 9.8347 | 31.9666 | 3.25x |
| memory_fill_2 | 12.0900 | 32.1992 | 2.66x |
| memory_fill_3 | 12.9405 | 34.5910 | 2.67x |
### Versions and Environment
- wasmtime: 41.0.0 (4898322 2025-12-18)
- wasmer: 6.1.0
- WAMR: iwasm 2.4.4
- wasmedge: 0.16.1-18-gc457fe30
- wabt: 1.0.39
- llvm: 21.1.5
- Host OS: Ubuntu 22.04.5 LTS x64
- CPU: 12th Gen Intel® Core™ i7-12700 × 20
### Extra Info
For the primary reduced testcase, I also checked the Wasmtime CLIF output to make sure the benchmark loop is still alive.

I generated CLIF with:

```shell
wasmtime compile -C cache=n --emit-clif out_dir primary_reproducer_memory_copy_len32.wasm
```

In the generated CLIF for the hot loop, the operation is still lowered through a per-iteration builtin call equivalent to:

```
call fn0(vmctx, 0, dst, 0, src, len)
```

The emitted builtin `wasmtime_builtin_memory_copy` still performs a deeper indirect runtime call:

```
v11 = call_indirect sig0, v10(v0, v1, v2, v3, v4, v5)
```

So this does not look like dead-code elimination or a broken benchmark scaffold.
The strongest trigger condition I can currently support is:
- repeated small bounded bulk-memory operations;
- one-page memory with a hot low-memory window;
- slowdown present for both `memory.copy` and `memory.fill`;
- largely independent of copy length (`0..64` in this sweep) and src/dst relation.
I have not confirmed the internal root cause, so I’m only reporting the measured trigger pattern here.