Commit 1613935
# perf: SIMD-accelerated FastBase64 for Scala Native via C FFI (#749)
## Motivation
On Scala Native, `java.util.Base64` is a pure-Scala implementation built on
wrapper objects, a `@tailrec` recursive `iterate()`, and per-byte pattern
matching, which makes it significantly slower than HotSpot's
intrinsic-backed implementation.
Beyond the raw codec, `base64DecodeBytes` was creating `Array[Eval](N)`
and filling each slot with `Val.cachedNum` — N allocations for an N-byte
decode. The materializer then needed per-element type dispatch to render
these arrays. And `base64` encode output (guaranteed ASCII-safe) was
still being scanned for JSON escape characters. `Val.Arr` carried inline
`_isRange`/`_byteData` fields that bloated every regular array instance
(~13 bytes wasted per non-specialized array).
## Modification
### 1. Platform-agnostic `FastBase64` encoder/decoder
- Pre-computed lookup tables: `ENCODE_TABLE` (`char[64]`) and `DECODE_TABLE`
(`int[256]`)
- `encodeString()`: ASCII fast path does direct char→char encoding
without intermediate `byte[]`
- `decodeToString()` / `decodeToBytes()`: Direct string→bytes via lookup
table
- ISO-8859-1 compatibility: chars > 0xFF are replaced with 0x3F ('?'),
matching `java.util.Base64` behavior
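The table-driven scalar codec can be sketched as follows. This is an illustrative Java analogue (the actual `FastBase64` is Scala; the class and method names here are mine): an RFC 4648 encode loop that packs 3 input bytes into a 24-bit word and emits 4 chars from the pre-computed `ENCODE_TABLE`, with `=` padding for the tail.

```java
// Illustrative sketch of a table-driven base64 encoder (names are not sjsonnet's).
public class FastBase64Sketch {
    static final char[] ENCODE_TABLE =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"
            .toCharArray();

    // Encode: 3 input bytes -> one 24-bit word -> 4 output chars, '=' padding
    // per RFC 4648. No intermediate byte[] is needed on the output side.
    static String encode(byte[] in) {
        StringBuilder out = new StringBuilder(((in.length + 2) / 3) * 4);
        int i = 0;
        for (; i + 3 <= in.length; i += 3) {
            int n = ((in[i] & 0xff) << 16) | ((in[i + 1] & 0xff) << 8) | (in[i + 2] & 0xff);
            out.append(ENCODE_TABLE[(n >>> 18) & 63]).append(ENCODE_TABLE[(n >>> 12) & 63])
               .append(ENCODE_TABLE[(n >>> 6) & 63]).append(ENCODE_TABLE[n & 63]);
        }
        int rem = in.length - i;
        if (rem == 1) {                       // 1 trailing byte -> 2 chars + "=="
            int n = (in[i] & 0xff) << 16;
            out.append(ENCODE_TABLE[(n >>> 18) & 63]).append(ENCODE_TABLE[(n >>> 12) & 63])
               .append("==");
        } else if (rem == 2) {                // 2 trailing bytes -> 3 chars + "="
            int n = ((in[i] & 0xff) << 16) | ((in[i + 1] & 0xff) << 8);
            out.append(ENCODE_TABLE[(n >>> 18) & 63]).append(ENCODE_TABLE[(n >>> 12) & 63])
               .append(ENCODE_TABLE[(n >>> 6) & 63]).append('=');
        }
        return out.toString();
    }
}
```

The same tables drive the decode direction, with `DECODE_TABLE` mapping each alphabet char back to its 6-bit value and every other index to a sentinel.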
### 2. C FFI SIMD base64 for Scala Native (`sjsonnet_base64.c`)
- **AArch64 NEON**: `vld3`/`vst4` interleaved load/store + `vqtbl4q`
64-byte lookup for encode; `vbslq`/`vmovl_u8`/`vmovn_u16` for byte↔char
widening/narrowing
- **x86_64**: SSSE3/AVX2/AVX-512 VBMI paths via
`pshufb`/`vpshufb`/`vpermi2b`
- **Fallback**: Scalar with loop unrolling for other architectures
- `sjsonnet_base64_decode_validated()`: Single-pass validation + decode
with specific error codes
- RFC 4648 compliant with '=' padding
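The SIMD kernels themselves are C intrinsics, but the single-pass validate-and-decode contract is worth spelling out. Below is a hedged Java sketch of that contract (names and error-code values are mine, not the C API's): one pass over the input that either fills the output buffer and returns the decoded length, or bails with a specific negative error code.

```java
// Illustrative single-pass base64 validation + decode (not the C FFI code).
public class Base64DecodeSketch {
    static final int ERR_BAD_CHAR = -1;   // char outside the base64 alphabet
    static final int ERR_BAD_LENGTH = -2; // input length not a multiple of 4

    static final int[] D = new int[256];  // char -> 6-bit value, -1 = invalid
    static {
        java.util.Arrays.fill(D, -1);
        String abc = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
        for (int i = 0; i < 64; i++) D[abc.charAt(i)] = i;
    }

    // Decodes into `out` (caller sizes it to at least len/4*3 bytes).
    // Returns decoded length >= 0, or a negative error code.
    static int decode(String s, byte[] out) {
        int len = s.length();
        if (len % 4 != 0) return ERR_BAD_LENGTH;
        int o = 0;
        for (int i = 0; i < len; i += 4) {
            boolean last = (i + 4 == len);
            int p = 0;                                   // '=' padding in this quad
            if (last && s.charAt(i + 3) == '=') {
                p++;
                if (s.charAt(i + 2) == '=') p++;
            }
            int n = 0;
            for (int j = 0; j < 4 - p; j++) {
                char ch = s.charAt(i + j);
                if (ch > 0xFF) return ERR_BAD_CHAR;      // non-Latin-1 char
                int v = D[ch];
                if (v < 0) return ERR_BAD_CHAR;          // validation fused into decode
                n |= v << (18 - 6 * j);
            }
            out[o++] = (byte) (n >>> 16);
            if (p < 2) out[o++] = (byte) (n >>> 8);
            if (p < 1) out[o++] = (byte) n;
        }
        return o;
    }
}
```

Fusing validation into the decode loop avoids a separate scan over the input, which is the same property the C `sjsonnet_base64_decode_validated()` entry point is described as providing.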
### 3. Native-specific optimizations
- Reusable module-level buffers (safe: Scala Native is single-threaded)
— eliminates per-call array allocations
- ASCII fast-path in `encodeString`: skip UTF-8 encoding for pure ASCII
strings
- Direct char array construction instead of charset lookup
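The buffer-reuse pattern is simple but worth making concrete. A minimal sketch, assuming (as the PR states) a single-threaded runtime so one module-level scratch buffer can serve every call without races; the names are illustrative:

```java
// Illustrative module-level scratch buffer, safe only because Scala Native
// is single-threaded: no lock, no per-call allocation once warmed up.
public class BufferReuse {
    private static byte[] scratch = new byte[64];   // grows on demand, never shrinks

    // Returns a buffer of at least `needed` bytes, reusing the old one when possible.
    static byte[] acquire(int needed) {
        if (scratch.length < needed) {
            int cap = Integer.highestOneBit(needed - 1) << 1;  // next power of two
            scratch = new byte[Math.max(cap, needed)];
        }
        return scratch;
    }
}
```

On the JVM this would need a `ThreadLocal` or per-call allocation; the single-threaded guarantee is what makes the bare static field safe here.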
### 4. `RangeArr` and `ByteArr` subclasses of `Val.Arr`
- `Val.Arr` changed from `final class` to non-final `class`, enabling
specialization
- **`RangeArr extends Arr`**: Lazy integer range — keeps `rangeFrom`
field out of regular arrays, saving ~9 bytes per non-range array (merges
#772)
- **`ByteArr extends Arr`**: Compact `Array[Byte]` backing store for
0–255 integer arrays
- `byteData` is an immutable `val` — never cleared after
materialization, guaranteeing `rawBytes` is always non-null for safe
multi-use
- `reversed()` materializes first to keep `value()`/`eval()` simple and
avoid reversed-index bugs
- `rawBytes` accessor enables zero-copy fast paths in `base64` encode
and materializer
- Callers use pattern match (`case ba: Val.ByteArr =>`) instead of
null-returning `rawBytes` on base class
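The shape of the `Arr`/`ByteArr` split can be illustrated with a Java analogue (the real definitions are Scala in `Val.scala`; these names and fields are mine): the base class stays lean, only the subclass carries the compact byte store, and callers opt into the zero-copy path via a type test, mirroring `case ba: Val.ByteArr =>`.

```java
// Illustrative analogue of the Val.Arr / ByteArr specialization split.
public class ArrSketch {
    // Base array: generic storage, no specialization fields bloating it.
    static class Arr {
        private final double[] values;
        Arr(double[] values) { this.values = values; }
        double value(int i) { return values[i]; }
        int length() { return values.length; }
    }

    // Compact subclass: byte-backed store for 0-255 integer arrays. The
    // backing array is final and never cleared, so rawBytes stays usable
    // across multiple materializations.
    static final class ByteArr extends Arr {
        final byte[] rawBytes;
        ByteArr(byte[] bytes) {
            super(new double[0]);   // base storage unused; all accessors overridden
            this.rawBytes = bytes;
        }
        @Override double value(int i) { return rawBytes[i] & 0xff; }
        @Override int length() { return rawBytes.length; }
    }

    // Consumer-side pattern match, analogous to `case ba: Val.ByteArr =>`.
    static long sum(Arr a) {
        if (a instanceof ByteArr) {               // zero-copy fast path
            long s = 0;
            for (byte b : ((ByteArr) a).rawBytes) s += b & 0xff;
            return s;
        }
        long s = 0;                               // generic slow path
        for (int i = 0; i < a.length(); i++) s += (long) a.value(i);
        return s;
    }
}
```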
### 5. Materializer fast-path for byte arrays
- Recursive, iterative, and fused ByteRenderer paths all detect
`ByteArr` via pattern match
- Skip `value(i)` lookup + type dispatch + `asDouble` conversion
- Directly emit `visitFloat64((bytes(i) & 0xff).toDouble)` in a tight
loop
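The fast path above boils down to a tight loop once the raw bytes are in hand. A minimal sketch, with `Visitor` standing in for the upickle-style visitor the materializer feeds (the interface name is mine):

```java
// Illustrative byte-array materializer fast path: no per-element value()
// lookup, no type dispatch, no asDouble conversion.
public class MaterializerSketch {
    interface Visitor { void visitFloat64(double d); }

    static void emitBytes(byte[] bytes, Visitor v) {
        for (int i = 0; i < bytes.length; i++) {
            // Unsigned widening, matching (bytes(i) & 0xff).toDouble.
            v.visitFloat64(bytes[i] & 0xff);
        }
    }
}
```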
### 6. ASCII-safe string rendering
- `Val.Str._asciiSafe` flag marks strings known to contain only
printable ASCII (no JSON escaping needed)
- `Val.Str.asciiSafe(pos, s)` factory for creating flagged strings
- `BaseByteRenderer.renderAsciiSafeString()` skips SWAR escape scanning
and UTF-8 encoding — writes bytes directly from chars
- `base64` encode output is marked as ASCII-safe since base64 alphabet
is `[A-Za-z0-9+/=]`
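What "ASCII-safe" buys can be sketched concretely (illustrative Java; method names are mine, not `BaseByteRenderer`'s): a string qualifies when every char is printable ASCII needing no JSON escape, and such a string can be rendered char-to-byte with no escape scan and no UTF-8 encoder.

```java
// Illustrative ASCII-safe check and direct string rendering.
public class AsciiSafeSketch {
    // Safe = printable ASCII, and no JSON-escaped chars ('"' or '\\').
    // Every char of the base64 alphabet [A-Za-z0-9+/=] passes this test.
    static boolean isAsciiSafe(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < 0x20 || c > 0x7e || c == '"' || c == '\\') return false;
        }
        return true;
    }

    // Direct char->byte copy between quotes: no SWAR scan, no UTF-8 encoding.
    static byte[] renderAsciiSafe(String s) {
        byte[] out = new byte[s.length() + 2];
        out[0] = '"';
        for (int i = 0; i < s.length(); i++) out[i + 1] = (byte) s.charAt(i);
        out[s.length() + 1] = '"';
        return out;
    }
}
```

For base64 output the check itself is unnecessary, which is why the flag is set at construction time rather than re-scanned at render time.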
### 7. `EncodingModule` updates
- `base64DecodeBytes`: Uses `Val.Arr.fromBytes(pos, decoded)` — one
allocation instead of N
- `base64` encode: Pattern matches `ByteArr` for zero-copy bypass;
output marked `asciiSafe`
## Benchmark Results
### JMH (JVM, Scala 3.3.7, Apple Silicon M4 Max)
| Benchmark | Master (ms/op) | PR (ms/op) | Change |
|-----------|---------------|------------|--------|
| base64 | 0.153 | 0.145 | **-5.2%** |
| base64Decode | 0.117 | 0.115 | -1.7% |
| base64DecodeBytes | 5.692 | 5.109 | **-10.2%** |
| base64_byte_array | 0.757 | 0.758 | ~same |
| base64_stress | — | 0.188 | (new) |
### Scala Native (hyperfine -N, 30 runs, Apple Silicon M4 Max)
Compared against jrsonnet **0.5.0-pre98** (built from source, `cargo
build --release`).
| Benchmark | sjsonnet master | sjsonnet PR | jrsonnet 0.5.0 | PR vs master | PR vs jrsonnet |
|-----------|----------------|-------------|----------------|--------------|----------------|
| base64 | 8.7ms | 6.5ms | 4.4ms | **1.34× faster** | 1.47× slower |
| base64Decode | 7.3ms | 6.8ms | 4.3ms | 1.07× faster | 1.60× slower |
| base64DecodeBytes | 28.7ms | 13.5ms | 20.1ms | **2.13× faster** | **1.50× faster** |
| base64_byte_array | 10.5ms | 8.5ms | 17.3ms | **1.23× faster** | **2.02× faster** |
| base64_stress | 6.6ms | 6.3ms | 5.0ms | ~same | 1.28× slower |
**Compute-heavy benchmarks** (`base64DecodeBytes`, `base64_byte_array`):
sjsonnet significantly outperforms jrsonnet — 1.50× and 2.02× faster
respectively.
**Small benchmarks** (`base64`, `base64Decode`, `base64_stress`):
jrsonnet is faster due to lower startup overhead (~3ms vs ~5ms). The
actual base64 computation time is comparable; the gap is dominated by
process startup.
## Files Changed
| File | Change |
|------|--------|
| `sjsonnet/src/sjsonnet/Val.scala` | `Arr` non-final, `RangeArr` + `ByteArr` subclasses, `_asciiSafe` flag, `asciiSafe` factory |
| `sjsonnet/src/sjsonnet/Materializer.scala` | ByteArr pattern-match fast path in recursive + iterative paths |
| `sjsonnet/src/sjsonnet/ByteRenderer.scala` | ByteArr fast path in fused materializer + ASCII-safe string dispatch |
| `sjsonnet/src/sjsonnet/BaseByteRenderer.scala` | `renderAsciiSafeString()` for escape-free rendering |
| `sjsonnet/src/sjsonnet/stdlib/EncodingModule.scala` | `fromBytes` for DecodeBytes, ByteArr match for encode, `asciiSafe` for output |
| `sjsonnet/src-js/sjsonnet/stdlib/FastBase64.scala` | Pure Scala implementation (JS/WASM) |
| `sjsonnet/src-jvm/sjsonnet/stdlib/FastBase64.scala` | Delegates to `java.util.Base64` (unchanged behavior) |
| `sjsonnet/src-native/sjsonnet/stdlib/FastBase64.scala` | C FFI wrappers + buffer reuse + ASCII fast paths |
| `sjsonnet/resources/scala-native/sjsonnet_base64.c` | SIMD C implementation (NEON/SSSE3/AVX2/AVX-512 + scalar fallback) |
| `sjsonnet/test/resources/new_test_suite/byte_arr_correctness.jsonnet` | Regression tests for ByteArr (multi-use, reverse, concat, round-trip) |
| `sjsonnet/test/resources/new_test_suite/range_arr_correctness.jsonnet` | Regression tests for RangeArr correctness |
| `bench/resources/go_suite/base64_stress.jsonnet` | New benchmark for mixed encode/decode stress test |
## Result
- base64DecodeBytes **2.13× faster** than master, **1.50× faster** than
jrsonnet 0.5.0
- base64_byte_array **2.02× faster** than jrsonnet 0.5.0
- JVM base64DecodeBytes improved **10.2%** vs master
- All JVM, JS, and Native tests pass

Parent commit: 4d16e17. 15 files changed: 2079 additions, 102 deletions.