
Commit 1613935

perf: SIMD-accelerated FastBase64 for Scala Native via C FFI (#749)
## Motivation

On Scala Native, `java.util.Base64` is a pure-Scala implementation that uses Wrapper objects, `@tailrec` recursive `iterate()`, and per-byte pattern matching, which makes it significantly slower than HotSpot's intrinsic-backed implementation.

Beyond the raw codec, `base64DecodeBytes` was creating `Array[Eval](N)` and filling each slot with `Val.cachedNum`: N allocations for an N-byte decode. The materializer then needed per-element type dispatch to render these arrays. And `base64` encode output (guaranteed ASCII-safe) was still being scanned for JSON escape characters.

`Val.Arr` carried inline `_isRange`/`_byteData` fields that bloated every regular array instance (~13 bytes wasted per non-specialized array).

## Modification

### 1. Platform-agnostic `FastBase64` encoder/decoder

- `ENCODE_TABLE` (char[64]) and `DECODE_TABLE` (int[256]) pre-computed lookup tables
- `encodeString()`: ASCII fast path does direct char→char encoding without intermediate `byte[]`
- `decodeToString()` / `decodeToBytes()`: Direct string→bytes via lookup table
- ISO-8859-1 compatibility: chars > 0xFF → 0x3F ('?') matching `java.util.Base64` behavior

### 2. C FFI SIMD base64 for Scala Native (`sjsonnet_base64.c`)

- **AArch64 NEON**: `vld3`/`vst4` interleaved load/store + `vqtbl4q` 64-byte lookup for encode; `vbslq`/`vmovl_u8`/`vmovn_u16` for byte↔char widening/narrowing
- **x86_64**: SSSE3/AVX2/AVX-512 VBMI paths via `pshufb`/`vpshufb`/`vpermi2b`
- **Fallback**: Scalar with loop unrolling for other architectures
- `sjsonnet_base64_decode_validated()`: Single-pass validation + decode with specific error codes
- RFC 4648 compliant with '=' padding

### 3. Native-specific optimizations

- Reusable module-level buffers (safe: Scala Native is single-threaded), which eliminates per-call array allocations
- ASCII fast-path in `encodeString`: skip UTF-8 encoding for pure ASCII strings
- Direct char array construction instead of charset lookup

### 4. `RangeArr` and `ByteArr` subclasses of `Val.Arr`

- `Val.Arr` changed from `final class` to non-final `class`, enabling specialization
- **`RangeArr extends Arr`**: Lazy integer range; keeps `rangeFrom` field out of regular arrays, saving ~9 bytes per non-range array (merges #772)
- **`ByteArr extends Arr`**: Compact `Array[Byte]` backing store for 0–255 integer arrays
  - `byteData` is an immutable `val`, never cleared after materialization, guaranteeing `rawBytes` is always non-null for safe multi-use
  - `reversed()` materializes first to keep `value()`/`eval()` simple and avoid reversed-index bugs
  - `rawBytes` accessor enables zero-copy fast paths in `base64` encode and materializer
  - Callers use pattern match (`case ba: Val.ByteArr =>`) instead of null-returning `rawBytes` on base class

### 5. Materializer fast-path for byte arrays

- Recursive, iterative, and fused ByteRenderer paths all detect `ByteArr` via pattern match
- Skip `value(i)` lookup + type dispatch + `asDouble` conversion
- Directly emit `visitFloat64((bytes(i) & 0xff).toDouble)` in a tight loop

### 6. ASCII-safe string rendering

- `Val.Str._asciiSafe` flag marks strings known to contain only printable ASCII (no JSON escaping needed)
- `Val.Str.asciiSafe(pos, s)` factory for creating flagged strings
- `BaseByteRenderer.renderAsciiSafeString()` skips SWAR escape scanning and UTF-8 encoding, and writes bytes directly from chars
- `base64` encode output is marked as ASCII-safe since base64 alphabet is `[A-Za-z0-9+/=]`

### 7. `EncodingModule` updates

- `base64DecodeBytes`: Uses `Val.Arr.fromBytes(pos, decoded)`: one allocation instead of N
- `base64` encode: Pattern matches `ByteArr` for zero-copy bypass; output marked `asciiSafe`

## Benchmark Results

### JMH (JVM, Scala 3.3.7, Apple Silicon M4 Max)

| Benchmark | Master (ms/op) | PR (ms/op) | Change |
|-----------|---------------|------------|--------|
| base64 | 0.153 | 0.145 | **-5.2%** |
| base64Decode | 0.117 | 0.115 | -1.7% |
| base64DecodeBytes | 5.692 | 5.109 | **-10.2%** |
| base64_byte_array | 0.757 | 0.758 | ~same |
| base64_stress | n/a | 0.188 | (new) |

### Scala Native (hyperfine -N, 30 runs, Apple Silicon M4 Max)

Compared against jrsonnet **0.5.0-pre98** (built from source, `cargo build --release`).

| Benchmark | sjsonnet master | sjsonnet PR | jrsonnet 0.5.0 | PR vs master | PR vs jrsonnet |
|-----------|----------------|-------------|----------------|--------------|----------------|
| base64 | 8.7ms | 6.5ms | 4.4ms | **1.34× faster** | 1.47× slower |
| base64Decode | 7.3ms | 6.8ms | 4.3ms | 1.07× faster | 1.60× slower |
| base64DecodeBytes | 28.7ms | 13.5ms | 20.1ms | **2.13× faster** | **1.50× faster** |
| base64_byte_array | 10.5ms | 8.5ms | 17.3ms | **1.23× faster** | **2.02× faster** |
| base64_stress | 6.6ms | 6.3ms | 5.0ms | ~same | 1.28× slower |

**Compute-heavy benchmarks** (`base64DecodeBytes`, `base64_byte_array`): sjsonnet significantly outperforms jrsonnet, by 1.50× and 2.02× respectively.

**Small benchmarks** (`base64`, `base64Decode`, `base64_stress`): jrsonnet is faster due to lower startup overhead (~3ms vs ~5ms). The actual base64 computation time is comparable; the gap is dominated by process startup.

## Files Changed

| File | Change |
|------|--------|
| `sjsonnet/src/sjsonnet/Val.scala` | `Arr` non-final, `RangeArr` + `ByteArr` subclasses, `_asciiSafe` flag, `asciiSafe` factory |
| `sjsonnet/src/sjsonnet/Materializer.scala` | ByteArr pattern-match fast path in recursive + iterative paths |
| `sjsonnet/src/sjsonnet/ByteRenderer.scala` | ByteArr fast path in fused materializer + ASCII-safe string dispatch |
| `sjsonnet/src/sjsonnet/BaseByteRenderer.scala` | `renderAsciiSafeString()` for escape-free rendering |
| `sjsonnet/src/sjsonnet/stdlib/EncodingModule.scala` | `fromBytes` for DecodeBytes, ByteArr match for encode, `asciiSafe` for output |
| `sjsonnet/src-js/sjsonnet/stdlib/FastBase64.scala` | Pure Scala implementation (JS/WASM) |
| `sjsonnet/src-jvm/sjsonnet/stdlib/FastBase64.scala` | Delegates to `java.util.Base64` (unchanged behavior) |
| `sjsonnet/src-native/sjsonnet/stdlib/FastBase64.scala` | C FFI wrappers + buffer reuse + ASCII fast paths |
| `sjsonnet/resources/scala-native/sjsonnet_base64.c` | SIMD C implementation (NEON/SSSE3/AVX2/AVX-512 + scalar fallback) |
| `sjsonnet/test/resources/new_test_suite/byte_arr_correctness.jsonnet` | Regression tests for ByteArr (multi-use, reverse, concat, round-trip) |
| `sjsonnet/test/resources/new_test_suite/range_arr_correctness.jsonnet` | Regression tests for RangeArr correctness |
| `bench/resources/go_suite/base64_stress.jsonnet` | New benchmark for mixed encode/decode stress test |

## Result

- base64DecodeBytes **2.13× faster** than master, **1.50× faster** than jrsonnet 0.5.0
- base64_byte_array **2.02× faster** than jrsonnet 0.5.0
- JVM base64DecodeBytes improved **10.2%** vs master
- All JVM, JS, and Native tests pass
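
To make the table-driven scalar path from the commit message concrete, here is a minimal C sketch of a lookup-table encoder with RFC 4648 `=` padding. It is not the contents of `sjsonnet_base64.c`: the function name `sketch_base64_encode` is hypothetical, and the loop is left un-unrolled for readability.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 64-entry alphabet table (RFC 4648 standard alphabet) plus a NUL terminator. */
static const char ENCODE_TABLE[65] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

/* Hypothetical helper, for illustration only.
 * Encodes n bytes from src into dst with '=' padding and NUL-terminates dst.
 * dst must hold at least 4 * ((n + 2) / 3) + 1 bytes.
 * Returns the number of characters written, excluding the terminator. */
static size_t sketch_base64_encode(const uint8_t *src, size_t n, char *dst) {
    assert(src != NULL && dst != NULL);
    size_t i = 0, o = 0;

    /* Each full 3-byte group becomes four 6-bit table lookups. */
    for (; i + 3 <= n; i += 3) {
        uint32_t v = ((uint32_t)src[i] << 16) | ((uint32_t)src[i + 1] << 8) | src[i + 2];
        dst[o++] = ENCODE_TABLE[(v >> 18) & 0x3F];
        dst[o++] = ENCODE_TABLE[(v >> 12) & 0x3F];
        dst[o++] = ENCODE_TABLE[(v >> 6) & 0x3F];
        dst[o++] = ENCODE_TABLE[v & 0x3F];
    }

    /* A 1- or 2-byte tail is zero-extended and padded with '='. */
    if (n - i == 1) {
        uint32_t v = (uint32_t)src[i] << 16;
        dst[o++] = ENCODE_TABLE[(v >> 18) & 0x3F];
        dst[o++] = ENCODE_TABLE[(v >> 12) & 0x3F];
        dst[o++] = '=';
        dst[o++] = '=';
    } else if (n - i == 2) {
        uint32_t v = ((uint32_t)src[i] << 16) | ((uint32_t)src[i + 1] << 8);
        dst[o++] = ENCODE_TABLE[(v >> 18) & 0x3F];
        dst[o++] = ENCODE_TABLE[(v >> 12) & 0x3F];
        dst[o++] = ENCODE_TABLE[(v >> 6) & 0x3F];
        dst[o++] = '=';
    }

    dst[o] = '\0';
    return o;
}
```

The SIMD paths named in the commit message perform the same 3-byte to 4-character mapping, but over many groups per instruction; this scalar form is the kind of routine they would fall back to.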
1 parent 4d16e17 commit 1613935

15 files changed

Lines changed: 2079 additions & 102 deletions

Lines changed: 22 additions & 0 deletions
@@ -0,0 +1,22 @@
+{
+  local largeStr = std.repeat("Lorem ipsum dolor sit amet, consectetur adipiscing elit. ", 100),
+  local encoded = std.base64(largeStr),
+  local decoded = std.base64Decode(encoded),
+  local encodedArr = std.base64(std.makeArray(1000, function(i) i % 256)),
+  local decodedBytes = std.base64DecodeBytes(encodedArr),
+
+  local encoded2 = std.base64(decoded),
+  local decoded2 = std.base64Decode(encoded2),
+  local encodedArr2 = std.base64(std.makeArray(2000, function(i) (i * 7 + 13) % 256)),
+  local decodedBytes2 = std.base64DecodeBytes(encodedArr2),
+
+  local encoded3 = std.base64(decoded2),
+  local decoded3 = std.base64Decode(encoded3),
+  local encodedArr3 = std.base64(std.makeArray(3000, function(i) (i * 13 + 37) % 256)),
+  local decodedBytes3 = std.base64DecodeBytes(encodedArr3),
+
+  roundtrip_ok: decoded3 == largeStr,
+  byte_roundtrip_ok: std.length(decodedBytes3) == 3000,
+  encoded_len: std.length(encoded3),
+  decoded_len: std.length(decoded3)
+}
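
The round-trip checks in this benchmark depend on decoding being the exact inverse of encoding. As an illustration of "single-pass validation + decode with specific error codes", here is a scalar C sketch. The error-code names, the function names, and the branch-based `sketch_b64_value` (standing in for a 256-entry decode table) are all assumptions made for this example, not the API of `sjsonnet_base64_decode_validated()`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical error codes, for illustration only. */
enum {
    SKETCH_B64_ERR_LENGTH = -1,  /* input length is not a multiple of 4 */
    SKETCH_B64_ERR_CHAR = -2,    /* character outside the base64 alphabet */
    SKETCH_B64_ERR_PADDING = -3  /* '=' in an illegal position */
};

/* Maps one character to its 6-bit value, or -1 if it is not in the alphabet.
 * A real implementation would more likely index a precomputed 256-entry table. */
static int sketch_b64_value(unsigned char c) {
    if (c >= 'A' && c <= 'Z') return c - 'A';
    if (c >= 'a' && c <= 'z') return c - 'a' + 26;
    if (c >= '0' && c <= '9') return c - '0' + 52;
    if (c == '+') return 62;
    if (c == '/') return 63;
    return -1;
}

/* Decodes n characters of padded base64 into dst (capacity >= n / 4 * 3).
 * Validation and decoding happen in the same pass over the input.
 * Returns the decoded byte count, or a negative error code. */
static long sketch_base64_decode_validated(const char *src, size_t n, uint8_t *dst) {
    assert(src != NULL && dst != NULL);
    if (n % 4 != 0) return SKETCH_B64_ERR_LENGTH;

    size_t o = 0;
    for (size_t i = 0; i < n; i += 4) {
        int pad = 0;
        uint32_t v = 0;
        for (int k = 0; k < 4; k++) {
            unsigned char c = (unsigned char)src[i + k];
            if (c == '=') {
                /* Padding is only legal in the last two slots of the final group. */
                if (i + 4 != n || k < 2) return SKETCH_B64_ERR_PADDING;
                pad++;
                v <<= 6;
            } else {
                int d = sketch_b64_value(c);
                if (d < 0) return SKETCH_B64_ERR_CHAR;
                if (pad > 0) return SKETCH_B64_ERR_PADDING; /* data after '=' */
                v = (v << 6) | (uint32_t)d;
            }
        }
        /* A 24-bit group yields 3 bytes, minus one byte per '=' seen. */
        dst[o++] = (uint8_t)(v >> 16);
        if (pad < 2) dst[o++] = (uint8_t)(v >> 8);
        if (pad < 1) dst[o++] = (uint8_t)v;
    }
    return (long)o;
}
```

Returning distinct negative codes lets a caller report the specific failure (bad length, bad character, misplaced padding) without a second scan over the input.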
Lines changed: 6 additions & 0 deletions
@@ -0,0 +1,6 @@
+{
+  "byte_roundtrip_ok": true,
+  "decoded_len": 5700,
+  "encoded_len": 7600,
+  "roundtrip_ok": true
+}
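
The expected lengths follow from the padded base64 size rule: the 57-character ASCII string repeated 100 times is 5700 bytes, and every 3 input bytes produce 4 output characters. A small C sketch of that arithmetic, using a hypothetical helper name:

```c
#include <stddef.h>

/* Hypothetical helper: length of padded base64 output for n input bytes,
 * i.e. 4 * ceil(n / 3). */
static size_t sketch_base64_encoded_len(size_t n) {
    return 4 * ((n + 2) / 3);
}
```

For `n = 5700` this gives `4 * 1900 = 7600`, matching `encoded_len`; `decoded_len` is the original 5700 because the round trip is lossless.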
