Commit 82301d7

perf: Full byte[] rendering pipeline with SWAR escape scanning and fused materializer (#745)
## Motivation

The rendering pipeline is the dominant cost in sjsonnet's output path. On Scala Native, `realistic2` materialization alone takes ~190ms out of ~270ms total (70%). The existing pipeline routes through `char[]` buffers → `OutputStreamWriter` → UTF-8 encoding → `byte[]` → `OutputStream`, adding unnecessary conversion layers for what is predominantly ASCII JSON output.

This PR introduces a full `byte[]` rendering pipeline that eliminates the char-to-byte conversion entirely, adds SWAR (SIMD Within A Register) escape-character scanning, zero-allocation integer rendering, and a fused materializer that bypasses the upickle Visitor dispatch interface.

## Key Design Decisions

1. **byte[] pipeline over char[]**: `BaseByteRenderer` mirrors `BaseCharRenderer` but uses `upickle.core.ByteBuilder` (byte[]) instead of `CharBuilder` (char[]), writing directly to `OutputStream`. This eliminates the `OutputStreamWriter` UTF-8 encoding layer and halves buffer memory for ASCII content.
2. **SWAR escape-char scanning**: `CharSWAR` processes 8 bytes per iteration using bitwise parallel techniques (Hacker's Delight Ch. 6 zero-detection) to detect `"`, `\`, and control chars. Platform-specific implementations: JVM uses `VarHandle` for misaligned reads, Scala Native uses `Intrinsics.loadLong` + `ByteArray.atRawUnsafe`, JS falls back to scalar loops.
3. **Two-tier string rendering**: Short strings (< 128 chars) use a fused encode+check loop with zero allocation. Long strings (≥ 128 chars) use `getBytes(UTF-8)` + SWAR bulk scan + `arraycopy`. The SWAR pre-scan determines if the fast path (direct copy) can be taken, avoiding per-character escape processing for clean strings.
4. **Digit-pair lookup table**: Integer rendering uses two-digits-at-a-time conversion via `DIGIT_TENS`/`DIGIT_ONES` lookup tables, writing backward into a scratch buffer then bulk-copying. Eliminates `Long.toString` allocation for the most common numeric output.
5. **Fused materializer+renderer**: `ByteRenderer.materializeDirect()` walks the `Val` tree and writes JSON bytes directly, bypassing the upickle `Visitor` interface entirely (no `visitObject`/`visitArray`/`visitKey`/`visitValue`/`subVisitor` virtual dispatch). Uses `@switch` on `valTag` for O(1) type routing. Falls back to the generic `Materializer.apply0` path for deeply nested structures.
6. **Reusable visitor instances**: Pre-allocated `ArrVisitor`/`ObjVisitor` fields with a `Long` bitset for empty-state tracking (bit per nesting level, supports 64 levels). Eliminates per-array/per-object anonymous class allocation in the non-fused visitor path.
7. **Bulk indentation**: `renderIndent` uses `System.arraycopy` from a pre-allocated 64-byte spaces buffer instead of character-by-character append.
8. **Native fwrite direct stdout**: `NativeOutputStream` bypasses the Scala Native JVM compat layer (`PrintStream.write (synchronized)` → `FileOutputStream` → `FileChannelImpl` → `unistd.write`) with direct `stdio.fwrite(buf.at(off), 1, len, file)`. Eliminates per-write synchronization and syscall indirection.

## Modifications

### New files

**`BaseByteRenderer.scala`** (shared `src/`): Byte-oriented JSON renderer extending `ujson.JsVisitor[OutputStream, OutputStream]`. Handles all JSON primitives, string rendering (short/long paths), integer rendering (digit-pair tables), and indentation. Provides `renderQuotedString` for the fused path.

**`ByteRenderer.scala`** (shared `src/`): sjsonnet-specific byte renderer with custom double formatting (matching google/jsonnet output), empty `{ }`/`[ ]` rendering, reusable visitor instances, and the fused materializer (`materializeDirect`, `materializeChild`, `materializeDirectObj`, `materializeDirectArr`).

**`CharSWAR.java`** (JVM `src-jvm/`): SWAR scanner using `VarHandle.get(byte[], offset)` for misaligned 8-byte reads. Handles both `String` (via `getChars` to char[]) and `byte[]` inputs.

**`CharSWAR.scala`** (Native `src-native/`): SWAR scanner using `Intrinsics.loadLong` + `ByteArray.atRawUnsafe` for zero-overhead bulk reads.

**`CharSWAR.scala`** (JS `src-js/`): Scalar fallback for Scala.js (no SWAR, because JS lacks raw memory access).

**`NativeOutputStream.scala`** (Native `src-native/`): Direct `fwrite`-based OutputStream for Scala Native, bypassing the JVM compat chain.

### Modified files

**`SjsonnetMainBase.scala`**: File output and stdout paths now use `ByteRenderer` directly (bypassing `OutputStreamWriter`). Stdout path returns a sentinel value to avoid re-printing already-written output. Added `rawOutputStream` parameter to support Native fwrite bypass.

**`SjsonnetMain.scala`** (Native): Passes `NativeOutputStream(stdio.stdout)` as `rawOutputStream`.

**`Interpreter.scala`**: `materialize()` detects `ByteRenderer` and routes to the fused `materializeDirect()` path, bypassing the generic `Materializer.apply0` visitor dispatch.

**`BaseCharRenderer.scala`**: `visitNonNullString` now uses `CharSWAR.hasEscapeChar` for pre-scanning. Added `writeLongDirect` with digit-pair lookup tables. Added companion object with lookup tables.

**`Renderer.scala`**: `visitFloat64` inlined to avoid `RenderUtils.renderDouble` String allocation: it uses `writeLongDirect` for integers, `BigDecimal` for whole-number doubles, `d.toString` for fractionals.

**`Materializer.scala`**: Fixed `Apply`/`Apply0-3` pattern match arity for auto-TCO `strict` field (upstream `ecdd0b6`).

## Benchmark Results

### Hyperfine (Scala Native, `realistic2`, averaged over 2 rounds)

| Config | Master (ms) | This PR (ms) | Speedup |
|--------|:-----------:|:------------:|:-------:|
| stdout | 270 ± 5 | 175 ± 6 | **1.55x (35% faster)** |
| stdout `-p` | 250 ± 4 | 162 ± 3 | **1.54x (35% faster)** |
| file `-o` | 449 ± 69 | 405 ± 69 | 1.11x (IO bound) |

Output correctness verified: `diff` confirms byte-identical output between master and this PR.

## Analysis

The byte[] pipeline optimization stacks four independent wins:

1. **OutputStreamWriter elimination** (~10%): Removing the char[]→UTF-8→byte[] conversion layer. Most impactful for file output where the full `OutputStreamWriter` synchronization overhead applies.
2. **SWAR escape scanning** (~5%): 8x throughput for escape-char detection on clean strings (the common case). The SWAR pre-scan gates a fast bulk-copy path, avoiding per-character processing.
3. **Fused materializer** (~15-20%): Eliminating Visitor interface virtual dispatch. On JVM with JIT, devirtualization handles most of this automatically. On Scala Native without JIT, every `visitObject`/`subVisitor`/`visitKey`/`visitValue`/`visitEnd` call is a vtable lookup + indirect branch; the fused path replaces all of these with direct method calls.
4. **Native fwrite bypass** (~5%): Eliminating `PrintStream` synchronized lock + `FileChannelImpl` indirection on every write. `stdio.fwrite` has internal buffering and batches small writes before syscall.

## Notes

The `lazy_reverse_correctness.jsonnet` test failure on Scala 2.13.18 is a **pre-existing upstream bug** from PR #741 (lazy reverse array). Upstream master itself does not compile on 2.13 due to the auto-TCO pattern match arity issue (ecdd0b6), so this test was never run on 2.13 upstream. This PR fixes the compilation issue but exposes the runtime bug. This is not a regression introduced by this PR.

## Result

- All test suites pass on Scala 3.3.7, JS, WASM, Native
- Scala 2.13.18: 1 pre-existing upstream failure (`lazy_reverse_correctness.jsonnet`)
- No regressions detected
- Output is byte-identical to master for all test cases
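
The digit-pair integer rendering described in design decision 4 can be illustrated with a small Java sketch. This is not the code from `BaseByteRenderer`: only the `DIGIT_TENS`/`DIGIT_ONES` table names come from the commit message, while the class `DigitPairSketch`, the methods `writeBackward` and `render`, and the per-call scratch allocation are assumptions made to keep the example self-contained (the commit describes a reused scratch buffer).

```java
import java.nio.charset.StandardCharsets;

final class DigitPairSketch {
    // DIGIT_TENS[n] / DIGIT_ONES[n] hold the ASCII tens / ones digit of n, for 0 <= n < 100.
    static final byte[] DIGIT_TENS = new byte[100];
    static final byte[] DIGIT_ONES = new byte[100];

    static {
        for (int n = 0; n < 100; n++) {
            DIGIT_TENS[n] = (byte) ('0' + n / 10);
            DIGIT_ONES[n] = (byte) ('0' + n % 10);
        }
    }

    /** Writes the decimal form of v backward into scratch and returns the start index. */
    static int writeBackward(long v, byte[] scratch) {
        int pos = scratch.length;
        boolean negative = v < 0;
        // Work with a non-positive value so Long.MIN_VALUE needs no special case.
        long n = negative ? v : -v;
        while (n <= -100) {
            long q = n / 100;            // Java division truncates toward zero
            int r = (int) (q * 100 - n); // the two lowest digits, 0..99
            n = q;
            scratch[--pos] = DIGIT_ONES[r];
            scratch[--pos] = DIGIT_TENS[r];
        }
        int r = (int) -n;                // remaining 0..99
        scratch[--pos] = DIGIT_ONES[r];
        if (r >= 10) scratch[--pos] = DIGIT_TENS[r];
        if (negative) scratch[--pos] = '-';
        return pos;
    }

    /** Convenience wrapper for illustration; a renderer would copy the bytes to its output buffer instead. */
    static String render(long v) {
        byte[] scratch = new byte[20];   // enough for "-9223372036854775808"
        int start = writeBackward(v, scratch);
        return new String(scratch, start, scratch.length - start, StandardCharsets.US_ASCII);
    }
}
```

For example, `DigitPairSketch.render(-12345)` produces `"-12345"` using three table lookups rather than five single-digit divisions.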
1 parent 097f44e commit 82301d7

10 files changed

Lines changed: 1232 additions & 53 deletions

Lines changed: 36 additions & 0 deletions
@@ -0,0 +1,36 @@
+package sjsonnet
+
+/** Scalar fallback for Scala.js — no SWAR, per-char scan. */
+object CharSWAR {
+  def hasEscapeChar(s: String): Boolean = {
+    var i = 0
+    val len = s.length
+    while (i < len) {
+      val c = s.charAt(i)
+      if (c < 32 || c == '"' || c == '\\') return true
+      i += 1
+    }
+    false
+  }
+
+  def hasEscapeChar(arr: Array[Char], from: Int, to: Int): Boolean = {
+    var i = from
+    while (i < to) {
+      val c = arr(i)
+      if (c < 32 || c == '"' || c == '\\') return true
+      i += 1
+    }
+    false
+  }
+
+  /** Scalar scan for byte[] — used by ByteRenderer for UTF-8 encoded data. */
+  def hasEscapeChar(arr: Array[Byte], from: Int, to: Int): Boolean = {
+    var i = from
+    while (i < to) {
+      val b = arr(i) & 0xff
+      if (b < 32 || b == '"' || b == '\\') return true
+      i += 1
+    }
+    false
+  }
+}
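
The byte-level overload above scans UTF-8 data one byte at a time, which is safe because every byte of a multi-byte UTF-8 sequence is at least 0x80 and so can never equal `"`, `\`, or a control character. The following Java sketch illustrates that property by comparing a char-level scan with a byte-level scan of the same string. The class `Utf8ScanSketch` and its method names are hypothetical and exist only for this illustration.

```java
import java.nio.charset.StandardCharsets;

final class Utf8ScanSketch {
    /** Char-level scan: true if any char needs JSON escaping. */
    static boolean needsEscape(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < 32 || c == '"' || c == '\\') return true;
        }
        return false;
    }

    /** Byte-level scan over arr[from..to), treating each byte as unsigned. */
    static boolean needsEscape(byte[] arr, int from, int to) {
        for (int i = from; i < to; i++) {
            int b = arr[i] & 0xFF;
            if (b < 32 || b == '"' || b == '\\') return true;
        }
        return false;
    }

    /** True if both scans give the same answer for s encoded as UTF-8. */
    static boolean scansAgree(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return needsEscape(s) == needsEscape(utf8, 0, utf8.length);
    }
}
```

Lead and continuation bytes of multi-byte sequences fall in the range 0x80 to 0xF4, so only genuine ASCII bytes can trigger the escape check.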

sjsonnet/src-jvm-native/sjsonnet/SjsonnetMainBase.scala

Lines changed: 87 additions & 12 deletions
@@ -5,6 +5,7 @@ import upickle.core.SimpleVisitor
 import java.io.{
   BufferedOutputStream,
   InputStream,
+  OutputStream,
   OutputStreamWriter,
   PrintStream,
   StringWriter,
@@ -16,6 +17,12 @@ import scala.annotation.unused
 import scala.util.Try
 
 object SjsonnetMainBase {
+
+  /**
+   * Sentinel value returned when output was already written directly to stdout via byte pipeline.
+   */
+  private val ByteRenderedSentinel = "\u0000"
+
   class SimpleImporter(
       searchRoots0: Seq[Path], // Evaluated in order, first occurrence wins
       allowedInputs: Option[Set[os.Path]] = None,
@@ -101,7 +108,33 @@ object SjsonnetMainBase {
       allowedInputs: Option[Set[os.Path]] = None,
       importer: Option[Importer] = None,
       std: Val.Obj = sjsonnet.stdlib.StdLibModule.Default.module,
-      jsonnetPathEnv: Option[String] = None): Int = {
+      jsonnetPathEnv: Option[String] = None): Int =
+    main0(
+      args,
+      parseCache,
+      null.asInstanceOf[InputStream], // stdin is @unused in the target overload
+      stdout,
+      stderr,
+      wd,
+      allowedInputs,
+      importer,
+      std,
+      jsonnetPathEnv,
+      rawOutputStream = null
+    )
+
+  def main0(
+      args: Array[String],
+      parseCache: ParseCache,
+      @unused stdin: InputStream,
+      stdout: PrintStream,
+      stderr: PrintStream,
+      wd: os.Path,
+      allowedInputs: Option[Set[os.Path]],
+      importer: Option[Importer],
+      std: Val.Obj,
+      jsonnetPathEnv: Option[String],
+      rawOutputStream: OutputStream): Int = {
 
     var hasWarnings = false
     def warn(isTrace: Boolean, msg: String): Unit = {
@@ -170,7 +203,8 @@ object SjsonnetMainBase {
           warn,
           std,
           debugStats = debugStats,
-          profileOpt = config.profile
+          profileOpt = config.profile,
+          stdoutStream = if (rawOutputStream != null) rawOutputStream else stdout
         )
         res <- {
           if (hasWarnings && config.fatalWarnings.value) Left("")
@@ -185,7 +219,20 @@ object SjsonnetMainBase {
         if (err.nonEmpty) stderr.println(err)
         1
       case Right((config, str)) =>
-        if (str.nonEmpty) {
+        if (str eq ByteRenderedSentinel) {
+          // Output was already written directly to stdout via byte pipeline.
+          // Handle trailing newline.
+          if (config.multi.isDefined || !config.noTrailingNewline.value) {
+            if (rawOutputStream != null) {
+              rawOutputStream.write('\n')
+              rawOutputStream.flush()
+            } else {
+              stdout.write('\n')
+              stdout.flush()
+            }
+          } else if (rawOutputStream != null) rawOutputStream.flush()
+          else stdout.flush()
+        } else if (str.nonEmpty) {
           config.outputFile match {
             case None =>
               // In multi mode, the file list on stdout always ends with a newline,
@@ -263,12 +310,37 @@ object SjsonnetMainBase {
       jsonnetCode: String,
       path: os.Path,
       wd: os.Path,
-      getCurrentPosition: () => Position) = {
-    writeToFile(config, wd) { writer =>
-      val renderer = rendererForConfig(writer, config, getCurrentPosition)
-      val res = interp.interpret0(jsonnetCode, OsPath(path), renderer)
-      if (config.yamlOut.value && !config.noTrailingNewline.value) writer.write('\n')
-      res
+      getCurrentPosition: () => Position,
+      stdoutStream: OutputStream) = {
+    config.outputFile match {
+      case Some(f) if !config.yamlOut.value && !config.expectString.value =>
+        // Byte[] fast path: render directly to OutputStream, bypassing OutputStreamWriter.
+        // ByteBuilder handles buffering internally (8KB threshold), no BufferedOutputStream needed.
+        handleWriteFile(
+          os.write.over.outputStream(os.Path(f, wd), createFolders = config.createDirs.value)
+        ).flatMap { out =>
+          try {
+            val renderer = new ByteRenderer(out, indent = config.indent)
+            val res = interp.interpret0(jsonnetCode, OsPath(path), renderer)
+            out.flush()
+            res.map(_ => "")
+          } finally out.close()
+        }
+      case None if stdoutStream != null && !config.yamlOut.value && !config.expectString.value =>
+        // Byte[] fast path for stdout: render directly to OutputStream,
+        // bypassing StringWriter → String → println chain.
+        val renderer = new ByteRenderer(stdoutStream, indent = config.indent)
+        val res = interp.interpret0(jsonnetCode, OsPath(path), renderer)
+        stdoutStream.flush()
+        // Return sentinel to signal main0 that output was already written.
+        res.map(_ => ByteRenderedSentinel)
+      case _ =>
+        writeToFile(config, wd) { writer =>
+          val renderer = rendererForConfig(writer, config, getCurrentPosition)
+          val res = interp.interpret0(jsonnetCode, OsPath(path), renderer)
+          if (config.yamlOut.value && !config.noTrailingNewline.value) writer.write('\n')
+          res
+        }
     }
   }
 
@@ -320,7 +392,8 @@ object SjsonnetMainBase {
       std: Val.Obj,
       evaluatorOverride: Option[Evaluator] = None,
       debugStats: DebugStats = null,
-      profileOpt: Option[String] = None): Either[String, String] = {
+      profileOpt: Option[String] = None,
+      stdoutStream: OutputStream = null): Either[String, String] = {
 
     val (jsonnetCode, path) =
       if (config.exec.value) (file, wd / Util.wrapInLessThanGreaterThan("exec"))
@@ -455,9 +528,11 @@ object SjsonnetMainBase {
             Right("")
           }
 
-        case _ => renderNormal(config, interp, jsonnetCode, path, wd, () => currentPos)
+        case _ =>
+          renderNormal(config, interp, jsonnetCode, path, wd, () => currentPos, stdoutStream)
       }
-      case _ => renderNormal(config, interp, jsonnetCode, path, wd, () => currentPos)
+      case _ =>
+        renderNormal(config, interp, jsonnetCode, path, wd, () => currentPos, stdoutStream)
     }
 
     if (profilerInstance != null)
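
The stdout fast path above signals "already written" by returning a sentinel string that the caller recognizes by reference identity (`eq` in Scala) rather than by content. A rough Java equivalent of that pattern is sketched below. All names here (`SentinelSketch`, `ALREADY_WRITTEN`, `render`, `wasWrittenDirectly`) are hypothetical, and unlike the diff, which uses a string literal, the sketch allocates the sentinel with `new String(...)` so that it is guaranteed to be a distinct object from any interned literal.

```java
final class SentinelSketch {
    // A dedicated object: equal in content to "\0", but identical only to itself.
    static final String ALREADY_WRITTEN = new String("\0");

    /**
     * Either writes json straight to out and returns the sentinel,
     * or returns json so the caller can print it.
     */
    static String render(String json, boolean direct, StringBuilder out) {
        if (direct) {
            out.append(json);
            return ALREADY_WRITTEN;
        }
        return json;
    }

    /** Reference comparison on purpose: content equality would misfire on a real "\0" result. */
    static boolean wasWrittenDirectly(String result) {
        return result == ALREADY_WRITTEN;
    }
}
```

The identity check keeps the signalling channel separate from legitimate output: a program whose rendered result happens to be the one-character string `"\0"` is still printed normally in this sketch.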
Lines changed: 141 additions & 0 deletions
@@ -0,0 +1,141 @@
+package sjsonnet;
+
+import java.lang.invoke.MethodHandles;
+import java.lang.invoke.VarHandle;
+import java.nio.ByteOrder;
+import java.nio.charset.StandardCharsets;
+
+/**
+ * SWAR (SIMD Within A Register) escape-char scanner for JSON string rendering.
+ *
+ * <p>Detects characters requiring JSON escaping: control chars ({@code < 32}),
+ * double-quote ({@code '"'}), and backslash ({@code '\\'}).
+ *
+ * <p>For strings above a threshold length, converts to ISO-8859-1 bytes and
+ * processes 8 bytes at a time using {@link VarHandle} bulk reads + Hacker's
+ * Delight zero-detection formula. For shorter strings, uses a scalar charAt loop.
+ *
+ * <p>Based on the SWAR technique from Hacker's Delight Ch. 6, as used by
+ * <a href="https://github.com/netty/netty/blob/4.2/common/src/main/java/io/netty/util/internal/SWARUtil.java">
+ * Netty SWARUtil</a> and
+ * <a href="https://github.com/apache/pekko/blob/main/actor/src/main/scala/org/apache/pekko/util/SWARUtil.scala">
+ * Apache Pekko SWARUtil</a>.
+ *
+ * @see <a href="https://richardstartin.github.io/posts/finding-bytes">Finding Bytes in Arrays</a>
+ */
+final class CharSWAR {
+  private CharSWAR() {}
+
+  // VarHandle for reading longs from byte[] — replaces sun.misc.Unsafe.
+  // Following Netty VarHandleFactory pattern:
+  //   MethodHandles.byteArrayViewVarHandle(long[].class, ByteOrder)
+  private static final VarHandle LONG_VIEW =
+      MethodHandles.byteArrayViewVarHandle(long[].class, ByteOrder.nativeOrder());
+
+  // --- 8-bit SWAR constants (Netty/Pekko pattern) ---
+  //
+  // Hacker's Delight zero-detection for 8-bit lanes:
+  //   input  = word ^ pattern                  // zero bytes where byte matches
+  //   tmp    = (input & 0x7F7F...) + 0x7F7F... // carry into bit 7 iff non-zero
+  //   result = ~(tmp | input | 0x7F7F...)      // bit 7 set iff lane was zero
+
+  private static final long HOLE = 0x7F7F_7F7F_7F7F_7F7FL;
+
+  /** Broadcast '"' (0x22) to all 8 byte lanes. */
+  private static final long QUOTE = 0x2222_2222_2222_2222L;
+
+  /** Broadcast '\\' (0x5C) to all 8 byte lanes. */
+  private static final long BSLAS = 0x5C5C_5C5C_5C5C_5C5CL;
+
+  /** Mask for bits 5-7 of each byte; zero result means byte < 32. */
+  private static final long CTRL = 0xE0E0_E0E0_E0E0_E0E0L;
+
+  /** Below this length, scalar charAt is faster than SWAR + byte[] conversion. */
+  private static final int SWAR_THRESHOLD = 128;
+
+  /**
+   * Check if any char in {@code str} needs JSON string escaping.
+   * Scan-first API: call on the String before copying to the output buffer.
+   */
+  static boolean hasEscapeChar(String str) {
+    int len = str.length();
+    if (len < SWAR_THRESHOLD) {
+      return hasEscapeCharScalar(str, len);
+    }
+    // ISO-8859-1 encoding is a JVM intrinsic for LATIN1 compact strings —
+    // essentially a memcpy of the internal byte[]. Chars > 255 map to '?'
+    // (0x3F), which is safe (not a control char, not '"', not '\\').
+    byte[] bytes = str.getBytes(StandardCharsets.ISO_8859_1);
+    return hasEscapeCharSWAR(bytes, 0, bytes.length);
+  }
+
+  /**
+   * Check if any byte in {@code arr[from..to)} needs JSON string escaping.
+   * Used by ByteRenderer for in-place SWAR scan on byte[] buffers.
+   * UTF-8 multi-byte sequences never produce bytes matching '"', '\\', or &lt; 0x20,
+   * so this is safe for scanning UTF-8 encoded data.
+   */
+  static boolean hasEscapeChar(byte[] arr, int from, int to) {
+    return hasEscapeCharSWAR(arr, from, to);
+  }
+
+  /**
+   * Check if any char in {@code arr[from..to)} needs JSON string escaping.
+   */
+  static boolean hasEscapeChar(char[] arr, int from, int to) {
+    for (int i = from; i < to; i++) {
+      char c = arr[i];
+      if (c < 32 || c == '"' || c == '\\') return true;
+    }
+    return false;
+  }
+
+  private static boolean hasEscapeCharSWAR(byte[] arr, int from, int to) {
+    int i = from;
+    int limit = to - 7; // 8 bytes per VarHandle.get
+    while (i < limit) {
+      long word = (long) LONG_VIEW.get(arr, i);
+      if (swarHasMatch(word)) return true;
+      i += 8;
+    }
+    // Tail: remaining 0-7 bytes
+    while (i < to) {
+      int b = arr[i] & 0xFF;
+      if (b < 32 || b == '"' || b == '\\') return true;
+      i++;
+    }
+    return false;
+  }
+
+  /**
+   * 8-bit SWAR: returns true if any byte lane in {@code word}
+   * contains '"' (0x22), '\\' (0x5C), or a control char (&lt; 0x20).
+   *
+   * <p>Uses Netty/Pekko pattern: XOR to produce zero lanes, then
+   * Hacker's Delight formula to detect zero bytes.
+   */
+  private static boolean swarHasMatch(long word) {
+    // 1. Detect '"' via XOR + zero-detection (Netty SWARUtil.applyPattern)
+    long q = word ^ QUOTE;
+    long qz = ~((q & HOLE) + HOLE | q | HOLE);
+
+    // 2. Detect '\\' via XOR + zero-detection
+    long b = word ^ BSLAS;
+    long bz = ~((b & HOLE) + HOLE | b | HOLE);
+
+    // 3. Detect control chars: byte & 0xE0 == 0 means bits 5-7 all zero → c < 32
+    long c = word & CTRL;
+    long cz = ~((c & HOLE) + HOLE | c | HOLE);
+
+    return (qz | bz | cz) != 0L;
+  }
+
+  /** Scalar scan for String (used for short strings). */
+  private static boolean hasEscapeCharScalar(String s, int len) {
+    for (int i = 0; i < len; i++) {
+      char c = s.charAt(i);
+      if (c < 32 || c == '"' || c == '\\') return true;
+    }
+    return false;
+  }
+}
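
The zero-detection formula used by `swarHasMatch` can be exercised in isolation. The sketch below is a standalone illustration, not part of the commit: the class `SwarSketch` and the helpers `zeroLanes`, `broadcast`, and `pack` are hypothetical names, and the word is assembled with shifts instead of a `VarHandle` read so the example has no platform dependencies.

```java
final class SwarSketch {
    static final long HOLE = 0x7F7F_7F7F_7F7F_7F7FL;

    /** Returns a word whose bit 7 is set in exactly those byte lanes where x is zero. */
    static long zeroLanes(long x) {
        // (x & HOLE) + HOLE carries into bit 7 iff the low 7 bits are non-zero;
        // OR-ing x covers lanes whose own bit 7 is set; the sum never exceeds 0xFE per lane,
        // so no carry crosses a lane boundary.
        return ~(((x & HOLE) + HOLE) | x | HOLE);
    }

    /** Copies the low 8 bits of b into all eight byte lanes. */
    static long broadcast(int b) {
        return (b & 0xFFL) * 0x0101_0101_0101_0101L;
    }

    /** True if any lane holds '"', '\\', or a value below 0x20. */
    static boolean hasEscape(long word) {
        long quote = zeroLanes(word ^ broadcast('"'));    // XOR zeroes matching lanes
        long bslash = zeroLanes(word ^ broadcast('\\'));
        long ctrl = zeroLanes(word & broadcast(0xE0));    // bits 5-7 clear means byte < 32
        return (quote | bslash | ctrl) != 0L;
    }

    /** Packs the first eight bytes into a long, lane 0 in the lowest byte. */
    static long pack(byte[] eight) {
        long w = 0;
        for (int i = 0; i < 8; i++) {
            w |= (eight[i] & 0xFFL) << (8 * i);
        }
        return w;
    }
}
```

A word packed from the ASCII bytes of `abcdefgh` reports no match, while replacing any one of those bytes with `"`, `\`, or a control character makes `hasEscape` return `true`. Lane order does not matter for this check, which is why the original can read with the platform's native byte order.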
