benchsuite: add xctrace profiling and decision metadata#3298
Closed
john7rho wants to merge 50 commits into BurntSushi:master from
Conversation
Three targeted performance improvements:

1. Multiline reader buffer (searcher/mod.rs): Pre-allocate capacity up to the heap limit once upfront instead of repeatedly doubling via resize. This eliminates O(n) reallocation copies during buffer growth, since subsequent resize calls only zero-fill within existing capacity.
2. Multiline per-match printer (printer/standard.rs): Precompute line boundaries once before iterating matches, replacing the O(N*M) pattern of re-walking all lines from byte 0 for each match with an O(N+M) indexed lookup.
3. First-match-only fast path (printer/standard.rs): When only --column is needed (no coloring, replacement, per-match, only-matching, or stats), stop after finding the first match instead of materializing all matches into the Vec<Match>. This avoids unnecessary regex work in column-only mode with dense matches.

https://claude.ai/code/session_01Qcwhw3SJxupm2cnP3GPuhv
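Improvement 2 above, precomputing line boundaries once and doing an indexed lookup per match, can be sketched in Python as a language-neutral illustration of the technique (`line_starts` and `line_number_of` are hypothetical names, not ripgrep APIs):

```python
import bisect

def line_starts(haystack: bytes) -> list[int]:
    """Precompute the byte offset of every line start, once per buffer."""
    starts = [0]
    for i, b in enumerate(haystack):
        if b == 0x0A:  # b'\n'
            starts.append(i + 1)
    return starts

def line_number_of(starts: list[int], offset: int) -> int:
    """O(log N) lookup per match instead of re-walking from byte 0."""
    return bisect.bisect_right(starts, offset)  # 1-based line number

buf = b"alpha\nbeta\ngamma\n"
starts = line_starts(buf)
assert line_number_of(starts, 6) == 2  # byte 6 starts "beta", line 2
```

With M matches over N bytes this replaces O(N*M) rescanning with one O(N) pass plus O(M log N) lookups.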
This reverts commit 37b28db.
perf: pass match positions from searcher to printer, eliminating redundant regex re-search

The multiline printer was re-executing the regex via find_iter_at_in_context() to rediscover individual match positions that the searcher had already found. This was ~38% of total runtime for match-heavy multiline searches.

Now the MultiLine searcher accumulates raw match positions as it groups adjacent matches, and passes them to the printer via SinkMatch. The printer uses these positions directly when available, falling back to re-searching only for line-by-line mode (where the searcher doesn't provide positions).

Benchmarks (166K multiline matches):
- Colored output: 1.63x faster (122ms → 75ms)
- Vimgrep mode: 1.63x faster (119ms → 73ms)
- Only-matching: 2.05x faster (142ms → 69ms)
- No regression on single-line or no-match cases.

https://claude.ai/code/session_01Qcwhw3SJxupm2cnP3GPuhv
Updated description to specify optimization for ARM architecture.
Added a note about the development team behind blitzgrep.
The macOS mmap path was unconditionally disabled based on benchmarks from the pre-Apple Silicon era (2016-2022). On modern Apple Silicon (M-series), mmap on warm cache eliminates the read_to_end overhead that dominated multiline search time.

Measured on M5, 92MB file, warm page cache:
- Multiline sparse (1 match): 16.0ms -> 8.8ms (1.82x faster)
- Multiline dense (2M matches): 68ms -> 60ms (1.13x faster)
- Multiline now at parity with line-by-line for sparse matches

434 tests pass, 0 failures.
searcher: re-enable mmap on macOS for multiline performance
Clarified description of blitzgrep as a drop-in replacement.
The default thread heuristic min(available_parallelism, 12) resolves to 10 on an M5, which over-subscribes for I/O-bound directory searches. Apple Silicon's asymmetric P/E-core architecture means extra threads beyond the P-core count add kernel contention without throughput gain.

Benchmarks on /usr/share (large directory, warm cache, M5):
- 4 threads: 160ms wall, 486ms sys (optimal)
- 6 threads: 163ms wall, 749ms sys (new default)
- 10 threads: 224ms wall, 1091ms sys (old default, 37% slower)

Cap at 6 via #[cfg(all(target_os = "macos", target_arch = "aarch64"))] to provide headroom for larger machines while avoiding the contention observed at 10+ threads. Other platforms keep the existing cap of 12.

320 tests pass, 0 failures.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
core: lower default thread cap to 6 on macOS Apple Silicon
taskpolicy -c only accepts utility/background/maintenance, not user-interactive. The invalid argument caused taskpolicy to exit immediately with code 64, so all benchmarks measured ~1.8ms error time instead of actual rg performance (~0.15-2.1s). Disable taskpolicy wrapping since default QoS already schedules on P-cores for interactive workloads.
Benchmarks on M4 (4P+6E) show j4 is optimal for directory searches. Going beyond 4 threads spills onto E-cores, nearly doubling wall time:
- directory_io rare: 1.01s → 0.59s (-42%)
- directory_io common: 1.02s → 0.75s (-27%)
- directory_io no-match: 1.02s → 0.59s (-42%)

No regressions in single-file or multiline scenarios. Users on larger chips (M4 Pro/Max) can override with --threads N.
Replace hardcoded thread cap of 4 with runtime detection via
sysctlbyname("hw.perflevel0.logicalcpu"). This adapts the thread
count to the actual hardware: M4 (4P), M4 Pro (10P), M4 Max (12P),
M3 Ultra (16P), etc.
Falls back to 4 (minimum P-core count across all Apple Silicon)
if the sysctl call fails.
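The Rust code calls sysctlbyname() directly; the same detection-with-fallback logic can be sketched in Python by shelling out to sysctl (`perf_core_count` is a hypothetical helper for illustration):

```python
import subprocess

def perf_core_count(default: int = 4) -> int:
    """Number of performance cores on Apple Silicon, via the
    hw.perflevel0.logicalcpu sysctl; falls back to 4 (the minimum
    P-core count across Apple Silicon) if the query fails."""
    try:
        proc = subprocess.run(
            ["sysctl", "-n", "hw.perflevel0.logicalcpu"],
            capture_output=True, text=True, timeout=5,
        )
        if proc.returncode == 0:
            return int(proc.stdout.strip())
    except (OSError, ValueError, subprocess.TimeoutExpired):
        pass
    return default
```

On non-Apple hardware the sysctl key does not exist, so the helper returns the fallback, mirroring the behavior described above when the sysctl call fails.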
Add/arm bench
fix: use separate Vec for adjusted match_ranges to prevent in-place corruption

The previous code mutated self.match_ranges in-place with saturating_sub to rebase ranges relative to the current line. This destructive mutation corrupted the ranges for any subsequent sink_matched call (e.g., when after_context_by_line triggers a sink_matched before the intended one). This fix uses a separate adjusted_match_ranges Vec so the original ranges are never modified, and clears match_ranges after use to prevent stale data from being reused.
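The shape of the fix, building an adjusted copy rather than rebasing the shared ranges in place, can be shown with a small Python sketch (`rebase_ranges` is a hypothetical name; the real code operates on a Rust Vec of Match values):

```python
def rebase_ranges(match_ranges, line_start):
    """Return rebased copies of (start, end) ranges relative to line_start.

    Never mutates the caller's list, so a later sink call that needs the
    absolute ranges still sees them intact.
    """
    return [(max(s - line_start, 0), max(e - line_start, 0))
            for s, e in match_ranges]

ranges = [(10, 14), (20, 25)]
adjusted = rebase_ranges(ranges, 10)
assert adjusted == [(0, 4), (10, 15)]
assert ranges == [(10, 14), (20, 25)]  # originals untouched
```

The in-place version would have left `ranges` holding line-relative values, which is exactly the stale-data bug the commit describes.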
Same fix as the non-adjacent match case: defer set_match_ranges until after sink_context completes, so that after_context_by_line cannot consume the ranges meant for the actual match.
…ance-pRYtS
* perf: optimize multiline search and printing paths
* Revert "perf: optimize multiline search and printing paths" (reverts commit 37b28db)
* perf: pass match positions from searcher to printer, eliminating redundant regex re-search
* fix: use separate Vec for adjusted match_ranges to prevent in-place corruption
* fix: set match_ranges after sink_context in MultiLine::run() final flush
---------
Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: devin-ai-integration[bot] <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Updated the attribution of the development team to include a hyperlink.
Updated the link to Byzantine Labs in the README.
searcher: skip unused multiline match range bookkeeping
On Apple Silicon, the SliceByLine strategy (used when mmap is enabled for single-line patterns) faults in the entire file, while the streaming ReadByLine path keeps the working set in L1/L2 cache via a 64KB buffer. This yields ~35% faster single-file searches with --mmap. Multiline searches still use mmap where it avoids a heap copy. Non-Apple-Silicon platforms are completely unaffected.

Benchmarked on M4 (independently verified):
- mmap_vs_read --mmap: -34.5% to -35.6% (p < 0.0001)
- mmap_multiline literal --mmap: -34.5% to -35.9% (p < 0.0001)
- No regressions in any scenario
searcher: skip mmap for single-line searches on Apple Silicon
searcher: retune auto-mmap and inline multiline ranges
When case_insensitive is enabled and all patterns are pure ASCII literals (no metacharacters), bypass the full regex parse/translate path and directly emit HIR character classes (e.g. `a` -> `[aA]`). This lets the regex engine extract literals for its prefilter, recovering the SIMD memchr fast path that is currently lost when `-i` is used.

Previously, case-insensitive patterns went through the generic regex parser, which produced opaque HIR that blocked literal extraction, causing a 1.4x slowdown vs case-sensitive.

Also removes dead commented-out mmap code from earlier evaluation.
fix: guard ASCII case-insensitive fast path against banned bytes and unicode mode

The new ASCII case-insensitive literal fast path had two bugs:
1. It skipped ban::check, allowing patterns with banned bytes (e.g. NUL) to bypass the ban validation that the normal regex path enforces.
2. It was enabled even when config.unicode was true (the default), but only performed ASCII case folding (k → [kK]). The standard regex translator applies full Unicode case folding (k also matches K U+212A), so the fast path could produce false negatives.

Fix: reject patterns with banned bytes in the gate function, and only enable the fast path when unicode mode is disabled.
benchsuite/bench.sh: Self-contained script that builds both upstream and fork binaries, downloads the Linux kernel corpus if needed, and runs 7 interleaved A/B benchmarks via hyperfine with --export-json. Also measures peak RSS and computes throughput.

benchsuite/results/: JSON results from the initial benchmark run (hyperfine 1.20.0, 15 runs, 5 warmups) on Apple M5 with 24 GB unified memory, macOS 15.x.
fix: guard ASCII case-insensitive fast path against ignore_whitespace mode

When `ignore_whitespace=true` (regex verbose/`x` mode), the standard parser treats spaces as insignificant and `#` as a comment marker. The ASCII case-insensitive fast path bypasses the parser entirely, so it would incorrectly treat spaces and `#` as literal characters, producing wrong match semantics. Bail out of the fast path when verbose mode is on.
…lter
* regex: add ASCII case-insensitive literal prefilter bypass
* fix: guard ASCII case-insensitive fast path against banned bytes and unicode mode
* fix: guard ASCII case-insensitive fast path against ignore_whitespace mode
---------
Co-authored-by: devin-ai-integration[bot] <158243242+devin-ai-integration[bot]@users.noreply.github.com>
The prefilter was gated behind `if self.unicode { return false; }`,
meaning it never activated in the default unicode=true mode. Only 4
ASCII bytes (K, k, S, s) have non-ASCII Unicode case folds, so replace
the blanket guard with a per-byte check. Patterns without those bytes
now get the SIMD literal prefilter with -i, recovering a ~1.75x penalty.
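The per-byte gate described above can be sketched in Python (hypothetical helper names; the real check lives in the Rust gate function):

```python
# The only ASCII bytes whose Unicode simple case folds reach outside ASCII:
# k/K fold with U+212A (KELVIN SIGN), s/S with U+017F (LATIN SMALL LETTER LONG S).
NON_ASCII_FOLDS = frozenset(b"KkSs")

def ascii_fold_is_exact(pattern: bytes) -> bool:
    """True when ASCII-only case folding of `pattern` agrees with full
    Unicode case folding, so the fast path is safe even with unicode=true."""
    return pattern.isascii() and not any(b in NON_ASCII_FOLDS for b in pattern)

assert ascii_fold_is_exact(b"error")
assert not ascii_fold_is_exact(b"kelvin")  # 'k' also folds to U+212A
```

Rejecting only patterns containing those four bytes, rather than all unicode-mode patterns, is what re-enables the SIMD literal prefilter for most `-i` searches.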
Add reproducible A/B benchmark script for Apple Silicon
regex: allow ASCII case-insensitive prefilter in unicode mode
Remove dead taskpolicy code and an unreachable cold-cache check. Move correctness verification outside the measurement loop to avoid perturbing cache/thermal state. Fix the CSV iter column (was constant, now the sample index) and the lines column (was an out-of-bounds index, now uses the single expected value). Fix the power calculation to use means instead of medians for Cohen's d, and round the U statistic instead of truncating it. Add a --seed flag for reproducible benchmark ordering, baseline comparison warnings, a CPU isolation note for Apple Silicon, and adaptive thread counts based on P-core topology.
fix: address 10 QA issues in arm_bench.py
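One of the fixes above, computing Cohen's d from group means with a pooled standard deviation, can be sketched as (a minimal stdlib version; the harness's actual function may differ in naming and edge-case handling):

```python
import statistics

def cohens_d(a, b):
    """Effect size from group means (not medians) using pooled sample variance."""
    na, nb = len(a), len(b)
    pooled_sd = (((na - 1) * statistics.variance(a) +
                  (nb - 1) * statistics.variance(b)) / (na + nb - 2)) ** 0.5
    return (statistics.mean(a) - statistics.mean(b)) / pooled_sd

assert cohens_d([1, 2, 3], [1, 2, 3]) == 0.0
```

Using medians here would understate or overstate the standardized difference whenever the timing distributions are skewed, which benchmark samples usually are.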
The existing `benchsuite/benchsuite --download linux` command tries to build the Linux kernel after cloning, which fails on macOS due to incompatible linker/toolchain. Since arm_bench.py only needs the source tree for search benchmarks (not a built kernel), this adds a `--download` flag directly to arm_bench.py that clones without building. Supports: --download linux, --download subtitles-en, --download all. Idempotent -- skips corpora that are already present.
fix: add --download to arm_bench.py for macOS corpus setup
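The idempotent clone-without-build behavior can be sketched like this (hypothetical helper and illustrative URL mapping only, not the harness's actual code):

```python
from pathlib import Path
import subprocess

# Illustrative corpus registry; the real harness defines its own sources.
CORPORA = {"linux": "https://github.com/torvalds/linux"}

def download(name: str, dest_root: Path) -> Path:
    """Clone a corpus source tree (no build step); skip if already present."""
    dest = dest_root / name
    if dest.exists():
        return dest  # idempotent: corpus already downloaded
    subprocess.run(
        ["git", "clone", "--depth", "1", CORPORA[name], str(dest)],
        check=True,
    )
    return dest
```

Cloning with `--depth 1` and never invoking the kernel build avoids the macOS toolchain failure described above, since the search benchmarks only need the source tree.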
bench: tighten arm benchmark reliability controls
Record page faults
Add p-core contention scenario
Summary
- `arm_bench.py` CSV/JSON output so each sample explains thread, mmap, and search-strategy choices
- `xctrace` profiling support with per-sample trace/export/summary artifacts and CLI controls for scenario/sample selection
- `xctrace` runs that return non-zero treated as usable artifacts instead of false warnings

Details
This updates the Apple Silicon benchmark harness so its output is self-explaining and profiling can be captured as part of the normal benchmark flow.
Decision metadata is emitted on each sample row and in JSON config/sample records, including fields like `threads_selected`, detected Apple P-core count, `auto_mmap_enabled`, `multiline_with_matcher`, and `search_strategy`.

Profiling is exposed through:
- `--profile time-profiler|system-trace|poi`
- `--profile-scenarios ...`
- `--profile-samples N`
- `--profile-on-best-delta`

The harness writes `.trace` bundles plus compact JSON summaries and XML exports alongside benchmark output.

I also verified an `xcrun xctrace` CLI quirk on Xcode 26 where successful non-interactive recordings can return exit code 54 while still printing "Recording completed" and producing a valid trace. The harness now records stdout/stderr for those runs and treats that specific shape as success instead of a profiling warning.

Testing
- `PYTHONPYCACHEPREFIX=/tmp/pycache python3 -m py_compile benchsuite/arm_bench.py benchsuite/test_arm_bench.py`
- `PYTHONPYCACHEPREFIX=/tmp/pycache python3 -m unittest benchsuite.test_arm_bench`
- `directory_io` benchmark run with `--profile time-profiler`
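The exit-code-54 quirk handling described in the summary can be sketched as follows (hypothetical wrapper; the harness additionally persists stdout/stderr alongside the trace artifacts):

```python
import subprocess

def run_xctrace(cmd):
    """Run an xctrace-style command, treating exit code 54 accompanied by
    'Recording completed' as success (observed Xcode 26 non-interactive quirk)."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    output = proc.stdout + proc.stderr
    ok = proc.returncode == 0 or (
        proc.returncode == 54 and "Recording completed" in output
    )
    return ok, proc
```

Keying on the exact (exit code, message) shape rather than ignoring all non-zero exits keeps genuinely failed recordings reported as warnings.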