benchsuite: add xctrace profiling and decision metadata#3298
Closed
john7rho wants to merge 50 commits into BurntSushi:master from
Conversation
Three targeted performance improvements:

1. Multiline reader buffer (searcher/mod.rs): Pre-allocate capacity up to the heap limit once upfront instead of repeatedly doubling via resize. This eliminates O(n) reallocation copies during buffer growth, since subsequent resize calls only zero-fill within existing capacity.
2. Multiline per-match printer (printer/standard.rs): Precompute line boundaries once before iterating matches, replacing the O(N*M) pattern of re-walking all lines from byte 0 for each match with an O(N+M) indexed lookup.
3. First-match-only fast path (printer/standard.rs): When only --column is needed (no coloring, replacement, per-match, only-matching, or stats), stop after finding the first match instead of materializing all matches into the Vec<Match>. This avoids unnecessary regex work in column-only mode with dense matches.

https://claude.ai/code/session_01Qcwhw3SJxupm2cnP3GPuhv
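Improvement 2 above, precomputing line boundaries once and doing an indexed lookup per match, can be sketched in Python as a language-neutral illustration of the technique (`line_starts` and `line_number_of` are hypothetical names, not ripgrep APIs):

```python
import bisect

def line_starts(haystack: bytes) -> list[int]:
    """Precompute the byte offset of every line start, once per buffer."""
    starts = [0]
    for i, b in enumerate(haystack):
        if b == 0x0A:  # b'\n'
            starts.append(i + 1)
    return starts

def line_number_of(starts: list[int], offset: int) -> int:
    """O(log N) lookup per match instead of re-walking from byte 0."""
    return bisect.bisect_right(starts, offset)  # 1-based line number

buf = b"alpha\nbeta\ngamma\n"
starts = line_starts(buf)
assert line_number_of(starts, 6) == 2  # byte 6 starts "beta", line 2
```

With M matches over N bytes this replaces O(N*M) rescanning with one O(N) pass plus O(M log N) lookups.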
This reverts commit 37b28db.
perf: pass match positions from searcher to printer, eliminating redundant regex re-search

The multiline printer was re-executing the regex via find_iter_at_in_context() to rediscover individual match positions that the searcher had already found. This was ~38% of total runtime for match-heavy multiline searches.

Now the MultiLine searcher accumulates raw match positions as it groups adjacent matches, and passes them to the printer via SinkMatch. The printer uses these positions directly when available, falling back to re-searching only for line-by-line mode (where the searcher doesn't provide positions).

Benchmarks (166K multiline matches):
- Colored output: 1.63x faster (122ms → 75ms)
- Vimgrep mode: 1.63x faster (119ms → 73ms)
- Only-matching: 2.05x faster (142ms → 69ms)
- No regression on single-line or no-match cases.

https://claude.ai/code/session_01Qcwhw3SJxupm2cnP3GPuhv
Updated description to specify optimization for ARM architecture.
Added a note about the development team behind blitzgrep.
The macOS mmap path was unconditionally disabled based on benchmarks from the pre-Apple Silicon era (2016-2022). On modern Apple Silicon (M-series), mmap on warm cache eliminates the read_to_end overhead that dominated multiline search time.

Measured on M5, 92MB file, warm page cache:
- Multiline sparse (1 match): 16.0ms -> 8.8ms (1.82x faster)
- Multiline dense (2M matches): 68ms -> 60ms (1.13x faster)
- Multiline now at parity with line-by-line for sparse matches

434 tests pass, 0 failures.
searcher: re-enable mmap on macOS for multiline performance
Clarified description of blitzgrep as a drop-in replacement.
The default thread heuristic min(available_parallelism, 12) resolves to 10 on an M5, which over-subscribes for I/O-bound directory searches. Apple Silicon's asymmetric P/E-core architecture means extra threads beyond the P-core count add kernel contention without throughput gain.

Benchmarks on /usr/share (large directory, warm cache, M5):
- 4 threads: 160ms wall, 486ms sys (optimal)
- 6 threads: 163ms wall, 749ms sys (new default)
- 10 threads: 224ms wall, 1091ms sys (old default, 37% slower)

Cap at 6 via #[cfg(all(target_os = "macos", target_arch = "aarch64"))] to provide headroom for larger machines while avoiding the contention observed at 10+ threads. Other platforms keep the existing cap of 12.

320 tests pass, 0 failures.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
core: lower default thread cap to 6 on macOS Apple Silicon
taskpolicy -c only accepts utility/background/maintenance, not user-interactive. The invalid argument caused taskpolicy to exit immediately with code 64, so all benchmarks measured ~1.8ms error time instead of actual rg performance (~0.15-2.1s). Disable taskpolicy wrapping since default QoS already schedules on P-cores for interactive workloads.
Benchmarks on M4 (4P+6E) show j4 is optimal for directory searches. Going beyond 4 threads spills onto E-cores, nearly doubling wall time:
- directory_io rare: 1.01s → 0.59s (-42%)
- directory_io common: 1.02s → 0.75s (-27%)
- directory_io no-match: 1.02s → 0.59s (-42%)

No regressions in single-file or multiline scenarios. Users on larger chips (M4 Pro/Max) can override with --threads N.
Replace hardcoded thread cap of 4 with runtime detection via
sysctlbyname("hw.perflevel0.logicalcpu"). This adapts the thread
count to the actual hardware: M4 (4P), M4 Pro (10P), M4 Max (12P),
M3 Ultra (16P), etc.
Falls back to 4 (minimum P-core count across all Apple Silicon)
if the sysctl call fails.
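The Rust code calls sysctlbyname() directly; the same detection-with-fallback logic can be sketched in Python by shelling out to sysctl (`perf_core_count` is a hypothetical helper for illustration):

```python
import subprocess

def perf_core_count(default: int = 4) -> int:
    """Number of performance cores on Apple Silicon, via the
    hw.perflevel0.logicalcpu sysctl; falls back to 4 (the minimum
    P-core count across Apple Silicon) if the query fails."""
    try:
        proc = subprocess.run(
            ["sysctl", "-n", "hw.perflevel0.logicalcpu"],
            capture_output=True, text=True, timeout=5,
        )
        if proc.returncode == 0:
            return int(proc.stdout.strip())
    except (OSError, ValueError, subprocess.TimeoutExpired):
        pass
    return default
```

On non-Apple hardware the sysctl key does not exist, so the helper returns the fallback, mirroring the behavior described above when the sysctl call fails.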
Add/arm bench
fix: use separate Vec for adjusted match_ranges to prevent in-place corruption

The previous code mutated self.match_ranges in-place with saturating_sub to rebase ranges relative to the current line. This destructive mutation corrupted the ranges for any subsequent sink_matched call (e.g., when after_context_by_line triggers a sink_matched before the intended one). This fix uses a separate adjusted_match_ranges Vec so the original ranges are never modified, and clears match_ranges after use to prevent stale data from being reused.
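The shape of the fix, building an adjusted copy rather than rebasing the shared ranges in place, can be shown with a small Python sketch (`rebase_ranges` is a hypothetical name; the real code operates on a Rust Vec of Match values):

```python
def rebase_ranges(match_ranges, line_start):
    """Return rebased copies of (start, end) ranges relative to line_start.

    Never mutates the caller's list, so a later sink call that needs the
    absolute ranges still sees them intact.
    """
    return [(max(s - line_start, 0), max(e - line_start, 0))
            for s, e in match_ranges]

ranges = [(10, 14), (20, 25)]
adjusted = rebase_ranges(ranges, 10)
assert adjusted == [(0, 4), (10, 15)]
assert ranges == [(10, 14), (20, 25)]  # originals untouched
```

The in-place version would have left `ranges` holding line-relative values, which is exactly the stale-data bug the commit describes.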
Same fix as the non-adjacent match case: defer set_match_ranges until after sink_context completes, so that after_context_by_line cannot consume the ranges meant for the actual match.
…ance-pRYtS
* perf: optimize multiline search and printing paths
* Revert "perf: optimize multiline search and printing paths" (reverts commit 37b28db)
* perf: pass match positions from searcher to printer, eliminating redundant regex re-search
* fix: use separate Vec for adjusted match_ranges to prevent in-place corruption
* fix: set match_ranges after sink_context in MultiLine::run() final flush
---------
Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: devin-ai-integration[bot] <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Updated the attribution of the development team to include a hyperlink.
Updated the link to Byzantine Labs in the README.
searcher: skip unused multiline match range bookkeeping
On Apple Silicon, the SliceByLine strategy (used when mmap is enabled for single-line patterns) faults in the entire file, while the streaming ReadByLine path keeps the working set in L1/L2 cache via a 64KB buffer. This yields ~35% faster single-file searches with --mmap. Multiline searches still use mmap where it avoids a heap copy. Non-Apple-Silicon platforms are completely unaffected.

Benchmarked on M4 (independently verified):
- mmap_vs_read --mmap: -34.5% to -35.6% (p < 0.0001)
- mmap_multiline literal --mmap: -34.5% to -35.9% (p < 0.0001)
- No regressions in any scenario
searcher: skip mmap for single-line searches on Apple Silicon
searcher: retune auto-mmap and inline multiline ranges
When case_insensitive is enabled and all patterns are pure ASCII literals (no metacharacters), bypass the full regex parse/translate path and directly emit HIR character classes (e.g. `a` -> `[aA]`). This lets the regex engine extract literals for its prefilter, recovering the SIMD memchr fast path that is currently lost when `-i` is used.

Previously, case-insensitive patterns went through the generic regex parser, which produced opaque HIR that blocked literal extraction, causing a 1.4x slowdown vs case-sensitive.

Also removes dead commented-out mmap code from earlier evaluation.
fix: guard ASCII case-insensitive fast path against banned bytes and unicode mode

The new ASCII case-insensitive literal fast path had two bugs:
1. It skipped ban::check, allowing patterns with banned bytes (e.g. NUL) to bypass the ban validation that the normal regex path enforces.
2. It was enabled even when config.unicode was true (the default), but only performed ASCII case folding (k → [kK]). The standard regex translator applies full Unicode case folding (k also matches K U+212A), so the fast path could produce false negatives.

Fix: reject patterns with banned bytes in the gate function, and only enable the fast path when unicode mode is disabled.
benchsuite/bench.sh: Self-contained script that builds both upstream and fork binaries, downloads the Linux kernel corpus if needed, and runs 7 interleaved A/B benchmarks via hyperfine with --export-json. Also measures peak RSS and computes throughput.

benchsuite/results/: JSON results from the initial benchmark run (hyperfine 1.20.0, 15 runs, 5 warmups) on Apple M5 with 24 GB unified memory, macOS 15.x.
fix: guard ASCII case-insensitive fast path against ignore_whitespace mode

When `ignore_whitespace=true` (regex verbose/`x` mode), the standard parser treats spaces as insignificant and `#` as a comment marker. The ASCII case-insensitive fast path bypasses the parser entirely, so it would incorrectly treat spaces and `#` as literal characters, producing wrong match semantics. Bail out of the fast path when verbose mode is on.
…lter
* regex: add ASCII case-insensitive literal prefilter bypass
* fix: guard ASCII case-insensitive fast path against banned bytes and unicode mode
* fix: guard ASCII case-insensitive fast path against ignore_whitespace mode
---------
Co-authored-by: devin-ai-integration[bot] <158243242+devin-ai-integration[bot]@users.noreply.github.com>
The prefilter was gated behind `if self.unicode { return false; }`,
meaning it never activated in the default unicode=true mode. Only 4
ASCII bytes (K, k, S, s) have non-ASCII Unicode case folds, so replace
the blanket guard with a per-byte check. Patterns without those bytes
now get the SIMD literal prefilter with -i, recovering a ~1.75x penalty.
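The per-byte gate described above can be sketched in Python (hypothetical helper names; the real check lives in the Rust gate function):

```python
# The only ASCII bytes whose Unicode simple case folds reach outside ASCII:
# k/K fold with U+212A (KELVIN SIGN), s/S with U+017F (LATIN SMALL LETTER LONG S).
NON_ASCII_FOLDS = frozenset(b"KkSs")

def ascii_fold_is_exact(pattern: bytes) -> bool:
    """True when ASCII-only case folding of `pattern` agrees with full
    Unicode case folding, so the fast path is safe even with unicode=true."""
    return pattern.isascii() and not any(b in NON_ASCII_FOLDS for b in pattern)

assert ascii_fold_is_exact(b"error")
assert not ascii_fold_is_exact(b"kelvin")  # 'k' also folds to U+212A
```

Rejecting only patterns containing those four bytes, rather than all unicode-mode patterns, is what re-enables the SIMD literal prefilter for most `-i` searches.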
Add reproducible A/B benchmark script for Apple Silicon
regex: allow ASCII case-insensitive prefilter in unicode mode
Remove dead taskpolicy code and an unreachable cold-cache check. Move correctness verification outside the measurement loop to avoid perturbing cache/thermal state. Fix the CSV iter column (was constant, now the sample index) and the lines column (was an out-of-bounds index, now uses the single expected value). Fix the power calculation to use means instead of medians for Cohen's d, and round the U statistic instead of truncating it. Add a --seed flag for reproducible benchmark ordering, baseline comparison warnings, a CPU isolation note for Apple Silicon, and adaptive thread counts based on P-core topology.
fix: address 10 QA issues in arm_bench.py
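One of the fixes above, computing Cohen's d from group means with a pooled standard deviation, can be sketched as (a minimal stdlib version; the harness's actual function may differ in naming and edge-case handling):

```python
import statistics

def cohens_d(a, b):
    """Effect size from group means (not medians) using pooled sample variance."""
    na, nb = len(a), len(b)
    pooled_sd = (((na - 1) * statistics.variance(a) +
                  (nb - 1) * statistics.variance(b)) / (na + nb - 2)) ** 0.5
    return (statistics.mean(a) - statistics.mean(b)) / pooled_sd

assert cohens_d([1, 2, 3], [1, 2, 3]) == 0.0
```

Using medians here would understate or overstate the standardized difference whenever the timing distributions are skewed, which benchmark samples usually are.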
The existing `benchsuite/benchsuite --download linux` command tries to build the Linux kernel after cloning, which fails on macOS due to incompatible linker/toolchain. Since arm_bench.py only needs the source tree for search benchmarks (not a built kernel), this adds a `--download` flag directly to arm_bench.py that clones without building. Supports: --download linux, --download subtitles-en, --download all. Idempotent -- skips corpora that are already present.
fix: add --download to arm_bench.py for macOS corpus setup
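The idempotent clone-without-build behavior can be sketched like this (hypothetical helper and illustrative URL mapping only, not the harness's actual code):

```python
from pathlib import Path
import subprocess

# Illustrative corpus registry; the real harness defines its own sources.
CORPORA = {"linux": "https://github.com/torvalds/linux"}

def download(name: str, dest_root: Path) -> Path:
    """Clone a corpus source tree (no build step); skip if already present."""
    dest = dest_root / name
    if dest.exists():
        return dest  # idempotent: corpus already downloaded
    subprocess.run(
        ["git", "clone", "--depth", "1", CORPORA[name], str(dest)],
        check=True,
    )
    return dest
```

Cloning with `--depth 1` and never invoking the kernel build avoids the macOS toolchain failure described above, since the search benchmarks only need the source tree.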
bench: tighten arm benchmark reliability controls
Record page faults
Add p-core contention scenario
Summary
- `arm_bench.py` CSV/JSON output so each sample explains thread, mmap, and search-strategy choices
- `xctrace` profiling support with per-sample trace/export/summary artifacts and CLI controls for scenario/sample selection
- `xctrace` runs that return non-zero treated as usable artifacts instead of false warnings

Details
This updates the Apple Silicon benchmark harness so its output is self-explaining and profiling can be captured as part of the normal benchmark flow.
Decision metadata is emitted on each sample row and in JSON config/sample records, including fields like `threads_selected`, detected Apple P-core count, `auto_mmap_enabled`, `multiline_with_matcher`, and `search_strategy`.

Profiling is exposed through:
- `--profile time-profiler|system-trace|poi`
- `--profile-scenarios ...`
- `--profile-samples N`
- `--profile-on-best-delta`

The harness writes `.trace` bundles plus compact JSON summaries and XML exports alongside benchmark output.

I also verified an `xcrun xctrace` CLI quirk on Xcode 26 where successful non-interactive recordings can return exit code 54 while still printing "Recording completed" and producing a valid trace. The harness now records stdout/stderr for those runs and treats that specific shape as success instead of a profiling warning.

Testing
- `PYTHONPYCACHEPREFIX=/tmp/pycache python3 -m py_compile benchsuite/arm_bench.py benchsuite/test_arm_bench.py`
- `PYTHONPYCACHEPREFIX=/tmp/pycache python3 -m unittest benchsuite.test_arm_bench`
- `directory_io` benchmark run with `--profile time-profiler`
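The exit-code-54 quirk handling described in the summary can be sketched as follows (hypothetical wrapper; the harness additionally persists stdout/stderr alongside the trace artifacts):

```python
import subprocess

def run_xctrace(cmd):
    """Run an xctrace-style command, treating exit code 54 accompanied by
    'Recording completed' as success (observed Xcode 26 non-interactive quirk)."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    output = proc.stdout + proc.stderr
    ok = proc.returncode == 0 or (
        proc.returncode == 54 and "Recording completed" in output
    )
    return ok, proc
```

Keying on the exact (exit code, message) shape rather than ignoring all non-zero exits keeps genuinely failed recordings reported as warnings.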