
Record/page faults #3294

Closed
andrewschoi wants to merge 46 commits into BurntSushi:master from john7rho:record/page-faults

Conversation

@andrewschoi

No description provided.

claude and others added 30 commits March 7, 2026 14:33
Three targeted performance improvements:

1. Multiline reader buffer (searcher/mod.rs): Pre-allocate capacity up
   to the heap limit once upfront instead of repeatedly doubling via
   resize. This eliminates O(n) reallocation copies during buffer growth
   since subsequent resize calls only zero-fill within existing capacity.

2. Multiline per-match printer (printer/standard.rs): Precompute line
   boundaries once before iterating matches, replacing the O(N*M)
   pattern of re-walking all lines from byte 0 for each match with
   an O(N+M) indexed lookup.

3. First-match-only fast path (printer/standard.rs): When only --column
   is needed (no coloring, replacement, per-match, only-matching, or
   stats), stop after finding the first match instead of materializing
   all matches into the Vec<Match>. This avoids unnecessary regex work
   in column-only mode with dense matches.

https://claude.ai/code/session_01Qcwhw3SJxupm2cnP3GPuhv
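Improvement 2 above can be sketched as follows. The function names are illustrative, not the actual ripgrep internals, and this variant maps each match to its line with a binary search (O(N + M log N)) rather than the fully indexed O(N+M) walk:

```rust
// Precompute line-start offsets once per buffer, then look up the line
// containing each match offset, instead of re-walking all lines from
// byte 0 for every match.
fn line_starts(buf: &[u8]) -> Vec<usize> {
    let mut starts = vec![0];
    for (i, &b) in buf.iter().enumerate() {
        if b == b'\n' {
            starts.push(i + 1);
        }
    }
    starts
}

/// Returns the 1-based line number containing byte `offset`.
fn line_of(starts: &[usize], offset: usize) -> usize {
    match starts.binary_search(&offset) {
        Ok(i) => i + 1,  // offset is exactly at the start of line i+1
        Err(i) => i,     // offset falls inside line i (1-based)
    }
}

fn main() {
    let buf = b"foo\nbar\nbaz\n";
    let starts = line_starts(buf);
    assert_eq!(line_of(&starts, 0), 1); // 'f' in "foo"
    assert_eq!(line_of(&starts, 5), 2); // 'a' in "bar"
    assert_eq!(line_of(&starts, 8), 3); // 'b' in "baz"
    println!("ok");
}
```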
perf: pass match positions from searcher to printer, eliminating redundant regex re-search

The multiline printer was re-executing the regex via find_iter_at_in_context()
to rediscover individual match positions that the searcher had already found.
This was ~38% of total runtime for match-heavy multiline searches.

Now the MultiLine searcher accumulates raw match positions as it groups
adjacent matches, and passes them to the printer via SinkMatch. The printer
uses these positions directly when available, falling back to re-searching
only for line-by-line mode (where the searcher doesn't provide positions).

Benchmarks (166K multiline matches):
- Colored output: 1.63x faster (122ms → 75ms)
- Vimgrep mode:   1.63x faster (119ms → 73ms)
- Only-matching:  2.05x faster (142ms → 69ms)
- No regression on single-line or no-match cases.

https://claude.ai/code/session_01Qcwhw3SJxupm2cnP3GPuhv
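A minimal sketch of the hand-off described above, with hypothetical types standing in for the real SinkMatch plumbing:

```rust
use std::ops::Range;

// Hypothetical stand-in for SinkMatch: the searcher records raw match
// positions while grouping adjacent matches, and the printer consumes
// them instead of re-running the regex over the matched bytes.
struct SinkMatchLike<'a> {
    bytes: &'a [u8],
    // Byte ranges found by the searcher; None in line-by-line mode,
    // where the printer falls back to re-searching.
    match_positions: Option<&'a [Range<usize>]>,
}

fn match_count(sink: &SinkMatchLike<'_>, re_search: impl Fn(&[u8]) -> usize) -> usize {
    match sink.match_positions {
        Some(positions) => positions.len(), // no redundant regex work
        None => re_search(sink.bytes),      // line-by-line fallback
    }
}

fn main() {
    let ranges = [0..3, 8..11];
    let sink = SinkMatchLike { bytes: b"foo bar foo", match_positions: Some(&ranges) };
    // The "re-search" closure is never invoked when positions are present.
    assert_eq!(match_count(&sink, |_| unreachable!()), 2);
    println!("ok");
}
```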
Updated description to specify optimization for ARM architecture.
Added a note about the development team behind blitzgrep.
The macOS mmap path was unconditionally disabled based on benchmarks
from the pre-Apple Silicon era (2016-2022). On modern Apple Silicon
(M-series), mmap on warm cache eliminates the read_to_end overhead
that dominated multiline search time.

Measured on M5, 92MB file, warm page cache:
- Multiline sparse (1 match): 16.0ms -> 8.8ms (1.82x faster)
- Multiline dense (2M matches): 68ms -> 60ms (1.13x faster)
- Multiline now at parity with line-by-line for sparse matches

434 tests pass, 0 failures.
searcher: re-enable mmap on macOS for multiline performance
Clarified description of blitzgrep as a drop-in replacement.
The default thread heuristic min(available_parallelism, 12) resolves
to 10 on an M5, which over-subscribes for I/O-bound directory searches.
Apple Silicon's asymmetric P/E-core architecture means extra threads
beyond the P-core count add kernel contention without throughput gain.

Benchmarks on /usr/share (large directory, warm cache, M5):
- 4 threads:  160ms wall, 486ms sys (optimal)
- 6 threads:  163ms wall, 749ms sys (new default)
- 10 threads: 224ms wall, 1091ms sys (old default, 37% slower)

Cap at 6 via #[cfg(all(target_os = "macos", target_arch = "aarch64"))]
to provide headroom for larger machines while avoiding the contention
observed at 10+ threads. Other platforms keep the existing cap of 12.

320 tests pass, 0 failures.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
core: lower default thread cap to 6 on macOS Apple Silicon
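The cfg-gated cap described above might look like this sketch; the constant and function names are assumptions, not the real ripgrep identifiers:

```rust
use std::thread;

// Platform-conditional default thread cap: 6 on macOS Apple Silicon to
// avoid E-core/kernel contention, 12 everywhere else (the existing cap).
#[cfg(all(target_os = "macos", target_arch = "aarch64"))]
const MAX_DEFAULT_THREADS: usize = 6;

#[cfg(not(all(target_os = "macos", target_arch = "aarch64")))]
const MAX_DEFAULT_THREADS: usize = 12;

fn default_thread_count() -> usize {
    let available = thread::available_parallelism().map_or(1, |n| n.get());
    available.min(MAX_DEFAULT_THREADS)
}

fn main() {
    println!("default threads: {}", default_thread_count());
}
```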
taskpolicy -c only accepts utility/background/maintenance, not
user-interactive. The invalid argument caused taskpolicy to exit
immediately with code 64, so all benchmarks measured ~1.8ms error
time instead of actual rg performance (~0.15-2.1s).

Disable taskpolicy wrapping since default QoS already schedules
on P-cores for interactive workloads.
Benchmarks on M4 (4P+6E) show j4 is optimal for directory searches.
Going beyond 4 threads spills onto E-cores, nearly doubling wall time:
- directory_io rare: 1.01s → 0.59s (-42%)
- directory_io common: 1.02s → 0.75s (-27%)
- directory_io no-match: 1.02s → 0.59s (-42%)

No regressions in single-file or multiline scenarios. Users on larger
chips (M4 Pro/Max) can override with --threads N.
Replace hardcoded thread cap of 4 with runtime detection via
sysctlbyname("hw.perflevel0.logicalcpu"). This adapts the thread
count to the actual hardware: M4 (4P), M4 Pro (10P), M4 Max (12P),
M3 Ultra (16P), etc.

Falls back to 4 (minimum P-core count across all Apple Silicon)
if the sysctl call fails.
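A sketch of the runtime query with its fallback, under illustrative names; the raw FFI declaration mirrors the C signature of `sysctlbyname(3)`, and the non-macOS branch simply reports no P-core information:

```rust
// Query the performance-core count on Apple Silicon, falling back to 4
// (the minimum P-core count across Apple Silicon) if the call fails or
// the platform is not macOS/aarch64.
#[cfg(all(target_os = "macos", target_arch = "aarch64"))]
fn perf_core_count() -> Option<usize> {
    use std::ffi::c_void;
    extern "C" {
        fn sysctlbyname(
            name: *const u8,
            oldp: *mut c_void,
            oldlenp: *mut usize,
            newp: *mut c_void,
            newlen: usize,
        ) -> i32;
    }
    let mut count: u32 = 0;
    let mut len = std::mem::size_of::<u32>();
    let name = b"hw.perflevel0.logicalcpu\0";
    let rc = unsafe {
        sysctlbyname(
            name.as_ptr(),
            &mut count as *mut u32 as *mut c_void,
            &mut len,
            std::ptr::null_mut(),
            0,
        )
    };
    if rc == 0 && count > 0 { Some(count as usize) } else { None }
}

#[cfg(not(all(target_os = "macos", target_arch = "aarch64")))]
fn perf_core_count() -> Option<usize> {
    None
}

fn default_cap() -> usize {
    perf_core_count().unwrap_or(4)
}

fn main() {
    println!("thread cap: {}", default_cap());
}
```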
fix: use separate Vec for adjusted match_ranges to prevent in-place corruption

The previous code mutated self.match_ranges in-place with saturating_sub
to rebase ranges relative to the current line. This destructive mutation
corrupted the ranges for any subsequent sink_matched call (e.g., when
after_context_by_line triggers a sink_matched before the intended one).

This fix uses a separate adjusted_match_ranges Vec so the original ranges
are never modified, and clears match_ranges after use to prevent stale
data from being reused.
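The fix reduces to rebasing into a scratch Vec instead of mutating the stored ranges in place; a minimal sketch with illustrative names:

```rust
use std::ops::Range;

// Rebase absolute match ranges relative to `line_start` into `scratch`,
// leaving `match_ranges` untouched so a later sink_matched call (e.g.
// one triggered by after_context_by_line) still sees valid ranges.
fn rebase_ranges(
    match_ranges: &[Range<usize>],
    line_start: usize,
    scratch: &mut Vec<Range<usize>>,
) {
    scratch.clear();
    scratch.extend(match_ranges.iter().map(|r| {
        r.start.saturating_sub(line_start)..r.end.saturating_sub(line_start)
    }));
}

fn main() {
    let original = vec![10..14, 20..24];
    let mut adjusted = Vec::new();
    rebase_ranges(&original, 8, &mut adjusted);
    assert_eq!(adjusted, vec![2..6, 12..16]);
    // The original ranges survive for any subsequent call.
    assert_eq!(original, vec![10..14, 20..24]);
    println!("ok");
}
```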
Same fix as the non-adjacent match case: defer set_match_ranges until
after sink_context completes, so that after_context_by_line cannot
consume the ranges meant for the actual match.
…ance-pRYtS

* perf: optimize multiline search and printing paths

* Revert "perf: optimize multiline search and printing paths"

This reverts commit 37b28db.

* perf: pass match positions from searcher to printer, eliminating redundant regex re-search

* fix: use separate Vec for adjusted match_ranges to prevent in-place corruption

* fix: set match_ranges after sink_context in MultiLine::run() final flush

---------

Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: devin-ai-integration[bot] <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Updated the attribution of the development team to include a hyperlink.
Updated the link to Byzantine Labs in the README.
…ranges

searcher: skip unused multiline match range bookkeeping
On Apple Silicon, the SliceByLine strategy (used when mmap is enabled
for single-line patterns) faults in the entire file, while the streaming
ReadByLine path keeps the working set in L1/L2 cache via a 64KB buffer.
This yields ~35% faster single-file searches with --mmap.

Multiline searches still use mmap where it avoids a heap copy.
Non-Apple-Silicon platforms are completely unaffected.

Benchmarked on M4 (independently verified):
- mmap_vs_read --mmap: -34.5% to -35.6% (p < 0.0001)
- mmap_multiline literal --mmap: -34.5% to -35.9% (p < 0.0001)
- No regressions in any scenario
searcher: skip mmap for single-line searches on Apple Silicon
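Taken together with the earlier macOS mmap re-enable, the resulting Apple Silicon policy sketches as follows; the names are illustrative and the non-Apple-Silicon branch is a placeholder for each platform's existing heuristic:

```rust
// On macOS/aarch64: use mmap only for multiline searches (where it avoids
// a read_to_end heap copy); single-line searches stream through a small
// buffer that keeps the working set in L1/L2 cache. Other platforms are
// unaffected.
fn use_mmap(multiline: bool) -> bool {
    if cfg!(all(target_os = "macos", target_arch = "aarch64")) {
        multiline
    } else {
        default_mmap_policy(multiline)
    }
}

// Placeholder for the pre-existing per-platform heuristic.
fn default_mmap_policy(_multiline: bool) -> bool {
    false
}

fn main() {
    println!("mmap for multiline: {}", use_mmap(true));
    println!("mmap for single-line: {}", use_mmap(false));
}
```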
…anges

searcher: retune auto-mmap and inline multiline ranges
john7rho and others added 16 commits March 8, 2026 22:39
When case_insensitive is enabled and all patterns are pure ASCII
literals (no metacharacters), bypass the full regex parse/translate
path and directly emit HIR character classes (e.g. `a` -> `[aA]`).
This lets the regex engine extract literals for its prefilter,
recovering the SIMD memchr fast path that is currently lost when
`-i` is used. Previously, case-insensitive patterns went through
the generic regex parser which produced opaque HIR that blocked
literal extraction, causing a 1.4x slowdown vs case-sensitive.

Also removes dead commented-out mmap code from earlier evaluation.
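The `a` -> `[aA]` expansion can be illustrated with a stdlib-only sketch that emits a pattern string; the real fast path builds regex-syntax HIR values directly rather than going back through a parser:

```rust
// Expand an ASCII alphanumeric literal into per-byte character classes.
// Returns None for anything the fast path must not handle (non-ASCII
// input, metacharacters, etc.), which falls back to the normal regex path.
fn ascii_ci_class(literal: &str) -> Option<String> {
    let mut out = String::new();
    for c in literal.chars() {
        if !c.is_ascii() {
            return None; // leave to the full Unicode translator
        }
        if c.is_ascii_alphabetic() {
            out.push('[');
            out.push(c.to_ascii_lowercase());
            out.push(c.to_ascii_uppercase());
            out.push(']');
        } else if c.is_ascii_alphanumeric() {
            out.push(c); // digits have no case fold
        } else {
            return None; // possible metacharacter: use the normal path
        }
    }
    Some(out)
}

fn main() {
    assert_eq!(ascii_ci_class("ab3").as_deref(), Some("[aA][bB]3"));
    assert_eq!(ascii_ci_class("a.b"), None);
    println!("ok");
}
```

Because the emitted classes contain plain literals, the regex engine's literal extractor can still derive a prefilter from them, which is the point of the bypass.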
fix: guard ASCII case-insensitive fast path against banned bytes and unicode mode

The new ASCII case-insensitive literal fast path had two bugs:

1. It skipped ban::check, allowing patterns with banned bytes (e.g. NUL)
   to bypass the ban validation that the normal regex path enforces.

2. It was enabled even when config.unicode was true (the default), but
   only performed ASCII case folding (k → [kK]). The standard regex
   translator applies full Unicode case folding (k also matches K
   U+212A), so the fast path could produce false negatives.

Fix: reject patterns with banned bytes in the gate function, and only
enable the fast path when unicode mode is disabled.
benchsuite/bench.sh: Self-contained script that builds both upstream
and fork binaries, downloads the Linux kernel corpus if needed, and
runs 7 interleaved A/B benchmarks via hyperfine with --export-json.
Also measures peak RSS and computes throughput.

benchsuite/results/: JSON results from the initial benchmark run
(hyperfine 1.20.0, 15 runs, 5 warmups) on Apple M5 with 24 GB
unified memory, macOS 15.x.
fix: guard ASCII case-insensitive fast path against ignore_whitespace mode

When `ignore_whitespace=true` (regex verbose/`x` mode), the standard
parser treats spaces as insignificant and `#` as a comment marker. The
ASCII case-insensitive fast path bypasses the parser entirely, so it
would incorrectly treat spaces and `#` as literal characters, producing
wrong match semantics. Bail out of the fast path when verbose mode is on.
…lter

* regex: add ASCII case-insensitive literal prefilter bypass

* fix: guard ASCII case-insensitive fast path against banned bytes and unicode mode

* fix: guard ASCII case-insensitive fast path against ignore_whitespace mode

---------

Co-authored-by: devin-ai-integration[bot] <158243242+devin-ai-integration[bot]@users.noreply.github.com>
The prefilter was gated behind `if self.unicode { return false; }`,
meaning it never activated in the default unicode=true mode. Only 4
ASCII bytes (K, k, S, s) have non-ASCII Unicode case folds, so replace
the blanket guard with a per-byte check. Patterns without those bytes
now get the SIMD literal prefilter with -i, recovering a ~1.75x penalty.
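The per-byte guard reduces to a sketch like the following (hypothetical function name). 'K'/'k' are special because of KELVIN SIGN U+212A, and 'S'/'s' because of LATIN SMALL LETTER LONG S U+017F:

```rust
// True if the pattern contains any of the only four ASCII bytes whose
// Unicode case folds include non-ASCII codepoints; such patterns must
// use the full Unicode translator rather than the ASCII fast path.
fn has_non_ascii_case_fold(pattern: &str) -> bool {
    pattern.bytes().any(|b| matches!(b, b'K' | b'k' | b'S' | b's'))
}

fn main() {
    assert!(has_non_ascii_case_fold("kelvin"));
    assert!(!has_non_ascii_case_fold("heap"));
    println!("ok");
}
```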
Add reproducible A/B benchmark script for Apple Silicon
regex: allow ASCII case-insensitive prefilter in unicode mode
Remove dead taskpolicy code and unreachable cold-cache check. Move
correctness verification outside the measurement loop to avoid perturbing
cache/thermal state. Fix CSV iter column (was constant, now sample index)
and lines column (was out-of-bounds index, now uses single expected value).
Fix power calculation to use means instead of medians for Cohen's d, and
round U statistic instead of truncating. Add --seed flag for reproducible
benchmark ordering, baseline comparison warnings, CPU isolation note for
Apple Silicon, and adaptive thread counts based on P-core topology.
fix: address 10 QA issues in arm_bench.py
The existing `benchsuite/benchsuite --download linux` command tries to
build the Linux kernel after cloning, which fails on macOS due to
incompatible linker/toolchain. Since arm_bench.py only needs the source
tree for search benchmarks (not a built kernel), this adds a
`--download` flag directly to arm_bench.py that clones without building.

Supports: --download linux, --download subtitles-en, --download all.
Idempotent -- skips corpora that are already present.
fix: add --download to arm_bench.py for macOS corpus setup
bench: tighten arm benchmark reliability controls

3 participants