Fix O(n²) complexity in prepareList#50

Open
heiskr wants to merge 1 commit into syntax-tree:main from heiskr:fix/prepare-list-quadratic-complexity

Conversation


@heiskr heiskr commented Apr 9, 2026

Initial checklist

  • I read the support docs
  • I read the contributing guide
  • I agree to follow the code of conduct
  • I searched issues and discussions and couldn't find anything or linked relevant results below
  • I made sure the docs are up to date
  • I included tests (or that's not needed)

Description of changes

prepareList calls events.splice() twice per list item, making it O(n²). This change defers those insertions, collecting them during the walk and applying them in a single backward merge pass, making it O(n). It also tightens the backward line-ending scan to stop at the list's start instead of 0.

Fixes #49

Defer events.splice calls and apply them in a single backward
merge pass. Also tighten the backward line-ending scan to stop
at the list start.

Fixes syntax-tree#49

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
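The core idea can be sketched as follows: a simplified model with hypothetical names and plain values in place of micromark event tuples, not the actual patch in prepareList.

```javascript
// O(n^2): each single-element splice shifts the whole suffix of the
// array, so K insertions into N events cost O(K * N) element moves.
// `insertions` is [{at, event}], sorted ascending by original index.
function insertWithSplice(events, insertions) {
  let offset = 0;
  for (const {at, event} of insertions) {
    events.splice(at + offset, 0, event);
    offset++;
  }
  return events;
}

// O(n): collect insertions during the walk, then merge once, walking
// backward so every output slot is written exactly once.
function insertWithMerge(events, insertions) {
  const result = new Array(events.length + insertions.length);
  let write = result.length - 1;
  let read = events.length - 1;
  let queue = insertions.length - 1;
  while (write >= 0) {
    if (queue >= 0 && insertions[queue].at > read) {
      result[write--] = insertions[queue--].event;
    } else {
      result[write--] = events[read--];
    }
  }
  return result;
}
```

Both functions produce the same array; the merge version never re-shifts already-placed elements.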
@github-actions github-actions Bot added the 👋 phase/new Post is being triaged automatically label Apr 9, 2026

@github-actions github-actions Bot added 🤞 phase/open Post is being triaged manually and removed 👋 phase/new Post is being triaged automatically labels Apr 9, 2026
@ChristianMurphy ChristianMurphy added the 🏁 area/perf This affects performance label Apr 9, 2026

Copilot AI left a comment


Pull request overview

Improves list preprocessing performance by eliminating per-item events.splice() calls in prepareList, reducing complexity from O(n²) to O(n) for large lists.

Changes:

  • Defers listItem enter/exit insertions by collecting insertion positions/events during the walk.
  • Applies all deferred insertions in a single backward merge pass to avoid repeated array shifting.
  • Tightens the backward line-ending scan to stop at the current list’s start index.


Member

@Murderlon Murderlon left a comment


Also no comments from Devin review. LGTM

Member

@remcohaszing remcohaszing left a comment


I’m all for it if this is indeed more performant and CI passes.

But I would really appreciate a review from either @ChristianMurphy or @wooorm as I believe micromark is more their area of expertise.

@ChristianMurphy
Member

Regression tests are interesting here to see the broader impact: I ran a few scenarios from the CommonMark corpus, plus some larger documents to stress test a bit more.
https://github.com/ChristianMurphy/mdast-util-from-markdown/tree/chore/perf-memory-bench
The regressions on the general CommonMark corpus, on smaller lists, and on nested lists worry me a bit.
Summary of results run on my machine explained in plain language by Claude:

<claude>

TL;DR

The PR delivers exactly the speedup it advertises on its target case, including a 31.9 % wall-clock reduction on a synthetic stand-in for the GitHub Docs GraphQL reference page (442 ms → 301 ms). On large flat lists the speedup is even larger: 38.5 % on 10 000-item lists.

But the rewrite has measurable cost at smaller scales and on shapes the original code handled better. Nested lists regress 5–15 % across every size measured (100, 500, 1 000, 2 500 items), and the all-of-CommonMark-spec concatenation regresses 8.9 %. Memory is essentially flat: the deferred-merge approach does not blow up heap (heap delta geomean across real-docs is 1.001), and peak RSS differences are at the KB level.

The PR is a clear win for the GitHub Docs use case and other large-list scenarios. Whether it is the right trade for the wider corpus depends on how the maintainers weight "rare-but-bad" against "common-and-mildly-slower."


Headline numbers

Geometric mean of pr / baseline per input class. Values < 1.0 mean the PR is faster / uses less. Heap and peak-RSS columns are at the KB granularity at which process.memoryUsage() reports.

| class | inputs | time | heap | peak RSS |
| --- | ---: | ---: | ---: | ---: |
| lists | 20 | 0.932 | 0.967 | 1.252 |
| pathological | 10 | 0.996 | 1.002 | 1.051 |
| real-docs | 656 | 1.004 | 1.001 | 1.222 |
| fuzz | 100 | 1.009 | 1.002 | |

Reading: lists are 6.8 % faster on average. Pathological and real-docs are statistically flat. The peak-RSS column has heavy noise — most small inputs report 0 KB peak (parse fits in already-allocated memory), so the geomean is dominated by a handful of larger inputs and is not a reliable signal at this granularity. Heap delta is the cleaner memory metric and shows no movement.
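For reference, the geometric mean of per-input ratios used throughout this table can be computed with a trivial helper (not from the bench harness):

```javascript
// Geometric mean of pr/baseline ratios: exp of the mean of the logs.
// Values below 1.0 mean the PR is faster on average for that class.
function geomean(ratios) {
  const sumOfLogs = ratios.reduce((sum, r) => sum + Math.log(r), 0);
  return Math.exp(sumOfLogs / ratios.length);
}
```

Unlike an arithmetic mean of ratios, it treats a 2x speedup and a 2x slowdown as cancelling out.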


Where the PR wins

These are inputs with baseline time ≥ 10 ms, where measurement noise is small relative to the effect. Sorted by time ratio (best → worst).

| input | size | base (ms) | PR (ms) | ratio | comment |
| --- | ---: | ---: | ---: | ---: | --- |
| lists/flat-ordered-10000 | 158 KB | 410.3 | 252.5 | 0.615 | headline win on the 10 k-item case |
| lists/flat-unordered-10000 | 119 KB | 392.6 | 253.4 | 0.645 | same shape |
| real-docs/gh-docs-reference-6.4k | 413 KB | 442.4 | 301.3 | 0.681 | the GitHub Docs scenario from issue #49 |
| lists/flat-ordered-5000 | 78 KB | 174.0 | 127.7 | 0.734 | |
| lists/paragraph-per-item-100 | 10 KB | 6.6 | 5.2 | 0.797 | rare small-input win |
| real-docs/gh-docs-reference-3k | 193 KB | 176.8 | 146.1 | 0.827 | |
| lists/flat-unordered-5000 | 59 KB | 150.5 | 126.3 | 0.840 | |
| lists/flat-ordered-2500 | 38 KB | 84.0 | 77.4 | 0.921 | |

The shape of the speedup curve matches the algorithmic claim: the ratio approaches 0 as N grows because the original code is O(n²) and the PR is O(n). At N = 10 000, the splice-shift cost dominates everything else parsing does, which is why the saving is so large.
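A back-of-the-envelope model of element moves (illustrative only, not measured data) shows why the ratio keeps shrinking as N grows:

```javascript
// Worst-case moves for K single-element splices into an array of N
// events, assuming each insertion lands near the front so roughly the
// whole suffix shifts every time.
function spliceMoves(n, k) {
  let moves = 0;
  for (let i = 0; i < k; i++) moves += n + i; // suffix shifted per splice
  return moves;
}

// A single batched merge writes each slot of the output exactly once.
function mergeMoves(n, k) {
  return n + k;
}
```

At n = k = 10000 the model predicts roughly 150 million moves for the splice loop versus 20000 writes for the merge, which is consistent with the splice-shift cost dominating the parse at that size.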


Where the PR regresses

Same filter (baseline ≥ 10 ms), sorted by ratio worst-first.

| input | size | base (ms) | PR (ms) | ratio | comment |
| --- | ---: | ---: | ---: | ---: | --- |
| lists/nested-unordered-100 | 4 KB | 13.3 | 15.4 | 1.155 | nested lists are the consistent regression |
| lists/flat-ordered-1000 | 14 KB | 30.4 | 34.0 | 1.119 | flat list at 1 k items got slower |
| lists/nested-unordered-500 | 21 KB | 58.4 | 65.0 | 1.114 | |
| pathological/attention-runs-100 | 40 KB | 11.6 | 12.6 | 1.089 | unrelated code path |
| real-docs/commonmark-spec/concat | 16 KB | 35.2 | 38.4 | 1.089 | full spec concatenated |
| lists/nested-unordered-2500 | 112 KB | 323.4 | 346.9 | 1.073 | |
| lists/nested-unordered-1000 | 43 KB | 126.7 | 133.2 | 1.052 | |

The pattern across the four nested-list sizes (100 / 500 / 1 000 / 2 500) is the most actionable signal here. Every size regresses, with the ratio holding roughly steady around 1.05–1.15. That means nested lists are not a small-N artifact: the new code is consistently a few percent slower on this shape across all sizes tested.

real-docs/commonmark-spec/concat is also worth attention — it is the closest thing in the corpus to "a real document" rather than a synthetic, and it regresses 8.9 %.


Sub-millisecond noise band (the other 85 regressions)

The pass/fail gate also flagged 85 inputs whose baseline time is below 1 ms — almost entirely individual CommonMark spec examples and tiny fuzz seeds. Distribution of all 92 flagged regressions by baseline time bucket:

| baseline time | flagged | reading |
| --- | ---: | --- |
| < 1 ms | 85 | timer noise dominates; even with median-of-9 + p95-required-too, +5 % is ~50 µs at this scale |
| 1–10 ms | 0 | clean band |
| ≥ 10 ms | 7 | the table above; real findings |

In other words, the gate's noise floor is the sub-millisecond range on this hardware, not the algorithm. A single-digit-percent shift on a 0.2 ms parse is one cache miss. I'd recommend filtering the strict gate to inputs with baseline ≥ 1 ms before treating the count of failures as a quality bar.


Memory profile

Heap delta is the trustworthy memory measurement here; peak RSS is sampled at setImmediate cadence and resolves at KB granularity, so any input that fits comfortably in already-allocated memory reports 0 and the ratio is undefined or noisy.

For inputs large enough to actually grow the heap:

| input | heap base (KB) | heap PR (KB) | Δ |
| --- | ---: | ---: | --- |
| lists/flat-ordered-10000 | 164 865 | 136 055 | −28.8 MB |
| lists/flat-unordered-10000 | 135 519 | 138 150 | +2.6 MB |
| lists/flat-unordered-5000 | 115 637 | 67 714 | −47.9 MB (repeatable: the lists-only run showed −48 MB too) |
| real-docs/gh-docs-reference-6.4k | 124 282 | 124 837 | +0.5 MB |
| pathological/nested-blockquotes-500 | 289 429 | 289 413 | −0.02 MB |

The PR is not holding the deferred-insertion arrays as a lasting cost. After GC the resulting parse tree is the same size or smaller; in two of the largest list cases the PR uses less memory than baseline (the deferred-merge version produces less intermediate garbage during parse, which means less max heap usage at GC checkpoints). The "list-class heap geomean = 0.967" headline reflects this.

Peak RSS values for the large list and gh-docs inputs are within ±1 % of baseline, which is below the noise floor of process.memoryUsage().rss sampling.


Methodology

For each (input, impl):

  1. global.gc() (Node started with --expose-gc).
  2. Snapshot heapUsed and rss.
  3. Start a setImmediate-driven peak-RSS sampler.
  4. Record performance.now(), call fromMarkdown(text), record performance.now() again.
  5. Stop sampler. Snapshot heap and RSS after.

Runs are interleaved: B P B P … B P (11 of each). Per (input, impl) we drop the highest and lowest, then take median + p95 over the remaining 9.

Hard ceilings per run: 30 s wall-clock, 1 GiB heap delta. None hit on this run.

I did not use Benchmark.js, mitata, or tinybench. Those are tuned for sub-millisecond microbenchmarks where measurement overhead dominates; this workload is in the millisecond-to-second range where wall-clock noise is the constraint, and none of them sample memory mid-run or do paired-impl comparison the way this needed.

The bench harness, including reproduction commands, lives at bench/ in the repo. CSV is at bench/out/latest.csv; the auto-generated full summary is at bench/out/latest.md.
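One measured run of the five-step procedure above can be sketched roughly as follows (hypothetical harness code, not the actual bench/ implementation; assumes Node started with --expose-gc):

```javascript
// One measured run for a given (input, impl) pair. `parse` stands in
// for fromMarkdown; `text` is the input document.
function measureOnce(parse, text) {
  if (typeof global.gc === 'function') global.gc(); // step 1: collect first
  const before = process.memoryUsage(); // step 2: snapshot heapUsed and rss

  // Step 3: setImmediate-driven peak-RSS sampler. Note it cannot fire
  // while the synchronous parse runs, which is one reason peak RSS is a
  // coarse, noisy metric compared with heap delta.
  let peakRss = before.rss;
  let sampling = true;
  const sample = () => {
    if (!sampling) return;
    peakRss = Math.max(peakRss, process.memoryUsage().rss);
    setImmediate(sample);
  };
  setImmediate(sample);

  const start = performance.now(); // step 4: time the parse
  parse(text);
  const ms = performance.now() - start;

  sampling = false; // step 5: stop sampler, snapshot after
  const after = process.memoryUsage();
  return {ms, heapDelta: after.heapUsed - before.heapUsed, peakRss};
}
```

Interleaving baseline and PR runs (B P B P …) then keeps slow machine-wide drift from biasing one implementation.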


Recommendation

The PR is worth landing for the workload it targets. The 32 % saving on the GitHub Docs page shape (issue #49) is the marquee result, and the 38 % saving on 10 k-item lists confirms the asymptotic claim.

Before landing, I'd want either an answer or a measurement on three things:

  1. Why do nested lists regress? The pattern is consistent across four sizes — that smells like algorithmic, not noise. The new backward-merge pass walks events once more than the original splice approach; on inputs where prepareList is invoked many times (each nesting level triggers it) the constant-factor overhead of building two arrays and doing the merge pass might outweigh the splice savings when each call only has a handful of items to insert.
  2. Why does flat-ordered-1000 regress while flat-ordered-2500 already wins? The crossover point matters. If the PR is a net loss below ~1 500 items and a net win above, that's a reasonable trade for most real workloads. If the regression is shape-specific rather than size-specific, that's worth understanding before merge.
  3. Is commonmark-spec/concat representative? That is the closest thing in this corpus to "real markdown" rather than synthetic shapes. Its 8.9 % regression is small but real. It might be that this concatenation has many small lists, in which case it tells the same story as nested-unordered-100 — many prepareList calls each with few items.

A small follow-up — only invoke the new merge-pass branch when insertCount exceeds some threshold (say 4 or 8) and fall back to the original splice loop otherwise — would likely turn every regression here into a tie while keeping the big-N wins. Worth measuring before requesting it on the PR.
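That hybrid could look roughly like this (a hypothetical sketch with placeholder names; the threshold value would need the measurement suggested above):

```javascript
// Assumed threshold, to be swept on real inputs before fixing it.
const SMALL_LIST_LIMIT = 8;

// `insertions` is [{at, event}], sorted ascending by original index.
function applyInsertions(events, insertions) {
  if (insertions.length <= SMALL_LIST_LIMIT) {
    // Small lists: original splice approach, applied in reverse order
    // so positions not yet spliced stay valid. Avoids the merge pass's
    // allocation overhead when there are only a handful of items.
    for (let i = insertions.length - 1; i >= 0; i--) {
      events.splice(insertions[i].at, 0, insertions[i].event);
    }
    return events;
  }
  // Large lists: single backward merge into a fresh array, writing
  // every output slot exactly once.
  const result = new Array(events.length + insertions.length);
  let write = result.length - 1;
  let read = events.length - 1;
  let queue = insertions.length - 1;
  while (write >= 0) {
    result[write--] =
      queue >= 0 && insertions[queue].at > read
        ? insertions[queue--].event
        : events[read--];
  }
  return result;
}
```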

</claude>

ChristianMurphy pushed a commit to ChristianMurphy/mdast-util-from-markdown that referenced this pull request May 3, 2026
Problem: prepareList synthesises listItem enter/exit events one at a
time via events.splice(at, 0, [event]), and each splice shifts the
suffix of the events array. For a list with K items inside an array of
N events that is O(K * N) shift work per list — the dominant cost in
mdast-util-from-markdown's contribution to wide-list inputs and the
slowdown observed at depth on issue syntax-tree#49 / PR syntax-tree#50.

Goal: collapse the per-item splices into a single batched rewrite of
the list's event range, while preserving the existing tail-walk
semantics that determine where each listItem's exit should be inserted.

Changes:
- dev/lib/index.js: replace the inline events.splice calls in
  prepareList with an insertions[] queue collected during the walk.
  After the loop, apply the queued insertions in one pass:
  - small lists (<= 8 insertions, the common case including deeply
    nested lists where many tiny ranges would otherwise pay rebuild
    overhead) splice each insertion in reverse order so unsplice'd
    positions stay valid;
  - wide lists go through a batched newSub rebuild and use a chunked
    spread to avoid V8's argument-count limit when newSub > 5000.

Inputs that benefit (multi-run median-of-medians vs baseline; spread
in parentheses):
- p-wide-list (10000 single-level items): -38.0% (7.9%)
- p-many-headings: -18.3% (3.8%)
- xs (one CommonMark example): -8.3% (7.5%)
- l (~564 KB CommonMark spec * 35): -8.0% (2.6%)
- s (full CommonMark spec): -7.2% (11.3%)
- m (CommonMark spec * 7): -2.5% (3.2%)
- p-deep-list (256 nested levels): -2.4% but spread is 46.5% on this
  stack; treat as inside its own noise band.

Single-run full corpus shows the same direction on every other input
that contains at least one list (p-many-fenced-code -25.6%,
p-many-images -23.2%, p-many-char-refs -23.9%, p-many-links -26.5%,
p-tab-heavy -23.2%, p-html-blocks -17.1%, etc.).

Trade-offs / inputs that do not contain lists:
- legacy-strong / legacy-strong-emph (raw 'a**b' x 1e4 emphasis
  patterns) reported +13% / +28% on a single run, but their input
  contains zero lists so prepareList is never invoked. Cross-run
  spread on these scenarios is 44-52% on the baseline alone; the +/-
  numbers here are noise inside that band.
- p-long-para, p-unicode-heavy, p-mismatched-emph: also list-free; all
  three move within +/-3% of baseline (noise).

Tests: dev + prod 1448/1448. mdast-util-gfm 54/54. mdast-util-mdx
11/13 — the two failing tests reproduce on upstream/main and are not
introduced by this branch.

Closes syntax-tree#49
Refs syntax-tree#50
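The chunked spread mentioned in this commit can be sketched generically (this is not the project's code; micromark-util-chunked ships a similar splice helper, and the 10000 chunk size matches the threshold the commit cites):

```javascript
// Insert `items` into `list` at `start`, deleting `deleteCount` first,
// without ever passing more items to splice than V8's argument-count
// limit tolerates.
function chunkedSplice(list, start, deleteCount, items) {
  const limit = 10000; // assumed safe chunk size
  if (items.length < limit) {
    list.splice(start, deleteCount, ...items);
    return;
  }
  list.splice(start, deleteCount);
  let at = start;
  for (let i = 0; i < items.length; i += limit) {
    const chunk = items.slice(i, i + limit);
    list.splice(at, 0, ...chunk); // each spread stays under the limit
    at += chunk.length;
  }
}
```

A plain `list.splice(start, 0, ...items)` with tens of thousands of items can throw a RangeError, which is why the batched rebuild path needs the chunking.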
ChristianMurphy added a commit to ChristianMurphy/mdast-util-from-markdown that referenced this pull request May 3, 2026
Octopus merge of the three independent mdast-util-from-markdown perf
branches into one rollup so maintainers can evaluate the cumulative
impact on a single bench run. Each underlying branch is also pushed
on its own and can land independently.

Branches merged:
- perf/prepare-list-no-splice — batch listItem insertions (Closes syntax-tree#49,
  Refs syntax-tree#50)
- perf/dispatch-context-reuse — single shared dispatch this-binding
- perf/stable-node-shape — pre-declare position in every node factory

Cumulative impact (multi-run median-of-medians vs the mdast baseline;
spread in parentheses):
- p-many-char-refs (10k '&amp;' entities): -43.8% (7.5%)
- p-wide-list (10k single-level items): -42.9% (11.6% — borderline)
- l (~564 KB CommonMark spec * 35): -13.8% (1.0% — very clean)
- s (full CommonMark spec): -12.3% (7.9%)
- m (CommonMark spec * 7): -8.5% (1.0% — very clean)
- legacy-base (40 KB 'xxxx' x 1e4): -4.0% (9.0%)

Single-run full corpus shows wins on 20 of 27 scenarios in the range
-1% to -43%. The largest pathological wins are p-many-char-refs
-42.2%, p-many-fenced-code -40.7%, p-wide-list -37.0%, p-tab-heavy
-35.6%, p-many-headings -28.2%, p-code-spans -24.4%, p-many-links
-22.3%, p-many-images -21.5%, p-html-blocks -15.3%, xs -16.4%.

Trade-offs / inputs that do not improve:
- p-long-para: +5.3% multi-run (spread 9.4%) and +11.6% single-run.
  This input is one giant text node; none of the three changes target
  the single-text-node path. The +5.3% is on the edge of its noise
  band but is the largest credible regression of the rollup.
- p-unicode-heavy: +2.3% multi-run (spread 11.4%, NOISY) — within the
  scenario's noise band.
- p-mismatched-emph: +0.8% multi-run (7.3%) — flat.
- legacy-strong / legacy-strong-emph: +44% / +58% single-run, but
  these scenarios have 12-sample mandatory floor on this stack and
  cross-run spread of 44-53% on baseline alone; the deltas reported
  here are inside that band. The input is pure 'a**b' x 1e4 with no
  lists / few node creations / few mid-event sliceSerialize variants,
  so none of the three optimisations target what this scenario
  exercises.

Tests: dev + prod 1448/1448. mdast-util-gfm 54/54. mdast-util-mdx
11/13 — the two failing tests reproduce on upstream/main and are not
introduced by this branch.
ChristianMurphy pushed a commit to ChristianMurphy/mdast-util-from-markdown that referenced this pull request May 3, 2026
prepareList synthesises listItem enter and exit events one item at a
time using events.splice(at, 0, [event]). Each splice shifts the
suffix of the events array, so a list with K items inside an array of
N events does O(K * N) shift work. This is the dominant cost in
mdast-util-from-markdown's contribution to wide-list inputs and the
slowdown reported at depth on issue syntax-tree#49 / PR syntax-tree#50.

The fix collects the would-be splices into an insertions queue during
the existing walk and applies them outside the loop in one pass. Two
paths handle the work efficiently: lists with up to 8 insertions take
a fast path that splices each insertion in reverse order so
unsplice'd positions stay valid, which avoids the cost of allocating
a fresh sub-array; longer lists go through a batched rebuild and use
a chunked spread so the splice never hits V8's argument count limit.

Inputs that benefit, with multi-run median-of-medians vs the baseline
(spread in parentheses):

  10,000 single-level list items     -38.0%  (7.9%)
  5,000 ATX headings                 -18.3%  (3.8%)
  one CommonMark example              -8.3%  (7.5%)
  CommonMark spec * 35  (~564 KB)     -8.0%  (2.6%)
  full CommonMark spec  (~16 KB)      -7.2%  (11.3%)
  CommonMark spec * 7  (~113 KB)      -2.5%  (3.2%)
  256 nested ordered list levels      -2.4%  (46.5% spread, treat as
                                              flat on this stack)

Single-run full corpus runs show the same direction on every other
input that contains at least one list, with wins of -17% to -27% on
inputs heavy in fenced code blocks, images, character references,
inline links, tabs, and HTML blocks. The largest improvement is the
10,000-item single-level list input, which is the worst case for the
old per-item splice loop.

Trade-offs and inputs that do not move:

Inputs that contain no lists are unaffected by the change because
prepareList is never invoked. The pure emphasis stress inputs ('a**b'
repeated 10,000 times and similar) reported +13% and +28% on a
single run, but those inputs have a cross-run spread of 44 to 52% on
the baseline alone, so the apparent regressions sit inside their own
noise band. A 1 MB single paragraph, a Unicode-heavy 256 KB input,
and 10,000 unmatched asterisks all moved within +/- 3% of baseline.

Tests pass: dev + prod 1448/1448, mdast-util-gfm 54/54,
mdast-util-mdx 11/13. The two failing mdx tests reproduce on
upstream/main and are not introduced by this branch.

Closes syntax-tree#49
Refs syntax-tree#50
ChristianMurphy added a commit to ChristianMurphy/mdast-util-from-markdown that referenced this pull request May 3, 2026
Octopus merge of the three independent perf branches into one rollup
so reviewers can evaluate the cumulative impact on a single bench run.
Each underlying branch is also pushed on its own and can land
independently.

Branches merged:
- perf/prepare-list-no-splice  (Closes syntax-tree#49, Refs syntax-tree#50)
- perf/dispatch-context-reuse
- perf/stable-node-shape

Cumulative impact, multi-run median-of-medians vs the baseline
(spread in parentheses):

  10,000 character entity references     -43.8%  (7.5%)
  10,000 single-level list items         -42.9%  (11.6%, borderline)
  CommonMark spec * 35  (~564 KB)        -13.8%  (1.0%, very clean)
  full CommonMark spec  (~16 KB)         -12.3%  (7.9%)
  CommonMark spec * 7  (~113 KB)          -8.5%  (1.0%, very clean)
  'xxxx' x 10,000  (~40 KB)               -4.0%  (9.0%)

Single-run full corpus shows wins on 20 of 27 inputs, ranging from
-1% to -43%. The largest pathological wins are inputs heavy in
character entity references (-42.2%), fenced code blocks (-40.7%),
single-level list items (-37.0%), tabs (-35.6%), ATX headings
(-28.2%), backtick code spans (-24.4%), inline links (-22.3%),
inline images (-21.5%), HTML blocks (-15.3%), and one CommonMark
example (-16.4%).

Trade-offs:

A 1 MB single paragraph reported +5.3% multi-run with a 9.4% spread,
and +11.6% on a single full-corpus run. None of the three changes
target the single-text-node path, so the small regression is the
edge of that input's noise band. A 256 KB Unicode-heavy input
reported +2.3% multi-run inside its 11.4% spread (treat as flat).
A 10,000-unmatched-asterisk input moved +0.8% multi-run (flat).

The pure emphasis stress inputs ('a**b' repeated 10,000 times and
similar) reported +44% and +58% on a single run, but their cross-run
spread is 44 to 53% on the baseline alone. The input shape (almost
all attentionSequence events that mostly do not match a handler, no
lists, no node-creation hot path) means none of the three
optimisations can target what these inputs exercise. Treat the
deltas as noise.

Tests pass: dev + prod 1448/1448, mdast-util-gfm 54/54,
mdast-util-mdx 11/13. The two failing mdx tests reproduce on
upstream/main and are not introduced by this branch.
ChristianMurphy added a commit to ChristianMurphy/mdast-util-from-markdown that referenced this pull request May 3, 2026
prepareList synthesises listItem enter and exit events one item at a
time using events.splice(at, 0, [event]). Each splice shifts the
suffix of the events array, so a list with K items inside an array of
N events does O(K * N) shift work. This is the dominant cost in
mdast-util-from-markdown's contribution to wide-list inputs and the
slowdown reported at depth on issue syntax-tree#49 / PR syntax-tree#50.

The fix collects the would-be splices into an insertions queue during
the existing walk and applies them outside the loop in one pass. Two
paths handle the work efficiently: lists with up to 8 insertions take
a fast path that splices each insertion in reverse order so
unsplice'd positions stay valid, which avoids the cost of allocating
a fresh sub-array; longer lists go through a batched rebuild and use
a chunked spread so the splice never hits V8's argument count limit.

Both thresholds were swept on real inputs rather than picked by feel.
SMALL_LIST_LIMIT was tested at {0, 2, 4, 8, 16, 32, 64}: deeper
nesting (a 256-level list) preferred 16 to 64 (around 21% faster than
0), but typical documents (CommonMark spec, spec * 7, spec * 35)
preferred lower values because the rebuild path's allocation +
sort overhead outweighs the saved splice work when lists are 4 to 12
items each. 8 sits at the balance point and keeps the validated
multi-run wins on the typical-document inputs. The chunked-spread
threshold was tested at {1000, 2000, 5000, 10000, 20000, 70000};
10000 was the lowest median across the 10000-item single-level list
and both spec-derived inputs and matches the threshold
micromark-util-chunked already uses for its own splice helper.

Inputs that benefit, with multi-run median-of-medians vs the baseline
(spread in parentheses):

  10,000 single-level list items     -38.0%  (7.9%)
  5,000 ATX headings                 -18.3%  (3.8%)
  one CommonMark example              -8.3%  (7.5%)
  CommonMark spec * 35  (~564 KB)     -8.0%  (2.6%)
  full CommonMark spec  (~16 KB)      -7.2%  (11.3%)
  CommonMark spec * 7  (~113 KB)      -2.5%  (3.2%)
  256 nested ordered list levels      -2.4%  (46.5% spread, treat as
                                              flat on this stack)

Single-run full corpus runs show the same direction on every other
input that contains at least one list, with wins of -17% to -27% on
inputs heavy in fenced code blocks, images, character references,
inline links, tabs, and HTML blocks. The largest improvement is the
10,000-item single-level list input, which is the worst case for the
old per-item splice loop.

Trade-offs and inputs that do not move:

Inputs that contain no lists are unaffected by the change because
prepareList is never invoked. The pure emphasis stress inputs ('a**b'
repeated 10,000 times and similar) reported +13% and +28% on a
single run, but those inputs have a cross-run spread of 44 to 52% on
the baseline alone, so the apparent regressions sit inside their own
noise band. A 1 MB single paragraph, a Unicode-heavy 256 KB input,
and 10,000 unmatched asterisks all moved within +/- 3% of baseline.

Tests pass: dev + prod 1448/1448, mdast-util-gfm 54/54,
mdast-util-mdx 11/13. The two failing mdx tests reproduce on
upstream/main and are not introduced by this branch.

Closes syntax-tree#49
Refs syntax-tree#50
ChristianMurphy added a commit to ChristianMurphy/mdast-util-from-markdown that referenced this pull request May 3, 2026
Octopus merge of the three independent perf branches into one rollup
so reviewers can evaluate the cumulative impact on a single bench run.
Each underlying branch is also pushed on its own and can land
independently.

Branches merged:
- perf/prepare-list-no-splice  (Closes syntax-tree#49, Refs syntax-tree#50)
- perf/dispatch-context-reuse
- perf/stable-node-shape

Cumulative impact, multi-run median-of-medians vs the baseline
(spread in parentheses):

  10,000 character entity references     -43.8%  (7.5%)
  10,000 single-level list items         -42.9%  (11.6%, borderline)
  CommonMark spec * 35  (~564 KB)        -13.8%  (1.0%, very clean)
  full CommonMark spec  (~16 KB)         -12.3%  (7.9%)
  CommonMark spec * 7  (~113 KB)          -8.5%  (1.0%, very clean)
  'xxxx' x 10,000  (~40 KB)               -4.0%  (9.0%)

Single-run full corpus shows wins on 20 of 27 inputs, ranging from
-1% to -43%. The largest pathological wins are inputs heavy in
character entity references (-42.2%), fenced code blocks (-40.7%),
single-level list items (-37.0%), tabs (-35.6%), ATX headings
(-28.2%), backtick code spans (-24.4%), inline links (-22.3%),
inline images (-21.5%), HTML blocks (-15.3%), and one CommonMark
example (-16.4%).

Trade-offs:

A 1 MB single paragraph reported +5.3% multi-run with a 9.4% spread,
and +11.6% on a single full-corpus run. None of the three changes
target the single-text-node path, so the small regression is the
edge of that input's noise band. A 256 KB Unicode-heavy input
reported +2.3% multi-run inside its 11.4% spread (treat as flat).
A 10,000-unmatched-asterisk input moved +0.8% multi-run (flat).

The pure emphasis stress inputs ('a**b' repeated 10,000 times and
similar) reported +44% and +58% on a single run, but their cross-run
spread is 44 to 53% on the baseline alone. The input shape (almost
all attentionSequence events that mostly do not match a handler, no
lists, no node-creation hot path) means none of the three
optimizations can target what these inputs exercise. Treat the
deltas as noise.

Tests pass: dev + prod 1448/1448, mdast-util-gfm 54/54,
mdast-util-mdx 11/13. The two failing mdx tests reproduce on
upstream/main and are not introduced by this branch.
ChristianMurphy added a commit to ChristianMurphy/mdast-util-from-markdown that referenced this pull request May 3, 2026
prepareList synthesizes listItem enter and exit events one item at a
time using events.splice(at, 0, [event]). Each splice shifts the
suffix of the events array, so a list with K items inside an array of
N events does O(K * N) shift work. This is the dominant cost in
mdast-util-from-markdown's contribution to wide-list inputs and the
slowdown reported at depth on issue syntax-tree#49 / PR syntax-tree#50.

The fix collects the would-be splices into an insertions queue during
the existing walk and applies them outside the loop in one pass. Two
paths handle the work efficiently: lists with up to a small number of
insertions take a fast path that splices each insertion in reverse
order, so positions that have not yet been spliced stay valid and no
fresh sub-array needs to be allocated; longer lists go through a
batched rebuild with a chunked spread so the splice never exceeds
V8's argument count limit.

How the cut points were chosen:

There are two thresholds in the new code: SMALL_LIST_LIMIT chooses
between fast-path splice loop and rebuild; SAFE_SPREAD chooses
between a single spread and a chunked spread.

SMALL_LIST_LIMIT is a workload-dependent crossover. The fast path
costs O(K * suffix) because each of the K splices shifts the events
suffix; the rebuild path costs O(N + K) plus a fixed allocation and
sort overhead. Below some K the rebuild's constant overhead dominates;
above some K the fast path's K * suffix dominates. Because suffix
size and per-insertion splice cost both vary with document shape,
no single value is universally best: deeper nesting prefers higher
limits (everything stays on the splice loop), and documents with a
few moderately-sized lists prefer lower limits (the rebuild's lower
per-item cost wins). The threshold was chosen by sweeping
{0, 2, 4, 8, 16, 32, 64} with representative inputs from both
regimes and picking the value that kept the validated multi-run wins
on typical-document inputs without regressing the deep-nest case
beyond its own noise band.
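
The crossover can be illustrated with a toy cost model (hypothetical constants and names, only to show why no single threshold wins everywhere):

```javascript
// Toy cost model. The fast path does K splices, each shifting the
// suffix of the events array; the rebuild touches every event once
// plus a fixed allocation/sort overhead.
const REBUILD_OVERHEAD = 50 // hypothetical fixed cost units

function fastPathCost(k, suffix) {
  return k * suffix
}

function rebuildCost(n, k) {
  return n + k + REBUILD_OVERHEAD
}

// A short list with a short suffix: the fast path wins.
console.log(fastPathCost(4, 10) < rebuildCost(200, 4)) // true (40 < 254)

// A wide list with a long suffix: the rebuild wins.
console.log(fastPathCost(1000, 5000) < rebuildCost(20000, 1000)) // false
```

Because both `suffix` and `n` vary with document shape, the break-even `k` moves around, which is why the limit had to be swept on real inputs.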

SAFE_SPREAD is set by V8's argument count limit. Spreading a very
large array into events.splice can throw a stack overflow in some V8
versions, so the rebuild splits the new sub-array into chunks once
it exceeds a safe size. The chunk threshold was tested at
{1000, 2000, 5000, 10000, 20000, 70000}; 10000 had the lowest median
across the wide-list and spec-derived inputs and matches the
threshold micromark-util-chunked already uses for its own splice
helper, which tracks the same V8 constraint.
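
A chunked splice along these lines (illustrative constant and name, not the PR's exact code) keeps each call's argument count below the engine limit:

```javascript
// Insert `items` into `array` at `start` without spreading a huge
// array into a single splice call, which can overflow the stack in
// some V8 versions once the argument count gets too large.
const CHUNK = 10000 // illustrative; micromark-util-chunked uses the same value

function chunkedSplice(array, start, items) {
  if (items.length < CHUNK) {
    array.splice(start, 0, ...items)
  } else {
    // Earlier chunks shift later positions by exactly `i`, so
    // `start + i` stays the correct insertion point for each chunk.
    for (let i = 0; i < items.length; i += CHUNK) {
      array.splice(start + i, 0, ...items.slice(i, i + CHUNK))
    }
  }
  return array
}
```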

Inputs that benefit, with multi-run median-of-medians vs the baseline
(spread in parentheses):

  10,000 single-level list items     -38.0%  (7.9%)
  5,000 ATX headings                 -18.3%  (3.8%)
  one CommonMark example              -8.3%  (7.5%)
  CommonMark spec * 35  (~564 KB)     -8.0%  (2.6%)
  full CommonMark spec  (~16 KB)      -7.2%  (11.3%)
  CommonMark spec * 7  (~113 KB)      -2.5%  (3.2%)
  256 nested ordered list levels      -2.4%  (46.5% spread, treat as
                                              flat on this stack)

Single-run full corpus runs show the same direction on every other
input that contains at least one list, with wins of -17% to -27% on
inputs heavy in fenced code blocks, images, character references,
inline links, tabs, and HTML blocks. The largest improvement is the
10,000-item single-level list input, which is the worst case for the
old per-item splice loop.

Trade-offs and inputs that do not move:

Inputs that contain no lists are unaffected by the change because
prepareList is never invoked. The pure emphasis stress inputs ('a**b'
repeated 10,000 times and similar) reported +13% and +28% on a
single run, but those inputs have a cross-run spread of 44 to 52% on
the baseline alone, so the apparent regressions sit inside their own
noise band. A 1 MB single paragraph, a Unicode-heavy 256 KB input,
and 10,000 unmatched asterisks all moved within +/- 3% of baseline.

Tests pass: dev + prod 1448/1448, mdast-util-gfm 54/54,
mdast-util-mdx 11/13. The two failing mdx tests reproduce on
upstream/main and are not introduced by this branch.

Closes syntax-tree#49
Refs syntax-tree#50
@ChristianMurphy
Copy link
Copy Markdown
Member

I opened a follow-up at #51 that addresses the small-list and nested-list slowdown while keeping the speed-up for large lists

ChristianMurphy added a commit to ChristianMurphy/mdast-util-from-markdown that referenced this pull request May 3, 2026
Octopus merge of the three independent perf branches into one rollup
so reviewers can evaluate the cumulative impact on a single bench run.
Each underlying branch is also pushed on its own and can land
independently.

Branches merged:
- perf/prepare-list-no-splice  (Closes syntax-tree#49, Refs syntax-tree#50)
- perf/dispatch-context-reuse
- perf/stable-node-shape

Cumulative impact, multi-run median-of-medians vs the baseline
(spread in parentheses):

  10,000 character entity references     -43.8%  (7.5%)
  10,000 single-level list items         -42.9%  (11.6%, borderline)
  CommonMark spec * 35  (~564 KB)        -13.8%  (1.0%, very clean)
  full CommonMark spec  (~16 KB)         -12.3%  (7.9%)
  CommonMark spec * 7  (~113 KB)          -8.5%  (1.0%, very clean)
  'xxxx' x 10,000  (~40 KB)               -4.0%  (9.0%)

Single-run full corpus shows wins on 20 of 27 inputs, ranging from
-1% to -43%. The largest pathological wins are inputs heavy in
character entity references (-42.2%), fenced code blocks (-40.7%),
single-level list items (-37.0%), tabs (-35.6%), ATX headings
(-28.2%), backtick code spans (-24.4%), inline links (-22.3%),
inline images (-21.5%), HTML blocks (-15.3%), and one CommonMark
example (-16.4%).

Trade-offs:

A 1 MB single paragraph reported +5.3% multi-run with a 9.4% spread,
and +11.6% on a single full-corpus run. None of the three changes
target the single-text-node path, so the small regression is the
edge of that input's noise band. A 256 KB Unicode-heavy input
reported +2.3% multi-run inside its 11.4% spread (treat as flat).
A 10,000-unmatched-asterisk input moved +0.8% multi-run (flat).

The pure emphasis stress inputs ('a**b' repeated 10,000 times and
similar) reported +44% and +58% on a single run, but their cross-run
spread is 44 to 53% on the baseline alone. The input shape (almost
all attentionSequence events that mostly do not match a handler, no
lists, no node-creation hot path) means none of the three
optimizations can target what these inputs exercise. Treat the
deltas as noise.

Tests pass: dev + prod 1450/1450, 100% coverage maintained.
mdast-util-gfm 54/54, mdast-util-mdx 11/13. The two failing mdx tests
reproduce on upstream/main and are not introduced by this branch.
ChristianMurphy added a commit to ChristianMurphy/mdast-util-from-markdown that referenced this pull request May 3, 2026
prepareList synthesizes listItem enter and exit events one item at a
time using events.splice(at, 0, [event]). Each splice shifts the
suffix of the events array, so a list with K items inside an array of
N events does O(K * N) shift work. This is the dominant cost in
mdast-util-from-markdown's contribution to wide-list inputs and the
slowdown reported at depth on issue syntax-tree#49 / PR syntax-tree#50.

The fix collects the would-be splices into an insertions queue during
the existing walk and applies them outside the loop in one pass. Two
paths handle the work efficiently: lists with up to a small number of
insertions take a fast path that splices each insertion in reverse
order so unspliced positions stay valid, which avoids the cost of
allocating a fresh sub-array; longer lists go through a batched
rebuild and use a chunked spread so the splice never hits V8's
argument count limit.
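
As a sketch, the two paths described above could look like the following. The names (`applyInsertions`, `SMALL_LIST_LIMIT`) and the threshold value are illustrative, not the identifiers or tuned value in the patch; `insertions` is assumed to be the queue built during the walk, as `[at, event]` pairs in non-decreasing `at` order:

```javascript
// Hypothetical crossover value; the real one is chosen by benchmarking.
const SMALL_LIST_LIMIT = 16

/**
 * Merge queued insertions into `events`.
 * Fast path: a few reverse-order splices (earlier `at` positions stay
 * valid because later ones are applied first).
 * Slow path: one O(N + K) rebuild into a fresh array.
 */
function applyInsertions(events, insertions) {
  if (insertions.length <= SMALL_LIST_LIMIT) {
    for (let i = insertions.length - 1; i >= 0; i--) {
      const [at, event] = insertions[i]
      events.splice(at, 0, event)
    }
    return events
  }

  const result = []
  let next = 0
  for (const [at, event] of insertions) {
    while (next < at) result.push(events[next++])
    result.push(event)
  }
  while (next < events.length) result.push(events[next++])
  return result
}
```

Both paths produce the same merged array; reversing the fast-path loop keeps entries that share an `at` in their original relative order.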

How the cut points were chosen:

There are two thresholds in the new code: SMALL_LIST_LIMIT chooses
between fast-path splice loop and rebuild; SAFE_SPREAD chooses
between a single spread and a chunked spread.

SMALL_LIST_LIMIT is a workload-dependent crossover. The fast path
costs O(K * suffix) because each of the K splices shifts the events
suffix; the rebuild path costs O(N + K) plus a fixed allocation and
sort overhead. Below some K the rebuild's constant overhead dominates;
above some K the fast path's K * suffix dominates. Because suffix
size and per-insertion splice cost both vary with document shape,
no single value is universally best: deeper nesting prefers higher
limits (everything stays on the splice loop), and documents with a
few moderately-sized lists prefer lower limits (the rebuild's lower
per-item cost wins). The threshold was chosen by sweeping
{0, 2, 4, 8, 16, 32, 64} with representative inputs from both
regimes and picking the value that kept the validated multi-run wins
on typical-document inputs without regressing the deep-nest case
beyond its own noise band.

SAFE_SPREAD is set by V8's argument count limit. Spreading a very
large array into events.splice can throw a stack overflow in some V8
versions, so the rebuild splits the new sub-array into chunks once
it exceeds a safe size. The chunk threshold was tested at
{1000, 2000, 5000, 10000, 20000, 70000}; 10000 had the lowest median
across the wide-list and spec-derived inputs and matches the
threshold micromark-util-chunked already uses for its own splice
helper, which tracks the same V8 constraint.
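
A minimal sketch of the chunked-spread pattern referenced above (the function name is hypothetical and the `SAFE_SPREAD` value illustrative; this mirrors the approach of micromark-util-chunked's splice helper rather than reproducing its API):

```javascript
// Illustrative chunk size; each splice call stays below V8's
// argument count limit because at most SAFE_SPREAD items are spread.
const SAFE_SPREAD = 10000

function chunkedSplice(list, start, deleteCount, items) {
  if (items.length < SAFE_SPREAD) {
    list.splice(start, deleteCount, ...items)
    return
  }
  // Delete once, then insert chunk by chunk at advancing positions.
  if (deleteCount > 0) list.splice(start, deleteCount)
  let at = start
  for (let index = 0; index < items.length; index += SAFE_SPREAD) {
    const chunk = items.slice(index, index + SAFE_SPREAD)
    list.splice(at, 0, ...chunk)
    at += chunk.length
  }
}
```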

Inputs that benefit, with multi-run median-of-medians vs the baseline
(spread in parentheses):

  10,000 single-level list items     -38.0%  (7.9%)
  5,000 ATX headings                 -18.3%  (3.8%)
  one CommonMark example              -8.3%  (7.5%)
  CommonMark spec * 35  (~564 KB)     -8.0%  (2.6%)
  full CommonMark spec  (~16 KB)      -7.2%  (11.3%)
  CommonMark spec * 7  (~113 KB)      -2.5%  (3.2%)
  256 nested ordered list levels      -2.4%  (46.5% spread, treat as
                                              flat on this stack)

Single-run full corpus runs show the same direction on every other
input that contains at least one list, with wins of -17% to -27% on
inputs heavy in fenced code blocks, images, character references,
inline links, tabs, and HTML blocks. The largest improvement is the
10,000-item single-level list input, which is the worst case for the
old per-item splice loop.

Trade-offs and inputs that do not move:

Inputs that contain no lists are unaffected by the change because
prepareList is never invoked. The pure emphasis stress inputs ('a**b'
repeated 10,000 times and similar) reported +13% and +28% on a
single run, but those inputs have a cross-run spread of 44 to 52% on
the baseline alone, so the apparent regressions sit inside their own
noise band. A 1 MB single paragraph, a Unicode-heavy 256 KB input,
and 10,000 unmatched asterisks all moved within +/- 3% of baseline.

Tests pass: dev + prod 1448/1448, mdast-util-gfm 54/54,
mdast-util-mdx 11/13. The two failing mdx tests reproduce on
upstream/main and are not introduced by this branch.

Closes syntax-tree#49
Refs syntax-tree#50
ChristianMurphy added a commit to ChristianMurphy/mdast-util-from-markdown that referenced this pull request May 3, 2026
Octopus merge of the three independent perf branches into one rollup
so reviewers can evaluate the cumulative impact on a single bench run.
Each underlying branch is also pushed on its own and can land
independently.

Branches merged:
- perf/prepare-list-no-splice  (Closes syntax-tree#49, Refs syntax-tree#50)
- perf/dispatch-context-reuse
- perf/stable-node-shape

Cumulative impact, multi-run median-of-medians vs the baseline
(spread in parentheses):

  10,000 character entity references     -43.8%  (7.5%)
  10,000 single-level list items         -42.9%  (11.6%, borderline)
  CommonMark spec * 35  (~564 KB)        -13.8%  (1.0%, very clean)
  full CommonMark spec  (~16 KB)         -12.3%  (7.9%)
  CommonMark spec * 7  (~113 KB)          -8.5%  (1.0%, very clean)
  'xxxx' x 10,000  (~40 KB)               -4.0%  (9.0%)

Single-run full corpus shows wins on 20 of 27 inputs, ranging from
-1% to -43%. The largest pathological wins are inputs heavy in
character entity references (-42.2%), fenced code blocks (-40.7%),
single-level list items (-37.0%), tabs (-35.6%), ATX headings
(-28.2%), backtick code spans (-24.4%), inline links (-22.3%),
inline images (-21.5%), HTML blocks (-15.3%), and one CommonMark
example (-16.4%).

Trade-offs:

A 1 MB single paragraph reported +5.3% multi-run with a 9.4% spread,
and +11.6% on a single full-corpus run. None of the three changes
target the single-text-node path, so the small regression is the
edge of that input's noise band. A 256 KB Unicode-heavy input
reported +2.3% multi-run inside its 11.4% spread (treat as flat).
A 10,000-unmatched-asterisk input moved +0.8% multi-run (flat).

The pure emphasis stress inputs ('a**b' repeated 10,000 times and
similar) reported +44% and +58% on a single run, but their cross-run
spread is 44 to 53% on the baseline alone. The input shape (almost
all attentionSequence events that mostly do not match a handler, no
lists, no node-creation hot path) means none of the three
optimizations can target what these inputs exercise. Treat the
deltas as noise.

Tests pass: dev + prod 1450/1450, 100% coverage maintained.
mdast-util-gfm 54/54, mdast-util-mdx 11/13. The two failing mdx tests
reproduce on upstream/main and are not introduced by this branch.
ChristianMurphy added a commit to ChristianMurphy/mdast-util-from-markdown that referenced this pull request May 3, 2026
prepareList synthesizes listItem enter and exit events one item at a
time using events.splice(at, 0, [event]). Each splice shifts the
suffix of the events array, so a list with K items inside an array of
N events does O(K * N) shift work. This is the dominant cost in
mdast-util-from-markdown's contribution to wide-list inputs and the
slowdown reported at depth on issue syntax-tree#49 / PR syntax-tree#50.

The fix collects the would-be splices into an insertions queue during
the existing walk and applies them outside the loop in one pass. Two
paths handle the work efficiently: lists with up to a small number of
insertions take a fast path that splices each insertion in reverse
order so unspliced positions stay valid, which avoids the cost of
allocating a fresh sub-array; longer lists go through a batched
rebuild and use a chunked spread so the splice never hits V8's
argument count limit.

How the cut points were chosen:

There are two thresholds in the new code: SMALL_LIST_LIMIT chooses
between fast-path splice loop and rebuild; SAFE_SPREAD chooses
between a single spread and a chunked spread.

SMALL_LIST_LIMIT is a workload-dependent crossover. The fast path
costs O(K * suffix) because each of the K splices shifts the events
suffix; the rebuild path costs O(N + K) plus a fixed allocation
overhead. Below some K the rebuild's constant overhead dominates;
above some K the fast path's K * suffix dominates. Because suffix
size and per-insertion splice cost both vary with document shape,
no single value is universally best: deeper nesting prefers higher
limits (everything stays on the splice loop), and documents with a
few moderately-sized lists prefer lower limits (the rebuild's lower
per-item cost wins). The threshold was chosen by sweeping
{0, 2, 4, 8, 16, 32, 64} with representative inputs from both
regimes and picking the value that kept the validated multi-run wins
on typical-document inputs without regressing the deep-nest case
beyond its own noise band.

SAFE_SPREAD is set by V8's argument count limit. Spreading a very
large array into events.splice can throw a stack overflow in some V8
versions, so the rebuild splits the new sub-array into chunks once
it exceeds a safe size. The chunk threshold was tested at
{1000, 2000, 5000, 10000, 20000, 70000}; 10000 had the lowest median
across the wide-list and spec-derived inputs and matches the
threshold micromark-util-chunked already uses for its own splice
helper, which tracks the same V8 constraint.

Pass-1 records insertions in non-decreasing `at` order by
construction (each boundary records its exit insertion at
`lineIndex || index` followed by its enter insertion at `index`, and
the next boundary's tail walk is bounded by the previous boundary's
listItemPrefix), so no sort is needed in the slow path. A dev-only
assertion verifies the invariant on every slow-path call.
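
The dev-only check described above amounts to verifying that the `at` fields never decrease; a hypothetical shape (not the actual assertion in the patch):

```javascript
// Throws if the queued insertions are not in non-decreasing `at` order,
// which the slow-path rebuild relies on instead of sorting.
function assertNonDecreasing(insertions) {
  for (let i = 1; i < insertions.length; i++) {
    if (insertions[i][0] < insertions[i - 1][0]) {
      throw new Error('expected insertions in non-decreasing `at` order')
    }
  }
}
```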

Inputs that benefit, with multi-run median-of-medians vs the baseline
(spread in parentheses):

  10,000 single-level list items     -36.3%  (4.0%)
  5,000 ATX headings                 -22.9%  (5.5%)
  one CommonMark example             -13.2%  (16.0%)
  full CommonMark spec  (~16 KB)     -13.0%  (46.4%, NOISY)
  CommonMark spec * 35  (~564 KB)    -10.7%  (0.8%, very clean)
  256 nested ordered list levels      -4.0%  (6.7%)
  CommonMark spec * 7  (~113 KB)      -3.8%  (12.8%)

Single-run full corpus runs show the same direction on every other
input that contains at least one list, with wins of -17% to -27% on
inputs heavy in fenced code blocks, images, character references,
inline links, tabs, and HTML blocks. The largest improvement is the
10,000-item single-level list input, which is the worst case for the
old per-item splice loop.

Trade-offs and inputs that do not move:

Inputs that contain no lists are unaffected by the change because
prepareList is never invoked. The pure emphasis stress inputs ('a**b'
repeated 10,000 times and similar) reported +13% and +28% on a
single run, but those inputs have a cross-run spread of 44 to 52% on
the baseline alone, so the apparent regressions sit inside their own
noise band. A 1 MB single paragraph, a Unicode-heavy 256 KB input,
and 10,000 unmatched asterisks all moved within +/- 3% of baseline.

Tests pass: dev + prod 1450/1450, 100% coverage maintained.
mdast-util-gfm 54/54, mdast-util-mdx 11/13. The two failing mdx tests
reproduce on upstream/main and are not introduced by this branch.

Closes syntax-tree#49
Refs syntax-tree#50
ChristianMurphy added a commit to ChristianMurphy/mdast-util-from-markdown that referenced this pull request May 3, 2026
prepareList synthesizes listItem enter and exit events one item at a
time using events.splice(at, 0, [event]). Each splice shifts the
suffix of the events array, so a list with K items inside an array of
N events does O(K * N) shift work. This is the dominant cost in
mdast-util-from-markdown's contribution to wide-list inputs and the
slowdown reported at depth on issue syntax-tree#49 / PR syntax-tree#50.

The fix collects the would-be splices into an insertions queue during
the existing walk and applies them outside the loop in one pass. Two
paths handle the work efficiently: lists with up to a small number of
insertions take a fast path that splices each insertion in reverse
order so unspliced positions stay valid, which avoids the cost of
allocating a fresh sub-array; longer lists go through a batched
rebuild that writes the replacement into a fresh array, then either
splices the whole replacement in one call (when it fits below V8's
spread argument limit) or shifts the suffix once and writes the
replacement into the vacated range. The in-place shift avoids the
per-chunk splice loop a chunked spread fallback would use, which
would re-introduce O(K * N) shift cost on very wide lists.
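
The in-place shift described above can be sketched as follows (the name `writeReplacement` is hypothetical; `replacement` is assumed to already hold the merged events for the range `[start, oldEnd)`, and, since the merge is insert-only, the replacement is assumed to be at least as long as the range it replaces):

```javascript
// One length change, one backward suffix copy, one write:
// O(suffix + replacement.length) total, independent of how many
// insertions were queued.
function writeReplacement(events, start, oldEnd, replacement) {
  const suffixLength = events.length - oldEnd
  const grow = replacement.length - (oldEnd - start)
  // Insert-only merge: grow >= 0, so resizing up front loses nothing.
  events.length += grow
  // Copy the suffix backward so it never overwrites unread slots.
  for (let i = suffixLength - 1; i >= 0; i--) {
    events[start + replacement.length + i] = events[oldEnd + i]
  }
  for (let i = 0; i < replacement.length; i++) {
    events[start + i] = replacement[i]
  }
}
```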

How the cut points were chosen:

There are two thresholds in the new code: SMALL_LIST_LIMIT chooses
between fast-path splice loop and rebuild; SAFE_SPREAD chooses
between a single spread and an in-place shift.

SMALL_LIST_LIMIT is a workload-dependent crossover. The fast path
costs O(K * suffix) because each of the K splices shifts the events
suffix; the rebuild path costs O(N + K) plus a fixed allocation
overhead. Below some K the rebuild's constant overhead dominates;
above some K the fast path's K * suffix dominates. Because suffix
size and per-insertion splice cost both vary with document shape,
no single value is universally best: deeper nesting prefers higher
limits (everything stays on the splice loop), and documents with a
few moderately-sized lists prefer lower limits (the rebuild's lower
per-item cost wins). The threshold was chosen by sweeping
{0, 2, 4, 8, 16, 32, 64} with representative inputs from both
regimes and picking the value that kept the validated multi-run wins
on typical-document inputs without regressing the deep-nest case
beyond its own noise band.

SAFE_SPREAD is set by V8's argument count limit. Spreading a very
large array into events.splice can throw a stack overflow in some
V8 versions, so above the threshold the rebuild instead resizes
events once, shifts the suffix to its target position, and writes
the replacement into the vacated range. Total work is O(suffix +
replacement.length), independent of how many insertions were
queued. The single-spread threshold was tested at
{1000, 2000, 5000, 10000, 20000, 70000}; 10000 had the lowest median
across the wide-list and spec-derived inputs and matches the
threshold micromark-util-chunked already uses for its own splice
helper.

Pass-1 records insertions in non-decreasing `at` order by
construction (each boundary records its exit insertion at
`lineIndex || index` followed by its enter insertion at `index`,
and the next boundary's tail walk is bounded by the previous
boundary's listItemPrefix), so no sort is needed in the slow path.
A dev-only assertion verifies the invariant on every slow-path
call.

Inputs that benefit, with multi-run median-of-medians vs the baseline
(spread in parentheses):

  10,000 single-level list items     -40.8%  (5.6%)
  5,000 ATX headings                 -20.6%  (4.3%)
  CommonMark spec * 35  (~564 KB)    -11.7%  (1.6%, very clean)
  one CommonMark example              -9.1%  (105%, NOISY)
  256 nested ordered list levels      -8.8%  (20.3%)
  full CommonMark spec  (~16 KB)      -7.0%  (35.4%, NOISY)
  CommonMark spec * 7  (~113 KB)      -3.6%  (1.0%, very clean)

Single-run full corpus runs show the same direction on every other
input that contains at least one list, with wins of -17% to -27% on
inputs heavy in fenced code blocks, images, character references,
inline links, tabs, and HTML blocks. The largest improvement is the
10,000-item single-level list input, which is the worst case for the
old per-item splice loop.

Trade-offs and inputs that do not move:

Inputs that contain no lists are unaffected by the change because
prepareList is never invoked. The pure emphasis stress inputs ('a**b'
repeated 10,000 times and similar) reported large deltas in single
runs, but those inputs have a cross-run spread of 44 to 52% on the
baseline alone, so the apparent regressions sit inside their own
noise band. A 1 MB single paragraph, a Unicode-heavy 256 KB input,
and 10,000 unmatched asterisks all moved within +/- 3% of baseline.

Tests pass: dev + prod 1452/1452, 100% coverage maintained.
mdast-util-gfm 54/54, mdast-util-mdx 11/13. The two failing mdx tests
reproduce on upstream/main and are not introduced by this branch.

Closes syntax-tree#49
Refs syntax-tree#50
ChristianMurphy added a commit to ChristianMurphy/mdast-util-from-markdown that referenced this pull request May 3, 2026
prepareList synthesizes listItem enter and exit events one item at a
time using events.splice(at, 0, [event]). Each splice shifts the
suffix of the events array, so a list with K items inside an array of
N events does O(K * N) shift work. This is the dominant cost in
mdast-util-from-markdown's contribution to wide-list inputs and the
slowdown reported at depth on issue syntax-tree#49 / PR syntax-tree#50.

The fix collects the would-be splices into an insertions queue during
the existing walk and applies them outside the loop in one pass. Two
paths handle the work efficiently: lists with up to a small number of
insertions take a fast path that splices each insertion in reverse
order so unspliced positions stay valid, which avoids the cost of
allocating a fresh sub-array; longer lists go through a batched
rebuild that writes the replacement into a fresh array, then either
splices the whole replacement in one call (when it fits below V8's
spread argument limit) or shifts the suffix once and writes the
replacement into the vacated range. The in-place shift avoids the
per-chunk splice loop a chunked spread fallback would use, which
would re-introduce O(K * N) shift cost on very wide lists.

How the cut points were chosen:

There are two thresholds in the new code: SMALL_LIST_LIMIT chooses
between fast-path splice loop and rebuild; SAFE_SPREAD chooses
between a single spread and an in-place shift.

SMALL_LIST_LIMIT is a workload-dependent crossover. The fast path
costs O(K * suffix) because each of the K splices shifts the events
suffix; the rebuild path costs O(N + K) plus a fixed allocation
overhead. Below some K the rebuild's constant overhead dominates;
above some K the fast path's K * suffix dominates. Because suffix
size and per-insertion splice cost both vary with document shape,
no single value is universally best: deeper nesting prefers higher
limits (everything stays on the splice loop), and documents with a
few moderately-sized lists prefer lower limits (the rebuild's lower
per-item cost wins). The threshold was chosen by sweeping
{0, 2, 4, 8, 16, 32, 64} with representative inputs from both
regimes and picking the value that kept the validated multi-run wins
on typical-document inputs without regressing the deep-nest case
beyond its own noise band.

SAFE_SPREAD is set by V8's argument count limit. Spreading a very
large array into events.splice can throw a stack overflow in some
V8 versions, so above the threshold the rebuild instead resizes
events once, shifts the suffix to its target position, and writes
the replacement into the vacated range. Total work is O(suffix +
replacement.length), independent of how many insertions were
queued. The single-spread threshold was tested at
{1000, 2000, 5000, 10000, 20000, 70000}; 10000 had the lowest median
across the wide-list and spec-derived inputs and matches the
threshold micromark-util-chunked already uses for its own splice
helper.

Pass-1 records insertions in non-decreasing `at` order by
construction (each boundary records its exit insertion at
`lineIndex || index` followed by its enter insertion at `index`,
and the next boundary's tail walk is bounded by the previous
boundary's listItemPrefix), so no sort is needed in the slow path.
A dev-only assertion verifies the invariant on every slow-path
call.

Inputs that benefit, with multi-run median-of-medians vs the baseline
(spread in parentheses):

  10,000 single-level list items     -40.8%  (5.6%)
  5,000 ATX headings                 -20.6%  (4.3%)
  CommonMark spec * 35  (~564 KB)    -11.7%  (1.6%, very clean)
  one CommonMark example              -9.1%  (105%, NOISY)
  256 nested ordered list levels      -8.8%  (20.3%)
  full CommonMark spec  (~16 KB)      -7.0%  (35.4%, NOISY)
  CommonMark spec * 7  (~113 KB)      -3.6%  (1.0%, very clean)

Single-run full corpus runs show the same direction on every other
input that contains at least one list, with wins of -17% to -27% on
inputs heavy in fenced code blocks, images, character references,
inline links, tabs, and HTML blocks. The largest improvement is the
10,000-item single-level list input, which is the worst case for the
old per-item splice loop.

Trade-offs and inputs that do not move:

Inputs that contain no lists are unaffected by the change because
prepareList is never invoked. The pure emphasis stress inputs ('a**b'
repeated 10,000 times and similar) reported large deltas in single
runs, but those inputs have a cross-run spread of 44 to 52% on the
baseline alone, so the apparent regressions sit inside their own
noise band. A 1 MB single paragraph, a Unicode-heavy 256 KB input,
and 10,000 unmatched asterisks all moved within +/- 3% of baseline.

Tests pass: dev + prod 1454/1454, 100% coverage maintained.
Three new tests cover the rebuild path: a wide-list parse-and-line
spot-check, plus first-item deepEqual against a 4-item fast-path
reference for both tight and loose lists (so a bug confined to the
rebuild branch diverges from the fast path the reference uses).
mdast-util-gfm 54/54, mdast-util-mdx 11/13. The two failing mdx
tests reproduce on upstream/main and are not introduced by this
branch.

Closes syntax-tree#49
Refs syntax-tree#50

Labels

🏁 area/perf This affects performance 🤞 phase/open Post is being triaged manually

Development

Successfully merging this pull request may close these issues.

prepareList has O(n^2) complexity from per-item events.splice

5 participants