Fix O(n²) complexity in prepareList#50
Defer events.splice calls and apply them in a single backward merge pass. Also tighten the backward line-ending scan to stop at the list start.

Fixes syntax-tree#49

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Pull request overview
Improves list preprocessing performance by eliminating per-item events.splice() calls in prepareList, reducing complexity from O(n²) to O(n) for large lists.
Changes:
- Defers listItem enter/exit insertions by collecting insertion positions/events during the walk.
- Applies all deferred insertions in a single backward merge pass to avoid repeated array shifting.
- Tightens the backward line-ending scan to stop at the current list's start index.
Murderlon left a comment
Also no comments from Devin review. LGTM
remcohaszing left a comment
I’m all for it if this is indeed more performant and CI passes.
But I would really appreciate a review from either @ChristianMurphy or @wooorm as I believe micromark is more their area of expertise.
Regression tests are interesting here to see the broader impact, running a few scenarios from the CommonMark corpus, plus some larger documents to stress test a bit more.

<claude>
TL;DR

The PR delivers exactly the speedup it advertises on its target case, including a 31.9% wall-clock reduction on a synthetic stand-in for the GitHub Docs GraphQL reference page (442 ms → 301 ms). On large flat lists the speedup is even larger: 38.5% on 10,000-item lists. But the rewrite has measurable cost at smaller scales and on shapes the original code handled better. Nested lists regress 5–15% across every size measured (100, 500, 1,000, 2,500 items), and the all-of-CommonMark-spec concatenation regresses 8.9%. Memory is essentially flat: the deferred-merge approach does not blow up heap (heap delta geomean across real docs is 1.001), and peak RSS differences are at the KB level. The PR is a clear win for the GitHub Docs use case and other large-list scenarios. Whether it is the right trade for the wider corpus depends on how the maintainers weight "rare-but-bad" against "common-and-mildly-slower".

Headline numbers

Geometric mean of the time ratios per input class [table not captured]. Reading: lists are 6.8% faster on average. Pathological and real-docs are statistically flat. The peak-RSS column has heavy noise: most small inputs report 0 KB peak (the parse fits in already-allocated memory), so the geomean is dominated by a handful of larger inputs and is not a reliable signal at this granularity. Heap delta is the cleaner memory metric and shows no movement.

Where the PR wins

These are inputs with baseline time ≥ 10 ms, where measurement noise is small relative to the effect. Sorted by time ratio, best to worst [table not captured]. The shape of the speedup curve matches the algorithmic claim: the ratio approaches 0 as N grows because the original code is O(n²).

Where the PR regresses

Same filter (baseline ≥ 10 ms), sorted by ratio worst-first [table not captured]. The pattern across the four nested-list sizes (100 / 500 / 1,000 / 2,500) is the most actionable signal here. Every size regresses, with the ratio holding roughly steady around 1.05–1.15. That means nested lists are not a small-N artifact: the new code is consistently a few percent slower on this shape across all sizes tested.

Sub-millisecond noise band (the other 85 regressions)

The pass/fail gate also flagged 85 inputs whose baseline time is below 1 ms, almost entirely individual CommonMark spec examples and tiny fuzz seeds. Distribution of all 92 flagged regressions by baseline time bucket [table not captured]. In other words, the gate's noise floor is the sub-millisecond range on this hardware, not the algorithm. A single-digit-percent shift on a 0.2 ms parse is one cache miss. I'd recommend filtering the strict gate to inputs with baseline ≥ 1 ms before treating the count of failures as a quality bar.

Memory profile

Heap delta is the trustworthy memory measurement here; peak RSS is sampled mid-run and is noisier. For inputs large enough to actually grow the heap [table not captured]: the PR is not holding the deferred-insertion arrays as a lasting cost. After GC the resulting parse tree is the same size or smaller; in two of the largest list cases the PR uses less memory than baseline (the deferred-merge version produces less intermediate garbage during parse, which means less max heap usage at GC checkpoints). The "list-class heap geomean = 0.967" headline reflects this. Peak RSS values for the large list and gh-docs inputs are within ±1% of baseline, which is below the noise floor.

Methodology

Runs are interleaved: B P B P … B P (11 of each). Hard ceilings per run: 30 s wall-clock, 1 GiB heap delta. None were hit on this run. I did not use Benchmark.js, mitata, or tinybench: those are tuned for sub-millisecond microbenchmarks where measurement overhead dominates, while this workload is in the millisecond-to-second range where wall-clock noise is the constraint, and none of them sample memory mid-run or do paired-implementation comparison the way this needed. The bench harness, including reproduction commands, lives at [path not captured].

Recommendation

The PR is worth landing for the workload it targets. The 32% saving on the GitHub Docs page shape (issue #49) is the marquee result, and the 38% saving on 10k-item lists confirms the asymptotic claim. Before landing, I'd want either an answer or a measurement on three things [list not captured]. A small follow-up: only invoke the new merge-pass branch when [condition not captured].
</claude>
Problem: prepareList synthesises listItem enter/exit events one at a time via events.splice(at, 0, [event]), and each splice shifts the suffix of the events array. For a list with K items inside an array of N events that is O(K * N) shift work per list, the dominant cost in mdast-util-from-markdown's contribution to wide-list inputs and the slowdown observed at depth on issue syntax-tree#49 / PR syntax-tree#50.

Goal: collapse the per-item splices into a single batched rewrite of the list's event range, while preserving the existing tail-walk semantics that determine where each listItem's exit should be inserted.

Changes:
- dev/lib/index.js: replace the inline events.splice calls in prepareList with an insertions[] queue collected during the walk. After the loop, apply the queued insertions in one pass:
  - small lists (<= 8 insertions, the common case, including deeply nested lists where many tiny ranges would otherwise pay rebuild overhead) splice each insertion in reverse order so unspliced positions stay valid;
  - wide lists go through a batched newSub rebuild and use a chunked spread to avoid V8's argument-count limit when newSub > 5000.

Inputs that benefit (multi-run median-of-medians vs baseline; spread in parentheses):
- p-wide-list (10000 single-level items): -38.0% (7.9%)
- p-many-headings: -18.3% (3.8%)
- xs (one CommonMark example): -8.3% (7.5%)
- l (~564 KB CommonMark spec * 35): -8.0% (2.6%)
- s (full CommonMark spec): -7.2% (11.3%)
- m (CommonMark spec * 7): -2.5% (3.2%)
- p-deep-list (256 nested levels): -2.4%, but spread is 46.5% on this stack; treat as inside its own noise band.

Single-run full corpus shows the same direction on every other input that contains at least one list (p-many-fenced-code -25.6%, p-many-images -23.2%, p-many-char-refs -23.9%, p-many-links -26.5%, p-tab-heavy -23.2%, p-html-blocks -17.1%, and so on).

Trade-offs / inputs that do not contain lists:
- legacy-strong / legacy-strong-emph (raw 'a**b' x 1e4 emphasis patterns) reported +13% / +28% on a single run, but their input contains zero lists, so prepareList is never invoked. Cross-run spread on these scenarios is 44-52% on the baseline alone; the +/- numbers here are noise inside that band.
- p-long-para, p-unicode-heavy, p-mismatched-emph: also list-free; all three move within +/-3% of baseline (noise).

Tests: dev + prod 1448/1448. mdast-util-gfm 54/54. mdast-util-mdx 11/13; the two failing tests reproduce on upstream/main and are not introduced by this branch.

Closes syntax-tree#49
Refs syntax-tree#50
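The small-list fast path above can be sketched as follows. This is a simplified stand-in, not the PR's code: `applyDeferredInsertions` and the `{index, event}` queue shape are illustrative, assuming the queue is collected in ascending index order during the walk.

```javascript
// Sketch of the small-list fast path: insertions queued during the walk
// are applied in reverse, so each splice only shifts elements that no
// remaining (lower-index) insertion refers to — no index fix-up needed.
function applyDeferredInsertions(events, insertions) {
  for (let i = insertions.length - 1; i >= 0; i--) {
    const {index, event} = insertions[i]
    // `index` is relative to the ORIGINAL events array; applying from the
    // highest index down keeps every earlier index valid.
    events.splice(index, 0, event)
  }
  return events
}
```

Each splice still shifts a suffix, so this path only wins while the insertion count is small; past the threshold the batched rebuild takes over.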
Octopus merge of the three independent mdast-util-from-markdown perf branches into one rollup so maintainers can evaluate the cumulative impact on a single bench run. Each underlying branch is also pushed on its own and can land independently.

Branches merged:
- perf/prepare-list-no-splice: batch listItem insertions (Closes syntax-tree#49, Refs syntax-tree#50)
- perf/dispatch-context-reuse: single shared dispatch this-binding
- perf/stable-node-shape: pre-declare position in every node factory

Cumulative impact (multi-run median-of-medians vs the mdast baseline; spread in parentheses):
- p-many-char-refs (10k '&' entities): -43.8% (7.5%)
- p-wide-list (10k single-level items): -42.9% (11.6%, borderline)
- l (~564 KB CommonMark spec * 35): -13.8% (1.0%, very clean)
- s (full CommonMark spec): -12.3% (7.9%)
- m (CommonMark spec * 7): -8.5% (1.0%, very clean)
- legacy-base (40 KB 'xxxx' x 1e4): -4.0% (9.0%)

Single-run full corpus shows wins on 20 of 27 scenarios in the range -1% to -43%. The largest pathological wins are p-many-char-refs -42.2%, p-many-fenced-code -40.7%, p-wide-list -37.0%, p-tab-heavy -35.6%, p-many-headings -28.2%, p-code-spans -24.4%, p-many-links -22.3%, p-many-images -21.5%, p-html-blocks -15.3%, xs -16.4%.

Trade-offs / inputs that do not improve:
- p-long-para: +5.3% multi-run (spread 9.4%) and +11.6% single-run. This input is one giant text node; none of the three changes target the single-text-node path. The +5.3% is on the edge of its noise band but is the largest credible regression of the rollup.
- p-unicode-heavy: +2.3% multi-run (spread 11.4%, noisy), within the scenario's noise band.
- p-mismatched-emph: +0.8% multi-run (7.3%), flat.
- legacy-strong / legacy-strong-emph: +44% / +58% single-run, but these scenarios have a 12-sample mandatory floor on this stack and cross-run spread of 44-53% on baseline alone; the deltas reported here are inside that band. The input is pure 'a**b' x 1e4 with no lists, few node creations, and few mid-event sliceSerialize variants, so none of the three optimisations target what this scenario exercises.

Tests: dev + prod 1448/1448. mdast-util-gfm 54/54. mdast-util-mdx 11/13; the two failing tests reproduce on upstream/main and are not introduced by this branch.
prepareList synthesises listItem enter and exit events one item at a time using events.splice(at, 0, [event]). Each splice shifts the suffix of the events array, so a list with K items inside an array of N events does O(K * N) shift work. This is the dominant cost in mdast-util-from-markdown's contribution to wide-list inputs and the slowdown reported at depth on issue syntax-tree#49 / PR syntax-tree#50.

The fix collects the would-be splices into an insertions queue during the existing walk and applies them outside the loop in one pass. Two paths handle the work efficiently: lists with up to 8 insertions take a fast path that splices each insertion in reverse order so unspliced positions stay valid, which avoids the cost of allocating a fresh sub-array; longer lists go through a batched rebuild and use a chunked spread so the splice never hits V8's argument count limit.

Inputs that benefit, with multi-run median-of-medians vs the baseline (spread in parentheses):
- 10,000 single-level list items: -38.0% (7.9%)
- 5,000 ATX headings: -18.3% (3.8%)
- one CommonMark example: -8.3% (7.5%)
- CommonMark spec * 35 (~564 KB): -8.0% (2.6%)
- full CommonMark spec (~16 KB): -7.2% (11.3%)
- CommonMark spec * 7 (~113 KB): -2.5% (3.2%)
- 256 nested ordered list levels: -2.4% (46.5% spread; treat as flat on this stack)

Single-run full corpus runs show the same direction on every other input that contains at least one list, with wins of -17% to -27% on inputs heavy in fenced code blocks, images, character references, inline links, tabs, and HTML blocks. The largest improvement is the 10,000-item single-level list input, which is the worst case for the old per-item splice loop.

Trade-offs and inputs that do not move: Inputs that contain no lists are unaffected by the change because prepareList is never invoked. The pure emphasis stress inputs ('a**b' repeated 10,000 times and similar) reported +13% and +28% on a single run, but those inputs have a cross-run spread of 44 to 52% on the baseline alone, so the apparent regressions sit inside their own noise band. A 1 MB single paragraph, a Unicode-heavy 256 KB input, and 10,000 unmatched asterisks all moved within +/-3% of baseline.

Tests pass: dev + prod 1448/1448, mdast-util-gfm 54/54, mdast-util-mdx 11/13. The two failing mdx tests reproduce on upstream/main and are not introduced by this branch.

Closes syntax-tree#49
Refs syntax-tree#50
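The chunked-spread idea on the long-list path can be sketched like this. The helper name and chunk constant are illustrative, not the PR's identifiers: spreading a very large array into `splice(...)` passes every element as a function argument, which can exceed V8's argument-count limit, so the insert is done in fixed-size slices.

```javascript
// Sketch of a chunked splice: insert `items` into `target` at `start`
// without ever spreading more than CHUNK arguments into one call.
const CHUNK = 10000 // illustrative; kept well under V8's argument limit

function chunkedInsert(target, start, items) {
  for (let offset = 0; offset < items.length; offset += CHUNK) {
    // Each slice lands right after the previous one, so the final order
    // is identical to target.splice(start, 0, ...items) in one call.
    target.splice(start + offset, 0, ...items.slice(offset, offset + CHUNK))
  }
  return target
}
```

A single `target.splice(start, 0, ...items)` works for small arrays but throws a RangeError once `items` grows past the engine's argument ceiling; chunking trades a few extra splice calls for safety at any size.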
Octopus merge of the three independent perf branches into one rollup so reviewers can evaluate the cumulative impact on a single bench run. Each underlying branch is also pushed on its own and can land independently.

Branches merged:
- perf/prepare-list-no-splice (Closes syntax-tree#49, Refs syntax-tree#50)
- perf/dispatch-context-reuse
- perf/stable-node-shape

Cumulative impact, multi-run median-of-medians vs the baseline (spread in parentheses):
- 10,000 character entity references: -43.8% (7.5%)
- 10,000 single-level list items: -42.9% (11.6%, borderline)
- CommonMark spec * 35 (~564 KB): -13.8% (1.0%, very clean)
- full CommonMark spec (~16 KB): -12.3% (7.9%)
- CommonMark spec * 7 (~113 KB): -8.5% (1.0%, very clean)
- 'xxxx' x 10,000 (~40 KB): -4.0% (9.0%)

Single-run full corpus shows wins on 20 of 27 inputs, ranging from -1% to -43%. The largest pathological wins are inputs heavy in character entity references (-42.2%), fenced code blocks (-40.7%), single-level list items (-37.0%), tabs (-35.6%), ATX headings (-28.2%), backtick code spans (-24.4%), inline links (-22.3%), inline images (-21.5%), HTML blocks (-15.3%), and one CommonMark example (-16.4%).

Trade-offs:
- A 1 MB single paragraph reported +5.3% multi-run with a 9.4% spread, and +11.6% on a single full-corpus run. None of the three changes target the single-text-node path, so the small regression is at the edge of that input's noise band.
- A 256 KB Unicode-heavy input reported +2.3% multi-run inside its 11.4% spread (treat as flat).
- A 10,000-unmatched-asterisk input moved +0.8% multi-run (flat).
- The pure emphasis stress inputs ('a**b' repeated 10,000 times and similar) reported +44% and +58% on a single run, but their cross-run spread is 44 to 53% on the baseline alone. The input shape (almost all attentionSequence events that mostly do not match a handler, no lists, no node-creation hot path) means none of the three optimisations can target what these inputs exercise. Treat the deltas as noise.

Tests pass: dev + prod 1448/1448, mdast-util-gfm 54/54, mdast-util-mdx 11/13. The two failing mdx tests reproduce on upstream/main and are not introduced by this branch.
prepareList synthesises listItem enter and exit events one item at a time using events.splice(at, 0, [event]). Each splice shifts the suffix of the events array, so a list with K items inside an array of N events does O(K * N) shift work. This is the dominant cost in mdast-util-from-markdown's contribution to wide-list inputs and the slowdown reported at depth on issue syntax-tree#49 / PR syntax-tree#50.

The fix collects the would-be splices into an insertions queue during the existing walk and applies them outside the loop in one pass. Two paths handle the work efficiently: lists with up to 8 insertions take a fast path that splices each insertion in reverse order so unspliced positions stay valid, which avoids the cost of allocating a fresh sub-array; longer lists go through a batched rebuild and use a chunked spread so the splice never hits V8's argument count limit.

Both thresholds were swept on real inputs rather than picked by feel. SMALL_LIST_LIMIT was tested at {0, 2, 4, 8, 16, 32, 64}: deeper nesting (a 256-level list) preferred 16 to 64 (around 21% faster than 0), but typical documents (CommonMark spec, spec * 7, spec * 35) preferred lower values because the rebuild path's allocation and sort overhead outweighs the saved splice work when lists are 4 to 12 items each. 8 sits at the balance point and keeps the validated multi-run wins on the typical-document inputs. The chunked-spread threshold was tested at {1000, 2000, 5000, 10000, 20000, 70000}; 10000 was the lowest median across the 10000-item single-level list and both spec-derived inputs, and matches the threshold micromark-util-chunked already uses for its own splice helper.

Inputs that benefit, with multi-run median-of-medians vs the baseline (spread in parentheses):
- 10,000 single-level list items: -38.0% (7.9%)
- 5,000 ATX headings: -18.3% (3.8%)
- one CommonMark example: -8.3% (7.5%)
- CommonMark spec * 35 (~564 KB): -8.0% (2.6%)
- full CommonMark spec (~16 KB): -7.2% (11.3%)
- CommonMark spec * 7 (~113 KB): -2.5% (3.2%)
- 256 nested ordered list levels: -2.4% (46.5% spread; treat as flat on this stack)

Single-run full corpus runs show the same direction on every other input that contains at least one list, with wins of -17% to -27% on inputs heavy in fenced code blocks, images, character references, inline links, tabs, and HTML blocks. The largest improvement is the 10,000-item single-level list input, which is the worst case for the old per-item splice loop.

Trade-offs and inputs that do not move: Inputs that contain no lists are unaffected by the change because prepareList is never invoked. The pure emphasis stress inputs ('a**b' repeated 10,000 times and similar) reported +13% and +28% on a single run, but those inputs have a cross-run spread of 44 to 52% on the baseline alone, so the apparent regressions sit inside their own noise band. A 1 MB single paragraph, a Unicode-heavy 256 KB input, and 10,000 unmatched asterisks all moved within +/-3% of baseline.

Tests pass: dev + prod 1448/1448, mdast-util-gfm 54/54, mdast-util-mdx 11/13. The two failing mdx tests reproduce on upstream/main and are not introduced by this branch.

Closes syntax-tree#49
Refs syntax-tree#50
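The threshold sweep described above can be sketched as a small harness. Everything here is illustrative: `benchWithLimit` is a hypothetical hook that runs the parser with a given SMALL_LIST_LIMIT and returns a median time in ms; only the candidate set matches the values actually swept.

```javascript
// Sketch of the SMALL_LIST_LIMIT sweep: time each candidate over a set of
// representative inputs and keep the one with the lowest combined median.
function sweepThreshold(benchWithLimit, inputs, candidates = [0, 2, 4, 8, 16, 32, 64]) {
  let best = {limit: candidates[0], total: Infinity}
  for (const limit of candidates) {
    // Sum medians across inputs so no single document shape dominates
    // the choice; a geomean would be another reasonable aggregate.
    let total = 0
    for (const input of inputs) total += benchWithLimit(limit, input)
    if (total < best.total) best = {limit, total}
  }
  return best.limit
}
```

The key point from the commit message survives in the harness design: because deep nesting and typical documents pull the optimum in opposite directions, the input set must include both regimes or the sweep just overfits one shape.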
prepareList synthesizes listItem enter and exit events one item at a time using events.splice(at, 0, [event]). Each splice shifts the suffix of the events array, so a list with K items inside an array of N events does O(K * N) shift work. This is the dominant cost in mdast-util-from-markdown's contribution to wide-list inputs and the slowdown reported at depth on issue syntax-tree#49 / PR syntax-tree#50. The fix collects the would-be splices into an insertions queue during the existing walk and applies them outside the loop in one pass. Two paths handle the work efficiently: lists with up to a small number of insertions take a fast path that splices each insertion in reverse order so unsplice'd positions stay valid, which avoids the cost of allocating a fresh sub-array; longer lists go through a batched rebuild and use a chunked spread so the splice never hits V8's argument count limit. How the cut points were chosen: There are two thresholds in the new code: SMALL_LIST_LIMIT chooses between fast-path splice loop and rebuild; SAFE_SPREAD chooses between a single spread and a chunked spread. SMALL_LIST_LIMIT is a workload-dependent crossover. The fast path costs O(K * suffix) because each of the K splices shifts the events suffix; the rebuild path costs O(N + K) plus a fixed allocation and sort overhead. Below some K the rebuild's constant overhead dominates; above some K the fast path's K * suffix dominates. Because suffix size and per-insertion splice cost both vary with document shape, no single value is universally best: deeper nesting prefers higher limits (everything stays on the splice loop), and documents with a few moderately-sized lists prefer lower limits (the rebuild's lower per-item cost wins). The threshold was chosen by sweeping {0, 2, 4, 8, 16, 32, 64} with representative inputs from both regimes and picking the value that kept the validated multi-run wins on typical-document inputs without regressing the deep-nest case beyond its own noise band. 
SAFE_SPREAD is set by V8's argument count limit. Spreading a very large array into events.splice can throw a stack overflow in some V8 versions, so the rebuild splits the new sub-array into chunks once it exceeds a safe size. The chunk threshold was tested at {1000, 2000, 5000, 10000, 20000, 70000}; 10000 had the lowest median across the wide-list and spec-derived inputs and matches the threshold micromark-util-chunked already uses for its own splice helper, which tracks the same V8 constraint.

Inputs that benefit, with multi-run median-of-medians vs the baseline (spread in parentheses):

- 10,000 single-level list items: -38.0% (7.9%)
- 5,000 ATX headings: -18.3% (3.8%)
- one CommonMark example: -8.3% (7.5%)
- CommonMark spec * 35 (~564 KB): -8.0% (2.6%)
- full CommonMark spec (~16 KB): -7.2% (11.3%)
- CommonMark spec * 7 (~113 KB): -2.5% (3.2%)
- 256 nested ordered list levels: -2.4% (46.5% spread, treat as flat on this stack)

Single-run full corpus runs show the same direction on every other input that contains at least one list, with wins of -17% to -27% on inputs heavy in fenced code blocks, images, character references, inline links, tabs, and HTML blocks. The largest improvement is the 10,000-item single-level list input, which is the worst case for the old per-item splice loop.

Trade-offs and inputs that do not move: inputs that contain no lists are unaffected because prepareList is never invoked. The pure emphasis stress inputs ('a**b' repeated 10,000 times and similar) reported +13% and +28% on a single run, but those inputs have a cross-run spread of 44 to 52% on the baseline alone, so the apparent regressions sit inside their own noise band. A 1 MB single paragraph, a Unicode-heavy 256 KB input, and 10,000 unmatched asterisks all moved within +/- 3% of baseline.

Tests pass: dev + prod 1448/1448, mdast-util-gfm 54/54, mdast-util-mdx 11/13. The two failing mdx tests reproduce on upstream/main and are not introduced by this branch.
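The chunked spread can be sketched like this; the constant and the helper name are mine, chosen only to illustrate the constraint the comment describes:

```javascript
// Hypothetical sketch of the chunked insert: events.splice(at, 0, ...items)
// passes every element of `items` as a separate call argument, which can
// overflow the stack for very large arrays in some V8 versions. Splitting
// the insert into bounded chunks caps the argument count per call.
const SAFE_SPREAD = 10000

function chunkedInsert(events, at, items) {
  let start = 0

  while (start < items.length) {
    const chunk = items.slice(start, start + SAFE_SPREAD)
    // Offset by `start` so each chunk lands right after the previous one.
    events.splice(at + start, 0, ...chunk)
    start += chunk.length
  }

  return events
}
```

Note this caps argument counts rather than shift work: each chunk's splice still shifts the suffix once per chunk.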
Closes syntax-tree#49
Refs syntax-tree#50
Octopus merge of the three independent perf branches into one rollup so reviewers can evaluate the cumulative impact on a single bench run. Each underlying branch is also pushed on its own and can land independently.

Branches merged:

- perf/prepare-list-no-splice (Closes syntax-tree#49, Refs syntax-tree#50)
- perf/dispatch-context-reuse
- perf/stable-node-shape

Cumulative impact, multi-run median-of-medians vs the baseline (spread in parentheses):

- 10,000 character entity references: -43.8% (7.5%)
- 10,000 single-level list items: -42.9% (11.6%, borderline)
- CommonMark spec * 35 (~564 KB): -13.8% (1.0%, very clean)
- full CommonMark spec (~16 KB): -12.3% (7.9%)
- CommonMark spec * 7 (~113 KB): -8.5% (1.0%, very clean)
- 'xxxx' x 10,000 (~40 KB): -4.0% (9.0%)

A single-run full corpus shows wins on 20 of 27 inputs, ranging from -1% to -43%. The largest pathological wins are inputs heavy in character entity references (-42.2%), fenced code blocks (-40.7%), single-level list items (-37.0%), tabs (-35.6%), ATX headings (-28.2%), backtick code spans (-24.4%), inline links (-22.3%), inline images (-21.5%), HTML blocks (-15.3%), and one CommonMark example (-16.4%).

Trade-offs: a 1 MB single paragraph reported +5.3% multi-run with a 9.4% spread, and +11.6% on a single full-corpus run. None of the three changes target the single-text-node path, so the small regression is the edge of that input's noise band. A 256 KB Unicode-heavy input reported +2.3% multi-run inside its 11.4% spread (treat as flat). A 10,000-unmatched-asterisk input moved +0.8% multi-run (flat). The pure emphasis stress inputs ('a**b' repeated 10,000 times and similar) reported +44% and +58% on a single run, but their cross-run spread is 44 to 53% on the baseline alone, and the input shape (almost all attentionSequence events that mostly do not match a handler, no lists, no node-creation hot path) means none of the three optimizations can target what these inputs exercise. Treat the deltas as noise.
I opened a follow-up at #51 that addresses the small-list and nested-list slowdown while keeping the speedup for large lists.
prepareList synthesizes listItem enter and exit events one item at a time using events.splice(at, 0, [event]). Each splice shifts the suffix of the events array, so a list with K items inside an array of N events does O(K * N) shift work. This is the dominant cost mdast-util-from-markdown contributes on wide-list inputs, and the slowdown reported at depth on issue syntax-tree#49 / PR syntax-tree#50.

The fix collects the would-be splices into an insertions queue during the existing walk and applies them outside the loop in one pass. Two paths handle the work efficiently: lists with up to a small number of insertions take a fast path that splices each insertion in reverse order so unspliced positions stay valid, avoiding the cost of allocating a fresh sub-array; longer lists go through a batched rebuild and use a chunked spread so the splice never hits V8's argument count limit.

How the cut points were chosen: there are two thresholds in the new code. SMALL_LIST_LIMIT chooses between the fast-path splice loop and the rebuild; SAFE_SPREAD chooses between a single spread and a chunked spread.

SMALL_LIST_LIMIT is a workload-dependent crossover. The fast path costs O(K * suffix) because each of the K splices shifts the events suffix; the rebuild path costs O(N + K) plus a fixed allocation overhead. Below some K the rebuild's constant overhead dominates; above it, the fast path's K * suffix dominates. Because suffix size and per-insertion splice cost both vary with document shape, no single value is universally best: deeper nesting prefers higher limits (everything stays on the splice loop), while documents with a few moderately sized lists prefer lower limits (the rebuild's lower per-item cost wins). The threshold was chosen by sweeping {0, 2, 4, 8, 16, 32, 64} over representative inputs from both regimes and picking the value that kept the validated multi-run wins on typical-document inputs without regressing the deep-nest case beyond its own noise band.
SAFE_SPREAD is set by V8's argument count limit. Spreading a very large array into events.splice can throw a stack overflow in some V8 versions, so the rebuild splits the new sub-array into chunks once it exceeds a safe size. The chunk threshold was tested at {1000, 2000, 5000, 10000, 20000, 70000}; 10000 had the lowest median across the wide-list and spec-derived inputs and matches the threshold micromark-util-chunked already uses for its own splice helper, which tracks the same V8 constraint.

Pass 1 records insertions in non-decreasing `at` order by construction (each boundary records its exit insertion at `lineIndex || index` followed by its enter insertion at `index`, and the next boundary's tail walk is bounded by the previous boundary's listItemPrefix), so no sort is needed in the slow path. A dev-only assertion verifies the invariant on every slow-path call.

Inputs that benefit, with multi-run median-of-medians vs the baseline (spread in parentheses):

- 10,000 single-level list items: -36.3% (4.0%)
- 5,000 ATX headings: -22.9% (5.5%)
- one CommonMark example: -13.2% (16.0%)
- full CommonMark spec (~16 KB): -13.0% (46.4%, noisy)
- CommonMark spec * 35 (~564 KB): -10.7% (0.8%, very clean)
- 256 nested ordered list levels: -4.0% (6.7%)
- CommonMark spec * 7 (~113 KB): -3.8% (12.8%)

Single-run full corpus runs show the same direction on every other input that contains at least one list, with wins of -17% to -27% on inputs heavy in fenced code blocks, images, character references, inline links, tabs, and HTML blocks. The largest improvement is the 10,000-item single-level list input, which is the worst case for the old per-item splice loop.

Trade-offs and inputs that do not move: inputs that contain no lists are unaffected because prepareList is never invoked.
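Because pass 1 records insertions already sorted by position, the slow path reduces to one forward merge. A minimal sketch under my own names (`applyInsertionsRebuild`, `merged`), not the PR's actual code:

```javascript
// Hypothetical sketch of the no-sort slow path: with `insertions` in
// non-decreasing `at` order, one forward merge builds the rewritten
// region in O(N + K) with no sorting.
function applyInsertionsRebuild(events, insertions) {
  if (insertions.length === 0) return events

  const from = insertions[0][0]
  const merged = []
  let index = from

  for (const [at, event] of insertions) {
    // Copy the untouched run before this insertion point, then the event.
    while (index < at) merged.push(events[index++])
    merged.push(event)
  }

  // Copy the remaining tail.
  while (index < events.length) merged.push(events[index++])

  // Replace the tail in one operation. (The real code would need to chunk
  // this spread when `merged` is large, for the V8 limit discussed above.)
  events.length = from
  events.push(...merged)
  return events
}
```

Two insertions at the same index (an exit immediately followed by an enter) also work: the inner `while` simply copies nothing between them.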
The pure emphasis stress inputs ('a**b' repeated 10,000 times and similar) reported +13% and +28% on a single run, but those inputs have a cross-run spread of 44 to 52% on the baseline alone, so the apparent regressions sit inside their own noise band. A 1 MB single paragraph, a Unicode-heavy 256 KB input, and 10,000 unmatched asterisks all moved within +/- 3% of baseline.

Tests pass: dev + prod 1450/1450, 100% coverage maintained. mdast-util-gfm 54/54, mdast-util-mdx 11/13. The two failing mdx tests reproduce on upstream/main and are not introduced by this branch.

Closes syntax-tree#49
Refs syntax-tree#50
prepareList synthesizes listItem enter and exit events one item at a time using events.splice(at, 0, [event]). Each splice shifts the suffix of the events array, so a list with K items inside an array of N events does O(K * N) shift work. This is the dominant cost mdast-util-from-markdown contributes on wide-list inputs, and the slowdown reported at depth on issue syntax-tree#49 / PR syntax-tree#50.

The fix collects the would-be splices into an insertions queue during the existing walk and applies them outside the loop in one pass. Two paths handle the work efficiently. Lists with up to a small number of insertions take a fast path that splices each insertion in reverse order so unspliced positions stay valid, avoiding the cost of allocating a fresh sub-array. Longer lists go through a batched rebuild that writes the replacement into a fresh array, then either splices the whole replacement in one call (when it fits below V8's spread argument limit) or shifts the suffix once and writes the replacement into the vacated range. The in-place shift avoids the per-chunk splice loop a chunked-spread fallback would use, which would re-introduce O(K * N) shift cost on very wide lists.

How the cut points were chosen: there are two thresholds in the new code. SMALL_LIST_LIMIT chooses between the fast-path splice loop and the rebuild; SAFE_SPREAD chooses between a single spread and an in-place shift.

SMALL_LIST_LIMIT is a workload-dependent crossover. The fast path costs O(K * suffix) because each of the K splices shifts the events suffix; the rebuild path costs O(N + K) plus a fixed allocation overhead. Below some K the rebuild's constant overhead dominates; above it, the fast path's K * suffix dominates.
Because suffix size and per-insertion splice cost both vary with document shape, no single value is universally best: deeper nesting prefers higher limits (everything stays on the splice loop), and documents with a few moderately-sized lists prefer lower limits (the rebuild's lower per-item cost wins). The threshold was chosen by sweeping {0, 2, 4, 8, 16, 32, 64} with representative inputs from both regimes and picking the value that kept the validated multi-run wins on typical-document inputs without regressing the deep-nest case beyond its own noise band. SAFE_SPREAD is set by V8's argument count limit. Spreading a very large array into events.splice can throw a stack overflow in some V8 versions, so above the threshold the rebuild instead resizes events once, shifts the suffix to its target position, and writes the replacement into the vacated range. Total work is O(suffix + replacement.length), independent of how many insertions were queued. The single-spread threshold was tested at {1000, 2000, 5000, 10000, 20000, 70000}; 10000 had the lowest median across the wide-list and spec-derived inputs and matches the threshold micromark-util-chunked already uses for its own splice helper. Pass-1 records insertions in non-decreasing `at` order by construction (each boundary records its exit insertion at `lineIndex || index` followed by its enter insertion at `index`, and the next boundary's tail walk is bounded by the previous boundary's listItemPrefix), so no sort is needed in the slow path. A dev-only assertion verifies the invariant on every slow-path call. 
Inputs that benefit, with multi-run median-of-medians vs the baseline (spread in parentheses): 10,000 single-level list items -40.8% (5.6%) 5,000 ATX headings -20.6% (4.3%) CommonMark spec * 35 (~564 KB) -11.7% (1.6%, very clean) one CommonMark example -9.1% (105%, NOISY) 256 nested ordered list levels -8.8% (20.3%) full CommonMark spec (~16 KB) -7.0% (35.4%, NOISY) CommonMark spec * 7 (~113 KB) -3.6% (1.0%, very clean) Single-run full corpus runs show the same direction on every other input that contains at least one list, with wins of -17% to -27% on inputs heavy in fenced code blocks, images, character references, inline links, tabs, and HTML blocks. The largest improvement is the 10,000-item single-level list input, which is the worst case for the old per-item splice loop. Trade-offs and inputs that do not move: Inputs that contain no lists are unaffected by the change because prepareList is never invoked. The pure emphasis stress inputs ('a**b' repeated 10,000 times and similar) reported large deltas in single runs, but those inputs have a cross-run spread of 44 to 52% on the baseline alone, so the apparent regressions sit inside their own noise band. A 1 MB single paragraph, a Unicode-heavy 256 KB input, and 10,000 unmatched asterisks all moved within +/- 3% of baseline. Tests pass: dev + prod 1452/1452, 100% coverage maintained. mdast-util-gfm 54/54, mdast-util-mdx 11/13. The two failing mdx tests reproduce on upstream/main and are not introduced by this branch. Closes syntax-tree#49 Refs syntax-tree#50
`prepareList` synthesizes `listItem` enter and exit events one item at a time using `events.splice(at, 0, [event])`. Each splice shifts the suffix of the events array, so a list with K items inside an array of N events does O(K * N) shift work. This is the dominant cost in mdast-util-from-markdown's contribution to wide-list inputs and the slowdown reported at depth on issue syntax-tree#49 / PR syntax-tree#50.

The fix collects the would-be splices into an insertions queue during the existing walk and applies them outside the loop in one pass. Two paths handle the work efficiently:

- Lists with up to a small number of insertions take a fast path that splices each insertion in reverse order so unspliced positions stay valid, which avoids the cost of allocating a fresh sub-array.
- Longer lists go through a batched rebuild that writes the replacement into a fresh array, then either splices the whole replacement in one call (when it fits below V8's spread argument limit) or shifts the suffix once and writes the replacement into the vacated range. The in-place shift avoids the per-chunk splice loop a chunked spread fallback would use, which would reintroduce O(K * N) shift cost on very wide lists.

**How the cut points were chosen.** There are two thresholds in the new code: `SMALL_LIST_LIMIT` chooses between the fast-path splice loop and the rebuild; `SAFE_SPREAD` chooses between a single spread and an in-place shift. `SMALL_LIST_LIMIT` is a workload-dependent crossover: the fast path costs O(K * suffix) because each of the K splices shifts the events suffix, while the rebuild path costs O(N + K) plus a fixed allocation overhead. Below some K the rebuild's constant overhead dominates; above some K the fast path's K * suffix term dominates.
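The two apply paths can be sketched roughly like this. This is a minimal illustration, not the PR's actual code: `applyInsertions` and the `SMALL_LIST_LIMIT` value are assumed names, and the real rebuild additionally guards the spread limit as described below.

```javascript
// Rough sketch of the two-path apply step, assuming `insertions` is a queue
// of [at, event] pairs collected during the walk in non-decreasing `at`
// order. Names and the crossover value are illustrative.
const SMALL_LIST_LIMIT = 8

function applyInsertions(events, insertions) {
  if (insertions.length === 0) return

  if (insertions.length <= SMALL_LIST_LIMIT) {
    // Fast path: splice in reverse order so positions that have not been
    // spliced yet stay valid, and no fresh sub-array is allocated.
    for (let index = insertions.length - 1; index >= 0; index--) {
      events.splice(insertions[index][0], 0, insertions[index][1])
    }
    return
  }

  // Batched rebuild: merge the covered span of `events` with the queued
  // insertions into a fresh `replacement` array, O(N + K) total.
  const firstAt = insertions[0][0]
  const lastAt = insertions[insertions.length - 1][0]
  const replacement = []
  let read = firstAt

  for (const [at, event] of insertions) {
    while (read < at) replacement.push(events[read++])
    replacement.push(event)
  }

  // One splice replaces the whole span; suffix elements shift exactly once.
  events.splice(firstAt, lastAt - firstAt, ...replacement)
}
```

Either path yields the same array as a per-item splice loop; the difference is that the suffix after the last insertion point moves once instead of once per item.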
Because suffix size and per-insertion splice cost both vary with document shape, no single value is universally best: deeper nesting prefers higher limits (everything stays on the splice loop), and documents with a few moderately sized lists prefer lower limits (the rebuild's lower per-item cost wins). The threshold was chosen by sweeping {0, 2, 4, 8, 16, 32, 64} with representative inputs from both regimes and picking the value that kept the validated multi-run wins on typical-document inputs without regressing the deep-nest case beyond its own noise band.

`SAFE_SPREAD` is set by V8's argument count limit. Spreading a very large array into `events.splice` can throw a stack overflow in some V8 versions, so above the threshold the rebuild instead resizes `events` once, shifts the suffix to its target position, and writes the replacement into the vacated range. Total work is O(suffix + replacement.length), independent of how many insertions were queued. The single-spread threshold was tested at {1000, 2000, 5000, 10000, 20000, 70000}; 10000 had the lowest median across the wide-list and spec-derived inputs and matches the threshold micromark-util-chunked already uses for its own splice helper.

Pass 1 records insertions in non-decreasing `at` order by construction (each boundary records its exit insertion at `lineIndex || index` followed by its enter insertion at `index`, and the next boundary's tail walk is bounded by the previous boundary's `listItemPrefix`), so no sort is needed in the slow path. A dev-only assertion verifies the invariant on every slow-path call.
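The spread-versus-shift decision can be sketched as follows. This is a hedged sketch, not the PR's code: `writeReplacement` is an assumed helper name, and `SAFE_SPREAD` mirrors the 10000 threshold discussed above.

```javascript
// Sketch of the spread-vs-shift step. `SAFE_SPREAD` mirrors the 10000
// threshold from the text; the helper name is illustrative.
const SAFE_SPREAD = 10000

function writeReplacement(events, at, removeCount, replacement) {
  if (replacement.length < SAFE_SPREAD) {
    // Few enough arguments to spread safely: one splice call does it all.
    events.splice(at, removeCount, ...replacement)
    return
  }

  // Spreading this many arguments can overflow the stack in some V8
  // versions, so instead: resize once, shift the suffix backward into its
  // final position, then write the replacement into the vacated range.
  // Total work is O(suffix + replacement.length). The backward shift
  // assumes the replacement is at least as long as the removed span,
  // which holds here because the rebuild only ever adds events.
  const oldLength = events.length
  const delta = replacement.length - removeCount
  events.length = oldLength + delta

  for (let index = oldLength - 1; index >= at + removeCount; index--) {
    events[index + delta] = events[index]
  }

  for (let index = 0; index < replacement.length; index++) {
    events[at + index] = replacement[index]
  }
}
```

The suffix moves exactly once in either branch; only the mechanism (engine-level splice vs explicit copy) differs.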
Inputs that benefit, with multi-run median-of-medians vs the baseline (spread in parentheses):

- 10,000 single-level list items: -40.8% (5.6%)
- 5,000 ATX headings: -20.6% (4.3%)
- CommonMark spec × 35 (~564 KB): -11.7% (1.6%, very clean)
- one CommonMark example: -9.1% (105%, NOISY)
- 256 nested ordered list levels: -8.8% (20.3%)
- full CommonMark spec (~16 KB): -7.0% (35.4%, NOISY)
- CommonMark spec × 7 (~113 KB): -3.6% (1.0%, very clean)

Single-run full-corpus runs show the same direction on every other input that contains at least one list, with wins of -17% to -27% on inputs heavy in fenced code blocks, images, character references, inline links, tabs, and HTML blocks. The largest improvement is the 10,000-item single-level list input, which is the worst case for the old per-item splice loop.

**Trade-offs and inputs that do not move.** Inputs that contain no lists are unaffected by the change because `prepareList` is never invoked. The pure emphasis stress inputs (`a**b` repeated 10,000 times and similar) reported large deltas in single runs, but those inputs have a cross-run spread of 44 to 52% on the baseline alone, so the apparent regressions sit inside their own noise band. A 1 MB single paragraph, a Unicode-heavy 256 KB input, and 10,000 unmatched asterisks all moved within ±3% of baseline.

Tests pass: dev + prod 1454/1454, 100% coverage maintained. Three new tests cover the rebuild path: a wide-list parse-and-line spot-check, plus first-item deepEqual against a 4-item fast-path reference for both tight and loose lists (so a bug confined to the rebuild branch diverges from the fast path the reference uses). mdast-util-gfm 54/54, mdast-util-mdx 11/13. The two failing mdx tests reproduce on upstream/main and are not introduced by this branch.

Closes syntax-tree#49
Refs syntax-tree#50
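The cross-path tests boil down to an oracle idea: whatever the rebuild branch produces must agree exactly with the fast-path splice loop. The real tests compare mdast trees produced through `fromMarkdown`; the standalone reduction below uses plain arrays instead, and `applyDeferred` plus the forced `limit` values are illustrative, not the PR's code.

```javascript
// Standalone reduction of the cross-path oracle: force each path on the
// same input and require identical output. Names are illustrative.
function applyDeferred(events, insertions, limit) {
  if (insertions.length <= limit) {
    // Fast path: reverse splice loop.
    for (let index = insertions.length - 1; index >= 0; index--) {
      events.splice(insertions[index][0], 0, insertions[index][1])
    }
    return
  }
  // Rebuild path: merge the covered span with the insertions, then
  // replace that span in one call.
  const firstAt = insertions[0][0]
  const lastAt = insertions[insertions.length - 1][0]
  const replacement = []
  let read = firstAt
  for (const [at, event] of insertions) {
    while (read < at) replacement.push(events[read++])
    replacement.push(event)
  }
  events.splice(firstAt, lastAt - firstAt, ...replacement)
}

// Randomized agreement check: a bug confined to the rebuild branch makes
// the two results diverge.
for (let round = 0; round < 200; round++) {
  const base = Array.from({length: 40}, (_, i) => 'e' + i)
  const insertions = Array.from({length: 12}, (_, i) => [
    Math.floor(Math.random() * 41),
    'x' + i
  ]).sort((a, b) => a[0] - b[0]) // pass 1 yields non-decreasing `at`
  const viaFast = base.slice()
  const viaRebuild = base.slice()
  applyDeferred(viaFast, insertions, Infinity) // always fast path
  applyDeferred(viaRebuild, insertions, -1) // always rebuild
  if (JSON.stringify(viaFast) !== JSON.stringify(viaRebuild)) {
    throw new Error('rebuild diverged from fast path')
  }
}
```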
Initial checklist
Description of changes
`prepareList` calls `events.splice()` twice per list item, making it O(n²). This defers insertions into arrays and applies them in a single backward merge pass, making it O(n). Also tightens the backward line-ending scan to stop at `start` instead of 0.

Fixes #49