perf(compile): pre-declare position in node factories by ChristianMurphy · Pull Request #53 · syntax-tree/mdast-util-from-markdown

ChristianMurphy · 2026-05-03T13:54:26Z

Initial checklist

I read the support docs
I read the contributing guide
I agree to follow the code of conduct
I searched issues and discussions and couldn’t find anything or linked relevant results below
I made sure the docs are up to date
I included tests (or that’s not needed)

Description of changes

Each node factory returned an object literal that did not include a position property. enter() then patched the property in. V8 treated the property addition as a hidden-class transition for every node, so property reads in enter, exit, the compile loop, and any consumer walked a polymorphic shape instead of a stable one.

The fix declares 'position: undefined' as the trailing property on every node factory and on the root tree literal. enter() still mutates position to a real value, but the property already exists, so no hidden-class transition fires. V8 keeps a single hidden class per node type from construction onward.

Inputs that benefit, with multi-run median-of-medians vs the baseline (spread in parentheses):

10,000 character entity references -27.7% (1.2%, very clean)
1,000 fenced code blocks -29.9% (13.6%, borderline)
CommonMark spec * 35 (~564 KB) -9.3% (6.9%)
one CommonMark example -7.5% (9.5%)
full CommonMark spec (~16 KB) -6.3% (8.3%)
CommonMark spec * 7 (~113 KB) -2.2% (7.7%)

Single-run full corpus shows the same direction across every input that creates a non-trivial number of nodes: a 1,000-image input -25.9%, a 1,000-link input -24.9%, a 5,000-tab-heavy input -22.4%, a 5,000-HTML-block input -11.6%, a 10,000-list-item input -10.3%, a 5,000-heading input -9.4%, a 10,000-code-span input -8.8%, a 1,000-emphasis-chain input -4.7%, the 'xxxx' x 10,000 input -6.5%, a 1,000-autolink input -5.6%.

Trade-offs:

In exchange for stable hidden classes, every node now carries a 'position: undefined' field even when position would never have been added. In practice every node in this codebase gets a real position assigned by enter(), so the only cost is one undefined slot for the brief window between factory and enter() call.

Inputs where the gain is small or absent:

A 1 MB single paragraph reported +2.9% on a single run. That input produces one giant text node whose hidden-class shape was already monomorphic, so the change has no benefit there and the small move sits inside the input's noise band. A 5,000-reference-link-definition input and a 10,000-unmatched-asterisks input each moved +/- 2% on a single run, also inside their respective noise bands. Multi-run showed legacy-defs at -0.0% (clean, no movement).

The pure emphasis stress inputs ('a**b' repeated 10,000 times and similar) reported +17% to +29% on single runs but their cross-run spread is 12 to 50% on this stack alone, and these inputs do not exercise the node-creation hot path that this change targets, so the deltas are noise.

Tests pass: dev + prod 1448/1448, mdast-util-gfm 54/54, mdast-util-mdx 11/13. The two failing mdx tests reproduce on upstream/main and are not introduced by this branch.

Each node factory returned an object literal that did not include a position property. enter() then patched the property in. V8 treated the property addition as a hidden-class transition for every node, so property reads in enter, exit, the compile loop, and any consumer walked a polymorphic shape instead of a stable one. The fix declares 'position: undefined' as the trailing property on every node factory and on the root tree literal. enter() still mutates position to a real value, but the property already exists, so no hidden-class transition fires. V8 keeps a single hidden class per node type from construction onward. Inputs that benefit, with multi-run median-of-medians vs the baseline (spread in parentheses): 10,000 character entity references -27.7% (1.2%, very clean) 1,000 fenced code blocks -29.9% (13.6%, borderline) CommonMark spec * 35 (~564 KB) -9.3% (6.9%) one CommonMark example -7.5% (9.5%) full CommonMark spec (~16 KB) -6.3% (8.3%) CommonMark spec * 7 (~113 KB) -2.2% (7.7%) Single-run full corpus shows the same direction across every input that creates a non-trivial number of nodes: a 1,000-image input -25.9%, a 1,000-link input -24.9%, a 5,000-tab-heavy input -22.4%, a 5,000-HTML-block input -11.6%, a 10,000-list-item input -10.3%, a 5,000-heading input -9.4%, a 10,000-code-span input -8.8%, a 1,000-emphasis-chain input -4.7%, the 'xxxx' x 10,000 input -6.5%, a 1,000-autolink input -5.6%. Trade-offs: In exchange for stable hidden classes, every node now carries a 'position: undefined' field even when position would never have been added. In practice every node in this codebase gets a real position assigned by enter(), so the only cost is one undefined slot for the brief window between factory and enter() call. Inputs where the gain is small or absent: A 1 MB single paragraph reported +2.9% on a single run. That input produces one giant text node whose hidden-class shape was already monomorphic, so the change has no benefit there and the small move sits inside the input's noise band. A 5,000-reference-link-definition input and a 10,000-unmatched-asterisks input each moved +/- 2% on a single run, also inside their respective noise bands. Multi-run showed legacy-defs at -0.0% (clean, no movement). The pure emphasis stress inputs ('a**b' repeated 10,000 times and similar) reported +17% to +29% on single runs but their cross-run spread is 12 to 50% on this stack alone, and these inputs do not exercise the node-creation hot path that this change targets, so the deltas are noise. Tests pass: dev + prod 1448/1448, mdast-util-gfm 54/54, mdast-util-mdx 11/13. The two failing mdx tests reproduce on upstream/main and are not introduced by this branch.

Copilot

Pull request overview

This PR applies a small performance optimization in the markdown compiler by pre-declaring the position field on built-in mdast node factories and on the root node. In this codebase, enter() is responsible for attaching positional data during compilation, so the change aims to keep node object shapes stable throughout parsing.

Changes:

Pre-declare position: undefined on the root AST node created in compile().
Pre-declare position: undefined on all built-in node factory return values.
Add an inline comment documenting the hidden-class rationale for the change.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

ChristianMurphy · 2026-05-04T18:35:15Z

📓 on this, plugins, like GFM should also pre-allocate position, otherwise adding plugins will get back to a spot where the node types are polymorphic and the speed up is partially lost.

Problem: the strikethrough handler creates each delete node as a fresh object literal, then mdast-util-from-markdown's enter() sets node.position afterwards. V8 treats that late assignment as a shape change, so every delete node walks through two hidden classes instead of one. Core nodes already avoid this by declaring position up front; without the same trick on extension nodes, GFM trees end up mixing two layouts and downstream tree walks lose monomorphic dispatch. Goal: declare position on the delete literal at construction so V8 keeps a single hidden class for delete from start to finish. Changes: - lib/index.js: trailing position: undefined on the delete literal in enterStrikethrough. Notes: pairs with the matching core change in syntax-tree/mdast-util-from-markdown#53; merged on its own this branch is a no-op. Validated with npm test (build, format, 100% coverage, 13 of 13 tests in dev and prod). Bench (local harness, 3-run median-of-medians, GFM tokenizer and mdast extensions stacked, with the core branch and three sibling extension branches checked out alongside this one): Input: one paragraph of 1000 ~~strike~~ runs. On the baseline this is the second-largest mdast-layer share in the GFM corpus (full 11.0 ms, tokenize-only 5.2 ms; the mdast layer is the gap between those two). setup delta vs main core branch alone +14.6% core branch with this paired +2.6% The 12-point swing is the recovery this branch is designed to produce: delete nodes now share the core layout, so tree walks stay monomorphic. A drift floor measured separately (main against itself, no code change) ran at +4 to +12 percent across most GFM inputs on this machine in the same window, so the +2.6 percent residual sits inside bench noise.

Problem: the footnote handlers build footnoteReference and footnoteDefinition nodes as fresh literals, then mdast-util-from- markdown's enter() sets node.position afterwards. V8 treats that late assignment as a shape change, so each node walks through two hidden classes instead of one. Core nodes already avoid this by declaring position up front; without the same trick on extension nodes, GFM trees end up with mixed layouts and tree walks pay a polymorphic-dispatch cost. Goal: declare position on both footnote literals at construction so V8 keeps a single hidden class per node type. Changes: - lib/index.js: trailing position: undefined on the footnoteReference literal in enterFootnoteCall. - lib/index.js: trailing position: undefined on the footnoteDefinition literal in enterFootnoteDefinition. Notes: pairs with the matching core change in syntax-tree/mdast-util-from-markdown#53; merged on its own this branch is a no-op. Validated with npm test (build, format, 100% coverage, 22 of 22 tests in dev and prod). Bench (local harness, 3-run median-of-medians, GFM tokenizer and mdast extensions stacked, with the core branch and three sibling extension branches checked out alongside this one): Input: a document with 1000 footnote definitions and 1000 references to them. On the baseline this scenario carries a modest mdast-layer share (full 78.0 ms, tokenize-only 62.4 ms) but every footnote node still pays the shape-change cost without the pre-declare. setup delta vs main core branch alone +19.5% core branch with this paired +13.0% The 6.5-point swing tracks the recovery on footnote nodes. A drift floor measured separately ran at +4 to +12 percent across most GFM inputs on this machine in the same window, putting the +13 percent residual at the upper end of that noise band.

@ts-expect-error

Problem: the table handlers build table, tableRow, and tableCell nodes as fresh literals, then mdast-util-from-markdown's enter() sets node.position afterwards. V8 treats the late assignment as a shape change, so each node walks through two hidden classes instead of one. Core nodes already avoid this by declaring position up front; extension nodes leave the tree with mixed layouts. Goal: declare position on all three factory literals at construction so V8 keeps a single hidden class per node type. Changes: - lib/index.js: trailing position: undefined on the table literal in enterTable. - lib/index.js: trailing position: undefined on the tableRow literal in enterRow. - lib/index.js: trailing position: undefined on the tableCell literal in enterCell. Notes: pairs with the matching core change in syntax-tree/mdast-util-from-markdown#53; merged on its own this branch is a no-op. Validated with npm test (build, format, 100% coverage, 32 of 32 tests in dev and prod). The branch also includes a small chore commit at lib/index.js:233 through 237 that drops three @ts-expect-error directives that no longer match the current markdown-table types and that fail tsc on upstream main as written. Without that cleanup the package's own test script does not pass, so the chore commit is required to land before this one. Squash or split per maintainer preference. Bench (local harness, 3-run median-of-medians, GFM tokenizer and mdast extensions stacked): Inputs: a 4-column 1000-row table and a 20-column 100-row table. Tables sit at low mdast-layer ratios on the baseline (1.08 and 1.09 respectively); the bottleneck for tables is in the micromark-extension- gfm-table tokenizer, not in this util. Numbers should be read as "no regression" rather than "credible win." scenario baseline core alone core + paired 4 cols x 1000 rows 281.1 ms +4.7% +6.2% 20 cols x 100 rows 55.1 ms +5.0% +7.2% The 1 to 2-point worsening from solo to paired sits inside the machine's drift band (a separately measured drift floor of main against itself ran at +4 to +12 percent across most GFM inputs in the same window). The pre-declare does not regress tables; it also does not move the needle for them, which matches what the baseline ratio predicts.

Problem: this extension builds link nodes (and their nested text children) in two paths: the enter handler for literal autolinks during parse, and a post-parse find-and-replace transform that walks text nodes and produces link/text nodes from URL-shaped substrings. In both paths the literals are fresh objects with no position field; position is patched in afterwards. V8 treats the late assignment as a shape change, so every link and text node walks through two hidden classes instead of one. Core nodes already avoid this; extension nodes leave the tree with mixed layouts and tree walks lose monomorphic dispatch. Goal: declare position on every link and text literal this extension produces, both in the enter handler and in the post-parse transform, so V8 keeps a single hidden class per node type. Changes: - lib/index.js: trailing position: undefined on the link literal in enterLiteralAutolink. - lib/index.js: trailing position: undefined on the link and nested text literals returned from findUrl, plus the trailing-text node when splitUrl strips punctuation. - lib/index.js: trailing position: undefined on the link and nested text literals returned from findEmail. Notes: the transform-path literals at lines 175 through 180, 183, and 207 through 212 are returned from find-and-replace callbacks rather than going through this.enter(). A one-off probe (parsing 'visit www.example.com\n' with the GFM extension) confirmed that something further down the pipeline populates position on these nodes, so the pre-declared field gets overwritten with a real value and the test fixtures continue to assert real position objects. Pairs with the matching core change in syntax-tree/mdast-util-from-markdown#53; merged on its own this branch is a no-op. Validated with npm test (build, format, 100% coverage, 49 of 49 tests in dev and prod). Bench (local harness, 3-run median-of-medians, GFM tokenizer and mdast extensions stacked, with the core branch and three sibling extension branches checked out alongside this one): Two scenarios exercise this code path. The first is 5000 bare URLs interleaved with surrounding text, which is the largest mdast-layer share in the GFM corpus on the baseline (full 319.6 ms, tokenize-only 142.8 ms; the gap between those is roughly 35 microseconds of mdast-layer work per URL, mostly node allocation and position bookkeeping). The second is the syntax-tree GFM-stack readme used as a real-world fixture, where the autolink resolver still walks every text node. scenario baseline core alone core + paired 5000 bare URLs + text 319.6 ms +2.3% -2.5% syntax-tree GFM-stack readme 22.8 ms +29.4% +22.5% The bare-URL scenario is the only paired-vs-baseline comparison across the whole 5-repo paired set that crosses zero (i.e., the paired configuration parses faster than upstream main does). The 4.8-point swing from solo to paired matches what the V8 hidden-class hypothesis predicts: with link and text literals sharing the core layout, every link in the resulting tree dispatches through one inline cache. A drift floor measured separately ran at +4 to +12 percent across most GFM inputs on this machine in the same window; the readme's +22.5 percent paired residual is in the upper end of that band, which is consistent with no real regression.

Problem: the strikethrough handler creates each delete node as a fresh object literal, then mdast-util-from-markdown's enter() sets node.position afterwards. V8 treats that late assignment as a shape change, so every delete node walks through two hidden classes instead of one. Core nodes already avoid this by declaring position up front; without the same trick on extension nodes, GFM trees end up mixing two layouts and downstream tree walks lose monomorphic dispatch. Goal: declare position on the delete literal at construction so V8 keeps a single hidden class for delete from start to finish. Changes: - lib/index.js: trailing position: undefined on the delete literal in enterStrikethrough. Notes: pairs with the matching core change in syntax-tree/mdast-util-from-markdown#53; merged on its own this branch is a no-op. Validated with npm test (build, format, 100% coverage, 13 of 13 tests in dev and prod). Bench (local harness, 3-run median-of-medians, GFM tokenizer and mdast extensions stacked). Numbers below are the parse time delta versus an unmodified upstream main; positive numbers mean slower, negative numbers mean faster. Input: one paragraph of 1000 ~~strike~~ runs. On the baseline this scenario carries the second-largest mdast-layer share in the GFM corpus (full 11.0 ms, tokenize-only 5.2 ms; the mdast layer is the gap between those two). Effect of pairing this branch with PR #53: - PR #53 alone, extensions on upstream main: 14.6 percent slower. Trees mix two layouts because delete nodes still take a shape change when position is patched in, so downstream tree-walk call sites go polymorphic. - PR #53 with this branch alongside (and the three sibling extension branches): 2.6 percent slower, which sits inside the bench's drift floor of 4 to 12 percent (main vs main with no code change in the same window). The paired configuration recovers about 12 percentage points relative to PR #53 on its own. Without this branch, PR #53 would appear to regress strikethrough-heavy parses; with it, the parse runs roughly as fast as upstream main does today.

Problem: the footnote handlers build footnoteReference and footnoteDefinition nodes as fresh literals, then mdast-util-from- markdown's enter() sets node.position afterwards. V8 treats that late assignment as a shape change, so each node walks through two hidden classes instead of one. Core nodes already avoid this by declaring position up front; without the same trick on extension nodes, GFM trees end up with mixed layouts and tree walks pay a polymorphic-dispatch cost. Goal: declare position on both footnote literals at construction so V8 keeps a single hidden class per node type. Changes: - lib/index.js: trailing position: undefined on the footnoteReference literal in enterFootnoteCall. - lib/index.js: trailing position: undefined on the footnoteDefinition literal in enterFootnoteDefinition. Notes: pairs with the matching core change in syntax-tree/mdast-util-from-markdown#53; merged on its own this branch is a no-op. Validated with npm test (build, format, 100% coverage, 22 of 22 tests in dev and prod). Bench (local harness, 3-run median-of-medians, GFM tokenizer and mdast extensions stacked). Numbers below are the parse time delta versus an unmodified upstream main; positive numbers mean slower, negative numbers mean faster. Input: a document with 1000 footnote definitions and 1000 references to them. On the baseline this scenario carries a modest mdast-layer share (full 78.0 ms, tokenize-only 62.4 ms) but every footnote node still pays the shape-change cost without the pre-declare. Effect of pairing this branch with PR #53: - PR #53 alone, extensions on upstream main: 19.5 percent slower. - PR #53 with this branch alongside (and the three sibling extension branches): 13.0 percent slower, which sits at the upper end of the bench's drift floor of 4 to 12 percent (main vs main with no code change in the same window). The paired configuration recovers about 6.5 percentage points relative to PR #53 on its own. Without this branch, PR #53 would appear to regress footnote-heavy parses; with it, most of that apparent regression goes away.

@ts-expect-error

Problem: the table handlers build table, tableRow, and tableCell nodes as fresh literals, then mdast-util-from-markdown's enter() sets node.position afterwards. V8 treats the late assignment as a shape change, so each node walks through two hidden classes instead of one. Core nodes already avoid this by declaring position up front; extension nodes leave the tree with mixed layouts. Goal: declare position on all three factory literals at construction so V8 keeps a single hidden class per node type. Changes: - lib/index.js: trailing position: undefined on the table literal in enterTable. - lib/index.js: trailing position: undefined on the tableRow literal in enterRow. - lib/index.js: trailing position: undefined on the tableCell literal in enterCell. Notes: pairs with the matching core change in syntax-tree/mdast-util-from-markdown#53; merged on its own this branch is a no-op. Validated with npm test (build, format, 100% coverage, 32 of 32 tests in dev and prod). The branch also includes a small chore commit at lib/index.js:233 through 237 that drops three @ts-expect-error directives that no longer match the current markdown-table types and that fail tsc on upstream main as written. Without that cleanup the package's own test script does not pass, so the chore commit is required to land before this one. Squash or split per maintainer preference. Bench (local harness, 3-run median-of-medians, GFM tokenizer and mdast extensions stacked). Numbers below are the parse time delta versus an unmodified upstream main; positive numbers mean slower, negative numbers mean faster. Inputs: a 4-column 1000-row table and a 20-column 100-row table. Tables sit at low mdast-layer ratios on the baseline (1.08 and 1.09 respectively); the bottleneck for tables is in micromark-extension-gfm-table, not in this util. Numbers should be read as "no regression" rather than "credible win." Effect of pairing this branch with PR #53: - 4 cols x 1000 rows: PR #53 alone: 4.7 percent slower. PR #53 with this branch alongside: 6.2 percent slower. - 20 cols x 100 rows: PR #53 alone: 5.0 percent slower. PR #53 with this branch alongside: 7.2 percent slower. Both rows sit inside the machine's drift band of 4 to 12 percent (main vs main with no code change in the same window). Pairing neither helps nor hurts table parses meaningfully, which matches what the baseline ratio predicts. The branch is included for shape consistency across the GFM stack rather than for a measurable per-table win.

Problem: this extension builds link nodes (and their nested text children) in two paths: the enter handler for literal autolinks during parse, and a post-parse find-and-replace transform that walks text nodes and produces link/text nodes from URL-shaped substrings. In both paths the literals are fresh objects with no position field; position is patched in afterwards. V8 treats the late assignment as a shape change, so every link and text node walks through two hidden classes instead of one. Core nodes already avoid this; extension nodes leave the tree with mixed layouts and tree walks lose monomorphic dispatch. Goal: declare position on every link and text literal this extension produces, both in the enter handler and in the post-parse transform, so V8 keeps a single hidden class per node type. Changes: - lib/index.js: trailing position: undefined on the link literal in enterLiteralAutolink. - lib/index.js: trailing position: undefined on the link and nested text literals returned from findUrl, plus the trailing-text node when splitUrl strips punctuation. - lib/index.js: trailing position: undefined on the link and nested text literals returned from findEmail. Notes: the transform-path literals at lines 175 through 180, 183, and 207 through 212 are returned from find-and-replace callbacks rather than going through this.enter(). A one-off probe (parsing 'visit www.example.com\n' with the GFM extension) confirmed that something further down the pipeline populates position on these nodes, so the pre-declared field gets overwritten with a real value and the test fixtures continue to assert real position objects. Pairs with the matching core change in syntax-tree/mdast-util-from-markdown#53; merged on its own this branch is a no-op. Validated with npm test (build, format, 100% coverage, 49 of 49 tests in dev and prod). Bench (local harness, 3-run median-of-medians, GFM tokenizer and mdast extensions stacked). Numbers below are the parse time delta versus an unmodified upstream main; positive numbers mean slower, negative numbers mean faster. Two scenarios exercise this code path. The first is 5000 bare URLs interleaved with surrounding text, which is the largest mdast-layer share in the GFM corpus on the baseline (full 319.6 ms, tokenize-only 142.8 ms; the gap between those is roughly 35 microseconds of mdast-layer work per URL, mostly node allocation and position bookkeeping). The second is the syntax-tree GFM-stack readme used as a real-world fixture, where the autolink resolver walks every text node. Effect of pairing this branch with PR #53: - 5000 bare URLs interleaved with text: PR #53 alone: 2.3 percent slower than baseline. PR #53 with this branch alongside: 2.5 percent FASTER than baseline. - syntax-tree GFM-stack readme: PR #53 alone: 29.4 percent slower than baseline. PR #53 with this branch alongside: 22.5 percent slower than baseline. The bare-URL scenario is the only configuration across the whole 5-repo paired set where the paired build runs faster than upstream main: pairing recovers 4.8 percentage points and crosses zero into a real win. The 6.9-point swing on the readme matches what the V8 hidden-class hypothesis predicts: with link and text literals sharing the core layout, every link in the resulting tree dispatches through one inline cache. The readme's 22.5 percent paired residual sits in the upper end of the machine's drift band (a separately measured drift floor of main against itself ran at 4 to 12 percent across most GFM inputs in the same window), which is consistent with no real regression on top of the change.

github-actions Bot added 👋 phase/new Post is being triaged automatically 🤞 phase/open Post is being triaged manually and removed 👋 phase/new Post is being triaged automatically labels May 3, 2026

ChristianMurphy requested a review from Copilot May 3, 2026 13:57

ChristianMurphy added the 🏁 area/perf This affects performance label May 3, 2026

Copilot started reviewing on behalf of ChristianMurphy May 3, 2026 13:58 View session

Copilot AI reviewed May 3, 2026

View reviewed changes

ChristianMurphy mentioned this pull request May 4, 2026

perf: pre-declare position on table factories syntax-tree/mdast-util-gfm-table#9

Open

6 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

perf(compile): pre-declare position in node factories#53

perf(compile): pre-declare position in node factories#53
ChristianMurphy wants to merge 1 commit intosyntax-tree:mainfrom
ChristianMurphy:perf/stable-node-shape

ChristianMurphy commented May 3, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

ChristianMurphy commented May 4, 2026

Uh oh!

Reviewers

Assignees

Labels

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

ChristianMurphy commented May 3, 2026

Initial checklist

Description of changes

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

ChristianMurphy commented May 4, 2026

Uh oh!

Reviewers

Assignees

Labels

Milestone

Development

Uh oh!

2 participants