fix: flip attention marker to avoid nested-emphasis fusion#74
fix: flip attention marker to avoid nested-emphasis fusion#74ChristianMurphy wants to merge 1 commit intomainfrom
Conversation
There was a problem hiding this comment.
Pull request overview
Fixes markdown round-tripping for nested attention (emphasis/strong) by choosing markers that avoid delimiter fusion and CommonMark re-tokenization, addressing the edge case described in #12.
Changes:
- Added a new utility to pick
*vs_per emphasis/strong node based on surrounding adjacency and a three-deep boundary-chain rule. - Updated
emphasisandstronghandlers to use the new marker-selection utility. - Added tests for nested attention marker flipping and for some round-trip scenarios.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
lib/util/emphasis-strong-marker.js |
New marker-selection logic to avoid attention delimiter fusion and handle three-deep same-type boundary chains. |
lib/handle/emphasis.js |
Uses the new utility to choose the emphasis marker at serialization time. |
lib/handle/strong.js |
Uses the new utility to choose the strong marker at serialization time. |
test/index.js |
Adds tests covering nested attention marker flipping and round-trip behavior for selected cases. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Problem: emphasis uses the configured marker unconditionally, so `emphasis > emphasis > text` serializes as `**a**`, which re-parses as `strong > text`. The same path erases the outer type from `strong > emphasis > text` (`***a***` re-parses as `emphasis > strong`) and collapses any strict three-or-more nested emphasis chain (`***a***` re-parses as `emphasis > strong` rather than three nested emphases). GH-12 catalogued a broader family of related shapes: `***emphasis*in emphasis*`, `*a*_b__`, `a ***b*c d*`, and variants with different leading run lengths. As wooorm noted in the thread, "pulling a thread somewhere will have something happen somewhere entirely different": marker choice in one place interacts with flanking rules elsewhere, and covering every shape needs coordination beyond picking a single marker per node. Scope: land the minimal change that fixes the shapes guaranteed to drift under the current serializer, without regressing shapes where CommonMark's attention algorithm already recovers the original tree through fusion. Escape-based work on the remaining GH-12 shapes is left for follow-up. Approach: introduce `lib/util/emphasis-marker.js`. Both the emphasis handler and its peek route marker selection through it, keeping lookahead in `container-phrasing` consistent with what the handler emits. The helper flips in two narrow situations: 1. The emphasis is the only child of an attention parent (emphasis or strong), and both its opening and closing markers would sit immediately next to the parent's primary marker. Using the opposite marker (`*_a_*`, `**_a_**`) breaks the fusion into strong or em+strong. 2. The emphasis sits at the top of a strict same-type chain of depth two or more (every link has exactly one emphasis child) when the primary marker is `*`. Three-deep emphasis only round-trips with `_` on the outside, because `_`'s flanking rules are stricter than `*`'s. The rule is asymmetric on purpose: with primary `_` the first rule's adjacency flip alternates correctly on its own. Strong is never flipped. A run of four asterisks already pairs as two strong delimiters, six as three, and so on, so strong round-trips without help. Journey (what was tried and why the scope narrowed): - An earlier iteration flipped strong too and regressed ~18 corpus fixtures whose nested-strong shapes relied on long fused runs of asterisks. Strong was dropped from the flip to recover them. - Flipping whenever `info.before` or `info.after` matches the primary caused cascading flips on paragraph-level attention siblings: `[emphasis, strong, emphasis]` serialized as `_a___a__*a*`, where `_` + `__` at the em/strong boundary re-tokenised as a single `___` run. The flip was narrowed to attention parents only. - Widening rule 1 to first-or-last-child of any attention parent fixed several GH-12 shapes (`***emphasis*in emphasis*`, `***x*y z*`, `****x*y z*`) but regressed `***a*a*-*` (`emphasis > [emphasis > [emphasis, text], text]`): CommonMark's rule 17 uses the leading `***` fusion to recover the three-deep structure, and the flip broke that recovery. The rule was tightened to only-child plus strict-chain. - The only-child formulation plus the strict-chain rule is the widest version verified to cause zero transitions from ok to finding across 600 corpus files (commonmark, gfm, all configurations). Edge cases covered by new tests: - Plain emphasis and strong, with primaries `*` and `_`, showing the helper is inert on non-nested attention. - `emphasis > emphasis` with each of primary `*` and `_`, yielding `*_a_*` and `_*a*_`. - `strong > emphasis` yielding `**_a_**`. - `emphasis > strong` and `strong > strong` preserved at `***a***` and `****a****`, proving strong is untouched. - Strict three-deep emphasis chains with both primaries, both yielding `_*_a_*_` (chain flip vs adjacency flip arrive at the same output by different routes). - Emphasis parents with more than one child, demonstrating the only-child guard preserves shapes vanilla handles. - Middle-sibling emphasis, confirming no flip at non-boundary positions. - Top-level `[emphasis, strong, emphasis]` round-trip preserved. - `***a*a*-*` fusion shape explicitly preserved as a regression guard against future widening. - Round-trips for parsed `*_a_*`, parsed `_*_a_*_`, synthesised `emphasis > emphasis`, synthesised `strong > emphasis`, and a three-deep chain preceded by a text sibling. Refs: #12
121fc17 to
9f155f4
Compare
There was a problem hiding this comment.
Pull request overview
This PR adjusts how emphasis delimiters are selected during serialization to prevent CommonMark attention delimiter fusion from changing the parsed AST shape (notably for nested emphasis), and adds regression tests for the affected roundtrip cases.
Changes:
- Add
lib/util/emphasis-marker.jsto centralize emphasis marker selection and selectively flip markers in specific nesting scenarios. - Update the
emphasishandler (and itspeek) to use the new marker-selection helper. - Expand the emphasis test suite with targeted cases and roundtrip guards for nested attention edge cases.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| test/index.js | Adds new serialization and roundtrip tests covering nested emphasis/strong and known fusion edge cases. |
| lib/util/emphasis-marker.js | Introduces helper to choose * vs _ for emphasis to avoid delimiter fusion changing parse results. |
| lib/handle/emphasis.js | Routes emphasis delimiter selection through the new helper for both handler and peek behavior. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| if (primary === '*' && strictChainDepth(node) >= 2) return other | ||
|
|
||
| return primary | ||
| } | ||
|
|
||
| /** |
There was a problem hiding this comment.
observation is correct: strictChainDepth(node) >= 2 does match multiple nodes in a strict chain of depth 4 or deeper, which I'll grant wasn't by explicit design. But applying the proposed gating is a regression, not an improvement. Comparison across depths:
| depth | current output | current reparses as | copilot output | copilot reparses as |
|---|---|---|---|---|
| 2 | *_a_* |
em > em > text("a") ✅ |
*_a_* |
em > em > text("a") ✅ |
| 3 | _*_a_*_ |
em > em > em > text("a") ✅ |
_*_a_*_ |
same ✅ |
| 4 | __*_a_*__ |
strong > em > em > text("a") |
_*_*a*_*_ |
[em(text("*")), em("a"), em(text("*"))] |
| 5 | ___*_a_*___ |
em > strong > em > em > text("a") |
_*_*_a_*_*_ |
4 flat em with literal * text |
| 6 | ____*_a_*____ |
strong > strong > em > em > text("a") |
_*_*_*a*_*_*_ |
4 flat em with literal * text |
Both approaches drift at depth 4+, CommonMark has no round-tripping representation for a strict 4-deep emphasis chain. But the current output is strictly more faithful to the input tree:
- The text content (
"a") is preserved. - Attention nesting is preserved (the inner
em > emsurvives). - Only the outermost element type changes (em → strong).
The proposed fix:
- Flattens the nesting entirely into sibling
emnodes. - Injects literal
*or_characters into text nodes that weren't in the original tree (content corruption).
So the __...__ at the outer boundary looks alarming, but it's producing the least-bad degradation at depth 4+. I'd rather keep the accidental over-match and document it than trade it for a drift that corrupts content.
Murderlon
left a comment
There was a problem hiding this comment.
Looks high-quality 👍 . Double checked with Devin review as well and no issues found: https://app.devin.ai/review/syntax-tree/mdast-util-to-markdown/pull/74
Initial checklist
Description of changes
Problem: emphasis uses the configured marker unconditionally, so
emphasis > emphasis > textserializes as**a**, which re-parsesas
strong > text. The same path erases the outer type fromstrong > emphasis > text(***a***re-parses asemphasis > strong)and collapses any strict three-or-more nested emphasis chain
(
***a***re-parses asemphasis > strongrather than threenested emphases).
#12 catalogued a broader family of related shapes:
***emphasis*in emphasis*,*a*_b__,a ***b*c d*, and variantswith different leading run lengths. As wooorm noted in the thread,
"pulling a thread somewhere will have something happen somewhere
entirely different": marker choice in one place interacts with
flanking rules elsewhere, and covering every shape needs
coordination beyond picking a single marker per node.
Scope: land the minimal change that fixes the shapes guaranteed to
drift under the current serializer, without regressing shapes where
CommonMark's attention algorithm already recovers the original tree
through fusion. Escape-based work on the remaining #12 shapes is
left for follow-up.
Approach: introduce
lib/util/emphasis-marker.js. Both theemphasis handler and its peek route marker selection through it,
keeping lookahead in
container-phrasingconsistent with what thehandler emits. The helper flips in two narrow situations:
The emphasis is the only child of an attention parent (emphasis
or strong), and both its opening and closing markers would sit
immediately next to the parent's primary marker. Using the
opposite marker (
*_a_*,**_a_**) breaks the fusion intostrong or em+strong.
The emphasis sits at the top of a strict same-type chain of
depth two or more (every link has exactly one emphasis child)
when the primary marker is
*. Three-deep emphasis onlyround-trips with
_on the outside, because_'s flankingrules are stricter than
*'s. The rule is asymmetric onpurpose: with primary
_the first rule's adjacency flipalternates correctly on its own.
Strong is never flipped. A run of four asterisks already pairs as
two strong delimiters, six as three, and so on, so strong
round-trips without help.
Journey (what was tried and why the scope narrowed):
An earlier iteration flipped strong too and regressed ~18 corpus
fixtures whose nested-strong shapes relied on long fused runs of
asterisks. Strong was dropped from the flip to recover them.
Flipping whenever
info.beforeorinfo.aftermatches theprimary caused cascading flips on paragraph-level attention
siblings:
[emphasis, strong, emphasis]serialized as_a___a__*a*, where_+__at the em/strong boundaryre-tokenised as a single
___run. The flip was narrowed toattention parents only.
Widening rule 1 to first-or-last-child of any attention parent
fixed several roundtripping emphasis in emphasis edge case changes document structure #12 shapes (
***emphasis*in emphasis*,***x*y z*,****x*y z*) but regressed***a*a*-*(
emphasis > [emphasis > [emphasis, text], text]): CommonMark'srule 17 uses the leading
***fusion to recover the three-deepstructure, and the flip broke that recovery. The rule was
tightened to only-child plus strict-chain.
The only-child formulation plus the strict-chain rule is the
widest version verified to cause zero transitions from ok to
finding across 600 corpus files (commonmark, gfm, all
configurations).
Edge cases covered by new tests:
*and_, showingthe helper is inert on non-nested attention.
emphasis > emphasiswith each of primary*and_, yielding*_a_*and_*a*_.strong > emphasisyielding**_a_**.emphasis > strongandstrong > strongpreserved at***a***and
****a****, proving strong is untouched.yielding
_*_a_*_(chain flip vs adjacency flip arrive at thesame output by different routes).
only-child guard preserves shapes vanilla handles.
positions.
[emphasis, strong, emphasis]round-trip preserved.***a*a*-*fusion shape explicitly preserved as a regressionguard against future widening.
*_a_*, parsed_*_a_*_, synthesisedemphasis > emphasis, synthesisedstrong > emphasis, and athree-deep chain preceded by a text sibling.
Refs: #12
Disclosure: written with AI assistance