Skip to content

fix: flip attention marker to avoid nested-emphasis fusion#74

Open
ChristianMurphy wants to merge 1 commit intomainfrom
fix/nested-attention-marker-flip
Open

fix: flip attention marker to avoid nested-emphasis fusion#74
ChristianMurphy wants to merge 1 commit intomainfrom
fix/nested-attention-marker-flip

Conversation

@ChristianMurphy
Copy link
Copy Markdown
Member

@ChristianMurphy ChristianMurphy commented Apr 23, 2026

Initial checklist

  • I read the support docs
  • I read the contributing guide
  • I agree to follow the code of conduct
  • I searched issues and discussions and couldn’t find anything or linked relevant results below
  • I made sure the docs are up to date
  • I included tests (or that’s not needed)

Description of changes

Problem: emphasis uses the configured marker unconditionally, so
emphasis > emphasis > text serializes as **a**, which re-parses
as strong > text. The same path erases the outer type from
strong > emphasis > text (***a*** re-parses as emphasis > strong)
and collapses any strict three-or-more nested emphasis chain
(***a*** re-parses as emphasis > strong rather than three
nested emphases).

#12 catalogued a broader family of related shapes:
***emphasis*in emphasis*, *a*_b__, a ***b*c d*, and variants
with different leading run lengths. As wooorm noted in the thread,
"pulling a thread somewhere will have something happen somewhere
entirely different": marker choice in one place interacts with
flanking rules elsewhere, and covering every shape needs
coordination beyond picking a single marker per node.

Scope: land the minimal change that fixes the shapes guaranteed to
drift under the current serializer, without regressing shapes where
CommonMark's attention algorithm already recovers the original tree
through fusion. Escape-based work on the remaining #12 shapes is
left for follow-up.

Approach: introduce lib/util/emphasis-marker.js. Both the
emphasis handler and its peek route marker selection through it,
keeping lookahead in container-phrasing consistent with what the
handler emits. The helper flips in two narrow situations:

  1. The emphasis is the only child of an attention parent (emphasis
    or strong), and both its opening and closing markers would sit
    immediately next to the parent's primary marker. Using the
    opposite marker (*_a_*, **_a_**) breaks the fusion into
    strong or em+strong.

  2. The emphasis sits at the top of a strict same-type chain of
    depth two or more (every link has exactly one emphasis child)
    when the primary marker is *. Three-deep emphasis only
    round-trips with _ on the outside, because _'s flanking
    rules are stricter than *'s. The rule is asymmetric on
    purpose: with primary _ the first rule's adjacency flip
    alternates correctly on its own.

Strong is never flipped. A run of four asterisks already pairs as
two strong delimiters, six as three, and so on, so strong
round-trips without help.

Journey (what was tried and why the scope narrowed):

  • An earlier iteration flipped strong too and regressed ~18 corpus
    fixtures whose nested-strong shapes relied on long fused runs of
    asterisks. Strong was dropped from the flip to recover them.

  • Flipping whenever info.before or info.after matches the
    primary caused cascading flips on paragraph-level attention
    siblings: [emphasis, strong, emphasis] serialized as
    _a___a__*a*, where _ + __ at the em/strong boundary
    re-tokenised as a single ___ run. The flip was narrowed to
    attention parents only.

  • Widening rule 1 to first-or-last-child of any attention parent
    fixed several roundtripping emphasis in emphasis edge case changes document structure #12 shapes (***emphasis*in emphasis*,
    ***x*y z*, ****x*y z*) but regressed ***a*a*-*
    (emphasis > [emphasis > [emphasis, text], text]): CommonMark's
    rule 17 uses the leading *** fusion to recover the three-deep
    structure, and the flip broke that recovery. The rule was
    tightened to only-child plus strict-chain.

  • The only-child formulation plus the strict-chain rule is the
    widest version verified to cause zero transitions from ok to
    finding across 600 corpus files (commonmark, gfm, all
    configurations).

Edge cases covered by new tests:

  • Plain emphasis and strong, with primaries * and _, showing
    the helper is inert on non-nested attention.
  • emphasis > emphasis with each of primary * and _, yielding
    *_a_* and _*a*_.
  • strong > emphasis yielding **_a_**.
  • emphasis > strong and strong > strong preserved at ***a***
    and ****a****, proving strong is untouched.
  • Strict three-deep emphasis chains with both primaries, both
    yielding _*_a_*_ (chain flip vs adjacency flip arrive at the
    same output by different routes).
  • Emphasis parents with more than one child, demonstrating the
    only-child guard preserves shapes vanilla handles.
  • Middle-sibling emphasis, confirming no flip at non-boundary
    positions.
  • Top-level [emphasis, strong, emphasis] round-trip preserved.
  • ***a*a*-* fusion shape explicitly preserved as a regression
    guard against future widening.
  • Round-trips for parsed *_a_*, parsed _*_a_*_, synthesised
    emphasis > emphasis, synthesised strong > emphasis, and a
    three-deep chain preceded by a text sibling.

Refs: #12

Disclosure: written with AI assistance

@ChristianMurphy ChristianMurphy requested a review from Copilot April 23, 2026 23:05
@github-actions github-actions Bot added 👋 phase/new Post is being triaged automatically 🤞 phase/open Post is being triaged manually and removed 👋 phase/new Post is being triaged automatically labels Apr 23, 2026
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes markdown round-tripping for nested attention (emphasis/strong) by choosing markers that avoid delimiter fusion and CommonMark re-tokenization, addressing the edge case described in #12.

Changes:

  • Added a new utility to pick * vs _ per emphasis/strong node based on surrounding adjacency and a three-deep boundary-chain rule.
  • Updated emphasis and strong handlers to use the new marker-selection utility.
  • Added tests for nested attention marker flipping and for some round-trip scenarios.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

File Description
lib/util/emphasis-strong-marker.js New marker-selection logic to avoid attention delimiter fusion and handle three-deep same-type boundary chains.
lib/handle/emphasis.js Uses the new utility to choose the emphasis marker at serialization time.
lib/handle/strong.js Uses the new utility to choose the strong marker at serialization time.
test/index.js Adds tests covering nested attention marker flipping and round-trip behavior for selected cases.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread lib/handle/emphasis.js Outdated
Comment thread lib/handle/strong.js
Comment thread test/index.js
Problem: emphasis uses the configured marker unconditionally, so
`emphasis > emphasis > text` serializes as `**a**`, which re-parses
as `strong > text`. The same path erases the outer type from
`strong > emphasis > text` (`***a***` re-parses as `emphasis > strong`)
and collapses any strict three-or-more nested emphasis chain
(`***a***` re-parses as `emphasis > strong` rather than three
nested emphases).

GH-12 catalogued a broader family of related shapes:
`***emphasis*in emphasis*`, `*a*_b__`, `a ***b*c d*`, and variants
with different leading run lengths. As wooorm noted in the thread,
"pulling a thread somewhere will have something happen somewhere
entirely different": marker choice in one place interacts with
flanking rules elsewhere, and covering every shape needs
coordination beyond picking a single marker per node.

Scope: land the minimal change that fixes the shapes guaranteed to
drift under the current serializer, without regressing shapes where
CommonMark's attention algorithm already recovers the original tree
through fusion. Escape-based work on the remaining GH-12 shapes is
left for follow-up.

Approach: introduce `lib/util/emphasis-marker.js`. Both the
emphasis handler and its peek route marker selection through it,
keeping lookahead in `container-phrasing` consistent with what the
handler emits. The helper flips in two narrow situations:

1.  The emphasis is the only child of an attention parent (emphasis
    or strong), and both its opening and closing markers would sit
    immediately next to the parent's primary marker. Using the
    opposite marker (`*_a_*`, `**_a_**`) breaks the fusion into
    strong or em+strong.

2.  The emphasis sits at the top of a strict same-type chain of
    depth two or more (every link has exactly one emphasis child)
    when the primary marker is `*`. Three-deep emphasis only
    round-trips with `_` on the outside, because `_`'s flanking
    rules are stricter than `*`'s. The rule is asymmetric on
    purpose: with primary `_` the first rule's adjacency flip
    alternates correctly on its own.

Strong is never flipped. A run of four asterisks already pairs as
two strong delimiters, six as three, and so on, so strong
round-trips without help.

Journey (what was tried and why the scope narrowed):

- An earlier iteration flipped strong too and regressed ~18 corpus
  fixtures whose nested-strong shapes relied on long fused runs of
  asterisks. Strong was dropped from the flip to recover them.

- Flipping whenever `info.before` or `info.after` matches the
  primary caused cascading flips on paragraph-level attention
  siblings: `[emphasis, strong, emphasis]` serialized as
  `_a___a__*a*`, where `_` + `__` at the em/strong boundary
  re-tokenised as a single `___` run. The flip was narrowed to
  attention parents only.

- Widening rule 1 to first-or-last-child of any attention parent
  fixed several GH-12 shapes (`***emphasis*in emphasis*`,
  `***x*y z*`, `****x*y z*`) but regressed `***a*a*-*`
  (`emphasis > [emphasis > [emphasis, text], text]`): CommonMark's
  rule 17 uses the leading `***` fusion to recover the three-deep
  structure, and the flip broke that recovery. The rule was
  tightened to only-child plus strict-chain.

- The only-child formulation plus the strict-chain rule is the
  widest version verified to cause zero transitions from ok to
  finding across 600 corpus files (commonmark, gfm, all
  configurations).

Edge cases covered by new tests:

- Plain emphasis and strong, with primaries `*` and `_`, showing
  the helper is inert on non-nested attention.
- `emphasis > emphasis` with each of primary `*` and `_`, yielding
  `*_a_*` and `_*a*_`.
- `strong > emphasis` yielding `**_a_**`.
- `emphasis > strong` and `strong > strong` preserved at `***a***`
  and `****a****`, proving strong is untouched.
- Strict three-deep emphasis chains with both primaries, both
  yielding `_*_a_*_` (chain flip vs adjacency flip arrive at the
  same output by different routes).
- Emphasis parents with more than one child, demonstrating the
  only-child guard preserves shapes vanilla handles.
- Middle-sibling emphasis, confirming no flip at non-boundary
  positions.
- Top-level `[emphasis, strong, emphasis]` round-trip preserved.
- `***a*a*-*` fusion shape explicitly preserved as a regression
  guard against future widening.
- Round-trips for parsed `*_a_*`, parsed `_*_a_*_`, synthesised
  `emphasis > emphasis`, synthesised `strong > emphasis`, and a
  three-deep chain preceded by a text sibling.

Refs: #12
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adjusts how emphasis delimiters are selected during serialization to prevent CommonMark attention delimiter fusion from changing the parsed AST shape (notably for nested emphasis), and adds regression tests for the affected roundtrip cases.

Changes:

  • Add lib/util/emphasis-marker.js to centralize emphasis marker selection and selectively flip markers in specific nesting scenarios.
  • Update the emphasis handler (and its peek) to use the new marker-selection helper.
  • Expand the emphasis test suite with targeted cases and roundtrip guards for nested attention edge cases.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

File Description
test/index.js Adds new serialization and roundtrip tests covering nested emphasis/strong and known fusion edge cases.
lib/util/emphasis-marker.js Introduces helper to choose * vs _ for emphasis to avoid delimiter fusion changing parse results.
lib/handle/emphasis.js Routes emphasis delimiter selection through the new helper for both handler and peek behavior.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +58 to +63
if (primary === '*' && strictChainDepth(node) >= 2) return other

return primary
}

/**
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

observation is correct: strictChainDepth(node) >= 2 does match multiple nodes in a strict chain of depth 4 or deeper, which I'll grant wasn't by explicit design. But applying the proposed gating is a regression, not an improvement. Comparison across depths:

depth current output current reparses as copilot output copilot reparses as
2 *_a_* em > em > text("a") *_a_* em > em > text("a")
3 _*_a_*_ em > em > em > text("a") _*_a_*_ same ✅
4 __*_a_*__ strong > em > em > text("a") _*_*a*_*_ [em(text("*")), em("a"), em(text("*"))]
5 ___*_a_*___ em > strong > em > em > text("a") _*_*_a_*_*_ 4 flat em with literal * text
6 ____*_a_*____ strong > strong > em > em > text("a") _*_*_*a*_*_*_ 4 flat em with literal * text

Both approaches drift at depth 4+, CommonMark has no round-tripping representation for a strict 4-deep emphasis chain. But the current output is strictly more faithful to the input tree:

  • The text content ("a") is preserved.
  • Attention nesting is preserved (the inner em > em survives).
  • Only the outermost element type changes (em → strong).

The proposed fix:

  • Flattens the nesting entirely into sibling em nodes.
  • Injects literal * or _ characters into text nodes that weren't in the original tree (content corruption).

So the __...__ at the outer boundary looks alarming, but it's producing the least-bad degradation at depth 4+. I'd rather keep the accidental over-match and document it than trade it for a drift that corrupts content.

Copy link
Copy Markdown
Member

@Murderlon Murderlon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks high-quality 👍 . Double checked with Devin review as well and no issues found: https://app.devin.ai/review/syntax-tree/mdast-util-to-markdown/pull/74

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

🤞 phase/open Post is being triaged manually

Development

Successfully merging this pull request may close these issues.

3 participants