Skip to content

feat(core): DOCX to Markdown converter#595

Open
jedrazb wants to merge 12 commits into
mainfrom
feat/docx-to-markdown
Open

feat(core): DOCX to Markdown converter#595
jedrazb wants to merge 12 commits into
mainfrom
feat/docx-to-markdown

Conversation

@jedrazb
Copy link
Copy Markdown
Contributor

@jedrazb jedrazb commented May 24, 2026

Summary

New subpath @eigenpal/docx-editor-core/markdown with four entry functions and a drop-a-docx playground.

import { toMarkdown, toMarkdownPaged, toMarkdownAsync } from '@eigenpal/docx-editor-core/markdown';

const { markdown, images, warnings } = await toMarkdown(buffer);
const { pages, combined } = await toMarkdownPaged(buffer);
const result = await toMarkdownAsync(buffer, {
  imageHandler: async (ref) => `![${await describe(ref.base64)}]()`,
});
  • Continuous and paged. toMarkdown is one string; toMarkdownPaged returns one entry per Word page plus a combined string with <!-- page N --> separators. Page boundaries come from Word's pre-baked pagination hints (renderedPageBreakBefore, explicit w:br type="page", section breaks of type nextPage/evenPage/oddPage). No canvas dep.
  • Async image handlers. toMarkdownAsync and toMarkdownPagedAsync wrap the base functions and await an imageHandler(ref, ctx) callback per image. The handler's return string substitutes the default ![alt](./images/...) reference. Errors per image push a warning, never throw.
  • Options. annotations (html | pandoc | strip) for things markdown can't express (comments, tracked changes, headers/footers). trackedChanges (clean | annotate). comments (strip | inline | sidecar). hyperlinks (inline | reference). footnotes (inline | end). headerFooter (paged only: strip | first-page | all). Custom imagePath(info) for naming. Defaults are fidelity-first.
  • Images. Returned as Map<virtualPath, ImageRef> with raw bytes + base64 + data URL + metadata. Callers decide what to do (data URL, blob upload, vision-model description, drop). A media file referenced N times is registered once.
  • Playground. examples/markdown-playground: drop a .docx, see rendered Word on the left, markdown on the right, toggle every option live.

Test plan

  • bun run typecheck clean across the workspace
  • bun run format clean
  • bun run check:parity-contract passes (no React/Vue surface touched)
  • bun run api:extract updated; markdown.api.md snapshot committed
  • Smoke tested against examples/shared/sample.docx, screenshots/demo.docx, screenshots/pr320/multi-section.docx for continuous, paged, async-with-image-handler, and custom-imagePath
  • 72/72 playwright formatting specs pass (no regression in the editor pipeline)
  • Visual verification of the playground in Chrome (continuous + paged, all option dropdowns)

Run the playground locally:

bun run --filter './examples/markdown-playground' dev
# http://localhost:5180

…markdown

Four entry functions, sync from a parsed Document, async from raw bytes:

  toMarkdown            continuous markdown
  toMarkdownPaged       one entry per Word page + a `combined` string
                        with `<!-- page N -->` separators
  toMarkdownAsync       toMarkdown + per-image handler callback
  toMarkdownPagedAsync  toMarkdownPaged + per-image handler callback

Page boundaries come from Word's pre-baked pagination hints
(renderedPageBreakBefore, explicit `w:br type="page"`, section breaks of
type nextPage/evenPage/oddPage). No canvas dep.

Options: annotations (html | pandoc | strip), trackedChanges (clean |
annotate), comments (strip | inline | sidecar), hyperlinks (inline |
reference), footnotes (inline | end), headerFooter (paged-only: strip |
first-page | all), plus a custom imagePath(info) callback. Images are
returned as a Map<virtualPath, ImageRef> so callers decide what to do
with bytes (data URL, blob upload, vision-model description, drop).

A live drop-a-docx playground ships at examples/markdown-playground:

  bun run --filter './examples/markdown-playground' dev
@vercel
Copy link
Copy Markdown

vercel Bot commented May 24, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
docx-editor Error Error May 25, 2026 8:13am

Request Review

@eigenpal-release-pal
Copy link
Copy Markdown
Contributor

eigenpal-release-pal Bot commented May 24, 2026

All contributors have signed the CLA ✍️ ✅

Posted by the CLA bot.

GFM tables don't support colspan/rowspan, so merged-cell and nested-table
DOCX content previously lost structure (one warning per limitation,
content squashed into the first cell). Detection now runs per-table: a
table with any `gridSpan > 1`, `vMerge`, or nested `<w:tbl>` falls back
to inline `<table>` with proper `colspan`/`rowspan`. Simple tables stay
GFM.

Inline marks inside HTML cells use HTML tags (`<strong>`, `<em>`,
`<code>`, `<s>`, `<u>`, `<a>`) because markdown is not parsed inside
HTML blocks. Bookmark markers, field chrome, and soft hyphens drop as
expected. Hyperlinks resolve `href` or `#anchor` the same way GFM does.

Net effect on `screenshots/demo.docx`: 26 -> 0 warnings, all merged-cell
and nested-table content now renders correctly on GitHub/Notion/etc.
32 focused bun:test cases under packages/core/src/markdown/markdown.test.ts.
Tests build Document literals directly so each case pins one renderer
decision without depending on the parser or any .docx fixture.

Coverage:
  - basics: Document input, paragraph join, empty doc warning, bad-input error
  - inline marks: bold/italic/strike, mark nesting, monospace whitelist
  - block structure: heading levels from styleId, bullet vs ordered lists
  - hyperlinks: inline + reference modes, orphan-link warning
  - tracked changes: annotate/clean modes, pandoc annotations
  - comments: inline/strip/sidecar modes
  - tables: GFM path, HTML fallback for gridSpan, rowspan from vMerge,
    nested table recursion
  - paged: renderedPageBreakBefore, sectionStart, combined separator,
    empty-doc returns zero pages, header/footer emission
  - images: media-path dedupe, async substitution, custom imagePath,
    handler-error warning

Fix a latent bug found by the nested-table test: a nested table inside
an HTML cell was falling through to the GFM renderer, which doesn't
render inside HTML blocks. Force HTML rendering for nested tables in
HTML cells.
Found in the wild on a 6-page banking template (template (1).docx).

1. Empty marker paragraphs caused double-counting.
   Word's idiom for an explicit page break is an empty paragraph whose
   only run is a `<w:br type="page"/>`, plus `renderedPageBreakBefore`
   on both that paragraph AND the next one. The splitter counted both
   signals, turning a 6-page doc into 11 pages. Detect "pure break
   paragraphs" (no visible text, only a page-break run), consume them
   as transitions, and drop them from the output.

2. Header/footer never emitted.
   `resolveHeaderFooter` only looked at `Section.headers` (the
   HeaderFooterType-keyed map), which is empty in practice. The Word-
   authored layout puts headers in `DocxPackage.headers` keyed by rId,
   with the section's `headerReferences` mapping HeaderFooterType to
   rId. Resolve via the reference chain, with the package map as a
   last-resort fallback. Same for footers.

3. Header/footer images failed to resolve.
   Headers have their own relationships file (word/_rels/header1.xml.rels)
   whose rIds do NOT live in `pkg.relationships`. The parser already
   inlines the image into `image.src` as a data URL for this exact
   reason. Add a fallback in the drawing renderer: if the rels lookup
   misses but `image.src` is set, emit `![alt](src)` (or `<img src>`
   in HTML cells).

4. Markdown inside `<header>`/`<footer>` wasn't parsed.
   CommonMark and GFM stop parsing markdown at HTML block boundaries,
   so `<header>**Bold**</header>` rendered as literal `**Bold**`. Use
   blank lines between the wrapper tag and content to make each side
   parse as its own HTML block; the inner content then parses as
   normal markdown.

Playground:
- Default method is `paged` (the more interesting case on load).
- Conversion re-runs on every option change. The parser detaches the
  ArrayBuffer it consumes, so re-runs were silently reading from a
  consumed view. Clone a fresh `Uint8Array` per call.
- `Rendered`/`Raw` toggle in the markdown pane using `react-markdown`
  + `remark-gfm` + `rehype-raw`. Styles match a clean Word-doc look.
Lets non-DOM runtimes (Node, Bun, workers) inject a Canvas2D-compatible
context (`@napi-rs/canvas`, `canvaskit-wasm`, etc.) into the measurement
module. Internal layout primitives unchanged; just adds the injection
point + re-exports it.

Power-user extension point for plugging the full layout engine into a
headless markdown converter. The default heuristic page splitter in
toMarkdownPaged stays the production path because Word's pre-baked
pagination (renderedPageBreakBefore) is more accurate than re-running
layout in Node where the doc's authoring fonts aren't available.
Correctness:
- Break index.ts ↔ paged.ts import cycle by moving isDocument and
  badInputError into a new input.ts module.
- Tighten isDocument to reject `package.document === null` (was passing
  because `typeof null === 'object'`).
- countRowSpan now aligns vMerge cells by cumulative grid column, not
  array index, so vertical merges below a horizontal merge in a prior
  row resolve correctly.
- renderHtmlInline handles bookmark/comment-range/move-range markers
  and inlineSdt content. Previously these silently dropped.
- Async substitution now also rewrites HTML `<img>` references, not just
  markdown image syntax. Images inside HTML-mode table cells now flow
  through the user's imageHandler.
- Empty heading paragraphs render as empty string instead of a bare `#`.
- Quote style detection: exact match on `Quote` / `IntenseQuote` styles
  (no longer accidentally matching `NoQuote` or `BlockQuoteCustom`).
- Multi-section header fallback: only fire the package-map fallback
  when there's exactly one header in the doc. Multi-section docs
  without proper refs no longer surface the wrong section's header.

Simplification:
- Drop `RenderTableOptions` interface — never passed; row-range slicing
  was reserved for a layout-driven mode that never landed.
- Drop `manualBase64Encode` + alphabet (~30 lines). Buffer/btoa cover
  every supported runtime.
- Extract shared trailers (footnotes / hyperlink refs / comments
  sidecar) into trailers.ts. Was duplicated between index.ts and
  paged.ts.
- Extract `newContext` into context.ts. Both entry points now use the
  same builder.
- Drop `footnotes: 'inline'` from public types — documented but never
  implemented. Pre-1.0 cleanup.
- Collapse the 5 wrap functions in annotations.ts to a single helper
  parameterized by tag/class/mode.
- Drop dead `commentById`, `void cellIdx`, `as Run` casts after
  type-narrowing, `(item as { content?: ParagraphContent[] })` paranoia.
- Route all renderer warnings through `pushWarning` so the documented
  "deduped" contract is actually true.

Tests:
- All 32 unit tests still pass.
- Test for inline comment wrapper updated to be attribute-order-
  agnostic so future attribute reorderings don't break it.
…ignals

Programmatic DOCX files (template tools, freshly generated reports) often
ship without `renderedPageBreakBefore`, `<w:br type="page"/>`, section
breaks, or `pageBreakBefore` paragraph properties. The heuristic page
splitter has nothing to act on and the whole body lands on a single
page.

That's the correct mathematical answer, but it surprises callers who
expect a multi-page result. Detect the case: when the body has more
than ~25 paragraphs and zero pagination signals fired, push a warning
explaining what happened and suggesting the two fixes (open in Word and
resave to bake in the cache, or use toMarkdown for continuous output).

Short docs (< 25 paragraphs) don't warn — a 3-line memo really is one
page. Docs with any signal don't warn — the splitter worked.

2 new tests cover the warn / don't-warn boundaries.
The heuristic page splitter relies on Word's pre-baked pagination cache
(`renderedPageBreakBefore`) plus explicit break signals. Docs that have
never been opened in Word — programmatically generated, fresh template
output — carry none of these and collapse to a single page.

This wires the existing core layout engine as an opt-in fallback so the
same Document → ProseDoc → FlowBlock → measureBlocks → layoutDocument
pipeline that paginates the live editor in the browser also paginates
headless markdown output in Node and Bun.

New option on `PagedMarkdownOptions`: `useLayoutEngine: boolean |
'fallback'`. Only honored by the async paged variant
(`toMarkdownPagedAsync`) because the layout engine needs a Canvas2D
context. In the browser the DOM provides one; in Node we lazy-import
`@napi-rs/canvas` as an optional peer dep. Falls back silently to the
heuristic on any failure (canvas unavailable, layout error).

`useLayoutEngine: 'fallback'` runs the heuristic first and only
re-paginates when the heuristic produced a single page on a substantial
doc — the recommended production setting. `useLayoutEngine: true`
always runs the layout engine.

Caveats:
- Won't byte-match Word; expect ±1 page on cacheless docs because
  Word's renderer differs from layout-engine + Skia metrics.
- @napi-rs/canvas is a native binary; some serverless runtimes (CF
  Workers) won't load it. The fallback degrades gracefully.

Also:
- `imageHandler` is now optional on `toMarkdownAsync` /
  `toMarkdownPagedAsync` so callers can use the async variants purely
  for layout-engine pagination without supplying an image handler.
- New `renderFromGroups` helper in paged.ts so both the heuristic
  (groups from splitIntoPages) and the fallback (groups from
  headlessLayout.computePagedGroups) share rendering code.
- Bumped agents tsconfig lib to include `DOM.Iterable` — pre-existing
  files use iterator syntax on NodeList that now reaches the agents
  package via the new fallback import chain.

Test: 60-paragraph cacheless doc went from 1 page (heuristic) to 4
pages (layout fallback). Cached doc still gives 6 pages because the
fallback only triggers when the heuristic indicates no pagination
signals fired.
…ss layout

The browser editor's layout engine produces Word-matching pagination
because Chromium loads Google's Croscore Office-font substitutes
(Carlito for Calibri, Caladea for Cambria, Arimo for Arial, Tinos for
Times New Roman, Cousine for Courier New) via `<link>` injection at
parse time, and `buildFontString`'s CSS cascade resolves to those
substitutes when measureText runs.

In Node + Bun, none of that happens — substituted fonts aren't
available, the cascade falls through to whatever Skia picks as default,
and measurements diverge. Even adding `@napi-rs/canvas` doesn't fix it
because Skia has no Office fonts and no Croscore substitutes
out-of-the-box.

New `officeFonts.ts` module: on first `computePagedGroups` call, download
five Croscore TTFs from canonical GitHub URLs into a temp-dir cache, then
register each TTF under both its real name AND every Office name it
substitutes for (Carlito → Calibri, Calibri Light, Aptos, Aptos Display;
Arimo → Arial, Helvetica; Tinos → Times New Roman, Georgia; etc.).

That makes the headless layout engine produce stable, consistent
metrics across runs and machines.

Caveats (called out in the JSDoc):
- Skia and Blink still compute glyph kerning/bearings slightly
  differently, so the layout engine in Node can diverge by ~1 page from
  the same code in the browser on the same doc. The heuristic remains
  the gold standard when Word has touched the doc (matches Word
  exactly via renderedPageBreakBefore cache); the layout engine
  fallback is for cache-less docs where "approximate" beats
  "one page".
- Download fails silently in offline / sandboxed environments; the
  cascade then falls back to system defaults with a quality hit.

Tests: 35/35 pass. Tested end-to-end with the user's 67-paragraph
template.docx (heuristic still 6 pages matching Word; layout engine
forced produces 7 ±1 with substitutes, consistent across runs).
Spent a session investigating why the layout-engine fallback in Node
diverges by 1 page from Word's pagination on a specific cache-less
doc. Findings:

- Word's cache (renderedPageBreakBefore) gives exact pagination.
  Heuristic reads it directly: byte-exact for cache-bearing docs.
- Browser editor runs the same layout engine and matches Word, but
  it benefits from Chromium's tightly integrated Skia + HarfBuzz +
  Blink line-break logic, which approximates Word's text shaper.
- Pure @napi-rs/canvas (vanilla Skia) measures line widths slightly
  differently than Chromium. Per-line deltas of a few percent
  accumulate over a 60+ paragraph doc and can tip content across
  page boundaries. On the test doc, 19 blocks measured 1152px in
  Node vs ~978px in Word/browser — a 15% gap consistent with
  text-shaping differences.

Bridging that gap headlessly requires either:
- Headless Chrome (puppeteer): expensive, heavy install.
- A from-scratch text shaper matching Word's metrics: out of scope.

Updated the JSDoc on `useLayoutEngine` to describe this honestly:
caveat names the actual root cause (text shaping, not glyph metrics),
recommends 'fallback' as the production setting, and points at the
two known paths to exact matching (open-and-resave in Word, or
headless Chrome).
Ran code-review, simplifier, and DX agents. Acted on the high-impact findings.

Correctness:
- headlessLayout.computePagedGroups was misaligned for docs with anchored
  textboxes (toProseDoc emits multiple PM nodes per source paragraph;
  the old parallel-walk assumed 1:1 mapping). Rewrote to map by
  ProseMirror positions: snapshot each source body block's pmStart from
  toProseDoc's output and bisect for each fragment's pmStart at render
  time. Handles textboxes, BlockSdt, multi-node-per-source cases.
- canvasReady and registerOnce memos no longer poison the process on
  failure. A transient network glitch during font download is retried
  on the next call instead of permanently disabling the layout engine.
- registerOfficeSubstitutes returns success only when at least one font
  actually registered.
- Canvas shrunk from 2000x2000 to 1x1 — measurement doesn't paint.

Simplifications:
- Merged input.ts + context.ts + trailers.ts into one internals.ts
  (each was sub-50 lines, all consumed only by index.ts/paged.ts/
  async.ts). Markdown dir goes 16 → 14 files.
- Dropped MarkdownOptionsBase from public type exports (consumers use
  MarkdownOptions / PagedMarkdownOptions; the base interface is an
  internal building block).
- Dropped `footnotes: 'end'` single-value option from the public types.
  Reserved for future expansion; documenting placement in prose for now.
- Named magic numbers: DEFAULT_PAGE_WIDTH_TWIPS, DEFAULT_PAGE_HEIGHT_TWIPS,
  DEFAULT_MARGIN_TWIPS.

DX:
- Less aggressive `<` escape: only escapes plausible HTML tag starts
  (`<\/?[A-Za-z][\\w-]*(?=[\\s/>])`). Prose like "x < y" stays untouched.
- badInputError special-cases strings: when a caller passes a filename,
  the error message points them at fs.readFile().
- Core README leads with DOCX→Markdown (was leading with editor).
  Added a "Which function do I want?" decision table and a comparison
  table vs Pandoc/mammoth.
- Server-side pagination section names the @napi-rs/canvas peer dep
  install command up front.
- Root README markdown section gets a real code snippet (was just
  prose).

Tests: 35/35 pass.
…rface list

CI was failing two exports-map invariants:
- typesVersions had no `markdown` key, so consumers on TS <4.7 wouldn't
  resolve types for `@eigenpal/docx-editor-core/markdown`.
- The "curated surface" test pins the full set of approved subpaths;
  `./markdown` wasn't on the list and tripped the silent-growth guard.

Add the typesVersions entry next to `agent` (the closest analog: a
non-experimental headless subpath) and add `./markdown` to the approved
set in the test fixture.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant