feat(core): DOCX to Markdown converter by jedrazb · Pull Request #595 · eigenpal/docx-editor

jedrazb · 2026-05-24T15:19:07Z

Summary

New subpath @eigenpal/docx-editor-core/markdown with four entry functions and a drop-a-docx playground.

import { toMarkdown, toMarkdownPaged, toMarkdownAsync } from '@eigenpal/docx-editor-core/markdown';

const { markdown, images, warnings } = await toMarkdown(buffer);
const { pages, combined } = await toMarkdownPaged(buffer);
const result = await toMarkdownAsync(buffer, {
  imageHandler: async (ref) => `![${await describe(ref.base64)}]()`,
});

Continuous and paged. toMarkdown is one string; toMarkdownPaged returns one entry per Word page plus a combined string with  separators. Page boundaries come from Word's pre-baked pagination hints (renderedPageBreakBefore, explicit w:br type="page", section breaks of type nextPage/evenPage/oddPage). No canvas dep.
Async image handlers. toMarkdownAsync and toMarkdownPagedAsync wrap the base functions and await an imageHandler(ref, ctx) callback per image. The handler's return string substitutes the default ![alt](./images/...) reference. Errors per image push a warning, never throw.
Options. annotations (html | pandoc | strip) for things markdown can't express (comments, tracked changes, headers/footers). trackedChanges (clean | annotate). comments (strip | inline | sidecar). hyperlinks (inline | reference). footnotes (inline | end). headerFooter (paged only: strip | first-page | all). Custom imagePath(info) for naming. Defaults are fidelity-first.
Images. Returned as Map<virtualPath, ImageRef> with raw bytes + base64 + data URL + metadata. Callers decide what to do (data URL, blob upload, vision-model description, drop). A media file referenced N times is registered once.
Playground. examples/markdown-playground: drop a .docx, see rendered Word on the left, markdown on the right, toggle every option live.

Test plan

bun run typecheck clean across the workspace
bun run format clean
bun run check:parity-contract passes (no React/Vue surface touched)
bun run api:extract updated; markdown.api.md snapshot committed
Smoke tested against examples/shared/sample.docx, screenshots/demo.docx, screenshots/pr320/multi-section.docx for continuous, paged, async-with-image-handler, and custom-imagePath
72/72 playwright formatting specs pass (no regression in the editor pipeline)
Visual verification of the playground in Chrome (continuous + paged, all option dropdowns)

Run the playground locally:

bun run --filter './examples/markdown-playground' dev
# http://localhost:5180

…markdown Four entry functions, sync from a parsed Document, async from raw bytes: toMarkdown continuous markdown toMarkdownPaged one entry per Word page + a `combined` string with `` separators toMarkdownAsync toMarkdown + per-image handler callback toMarkdownPagedAsync toMarkdownPaged + per-image handler callback Page boundaries come from Word's pre-baked pagination hints (renderedPageBreakBefore, explicit `w:br type="page"`, section breaks of type nextPage/evenPage/oddPage). No canvas dep. Options: annotations (html | pandoc | strip), trackedChanges (clean | annotate), comments (strip | inline | sidecar), hyperlinks (inline | reference), footnotes (inline | end), headerFooter (paged-only: strip | first-page | all), plus a custom imagePath(info) callback. Images are returned as a Map<virtualPath, ImageRef> so callers decide what to do with bytes (data URL, blob upload, vision-model description, drop). A live drop-a-docx playground ships at examples/markdown-playground: bun run --filter './examples/markdown-playground' dev

vercel · 2026-05-24T15:19:18Z

The latest updates on your projects. Learn more about Vercel for GitHub.

Project	Deployment	Actions	Updated (UTC)
docx-editor	Error		May 25, 2026 8:13am

eigenpal-release-pal · 2026-05-24T15:19:31Z

All contributors have signed the CLA ✍️ ✅

_{Posted by the CLA bot.}

GFM tables don't support colspan/rowspan, so merged-cell and nested-table DOCX content previously lost structure (one warning per limitation, content squashed into the first cell). Detection now runs per-table: a table with any `gridSpan > 1`, `vMerge`, or nested `<w:tbl>` falls back to inline `<table>` with proper `colspan`/`rowspan`. Simple tables stay GFM. Inline marks inside HTML cells use HTML tags (`<strong>`, `<em>`, `<code>`, `<s>`, `<u>`, `<a>`) because markdown is not parsed inside HTML blocks. Bookmark markers, field chrome, and soft hyphens drop as expected. Hyperlinks resolve `href` or `#anchor` the same way GFM does. Net effect on `screenshots/demo.docx`: 26 -> 0 warnings, all merged-cell and nested-table content now renders correctly on GitHub/Notion/etc.

32 focused bun:test cases under packages/core/src/markdown/markdown.test.ts. Tests build Document literals directly so each case pins one renderer decision without depending on the parser or any .docx fixture. Coverage: - basics: Document input, paragraph join, empty doc warning, bad-input error - inline marks: bold/italic/strike, mark nesting, monospace whitelist - block structure: heading levels from styleId, bullet vs ordered lists - hyperlinks: inline + reference modes, orphan-link warning - tracked changes: annotate/clean modes, pandoc annotations - comments: inline/strip/sidecar modes - tables: GFM path, HTML fallback for gridSpan, rowspan from vMerge, nested table recursion - paged: renderedPageBreakBefore, sectionStart, combined separator, empty-doc returns zero pages, header/footer emission - images: media-path dedupe, async substitution, custom imagePath, handler-error warning Fix a latent bug found by the nested-table test: a nested table inside an HTML cell was falling through to the GFM renderer, which doesn't render inside HTML blocks. Force HTML rendering for nested tables in HTML cells.

Found in the wild on a 6-page banking template (template (1).docx). 1. Empty marker paragraphs caused double-counting. Word's idiom for an explicit page break is an empty paragraph whose only run is a `<w:br type="page"/>`, plus `renderedPageBreakBefore` on both that paragraph AND the next one. The splitter counted both signals, turning a 6-page doc into 11 pages. Detect "pure break paragraphs" (no visible text, only a page-break run), consume them as transitions, and drop them from the output. 2. Header/footer never emitted. `resolveHeaderFooter` only looked at `Section.headers` (the HeaderFooterType-keyed map), which is empty in practice. The Word- authored layout puts headers in `DocxPackage.headers` keyed by rId, with the section's `headerReferences` mapping HeaderFooterType to rId. Resolve via the reference chain, with the package map as a last-resort fallback. Same for footers. 3. Header/footer images failed to resolve. Headers have their own relationships file (word/_rels/header1.xml.rels) whose rIds do NOT live in `pkg.relationships`. The parser already inlines the image into `image.src` as a data URL for this exact reason. Add a fallback in the drawing renderer: if the rels lookup misses but `image.src` is set, emit `![alt](src)` (or `<img src>` in HTML cells). 4. Markdown inside `<header>`/`<footer>` wasn't parsed. CommonMark and GFM stop parsing markdown at HTML block boundaries, so `<header>**Bold**</header>` rendered as literal `**Bold**`. Use blank lines between the wrapper tag and content to make each side parse as its own HTML block; the inner content then parses as normal markdown. Playground: - Default method is `paged` (the more interesting case on load). - Conversion re-runs on every option change. The parser detaches the ArrayBuffer it consumes, so re-runs were silently reading from a consumed view. Clone a fresh `Uint8Array` per call. - `Rendered`/`Raw` toggle in the markdown pane using `react-markdown` + `remark-gfm` + `rehype-raw`. Styles match a clean Word-doc look.

Lets non-DOM runtimes (Node, Bun, workers) inject a Canvas2D-compatible context (`@napi-rs/canvas`, `canvaskit-wasm`, etc.) into the measurement module. Internal layout primitives unchanged; just adds the injection point + re-exports it. Power-user extension point for plugging the full layout engine into a headless markdown converter. The default heuristic page splitter in toMarkdownPaged stays the production path because Word's pre-baked pagination (renderedPageBreakBefore) is more accurate than re-running layout in Node where the doc's authoring fonts aren't available.

Correctness: - Break index.ts ↔ paged.ts import cycle by moving isDocument and badInputError into a new input.ts module. - Tighten isDocument to reject `package.document === null` (was passing because `typeof null === 'object'`). - countRowSpan now aligns vMerge cells by cumulative grid column, not array index, so vertical merges below a horizontal merge in a prior row resolve correctly. - renderHtmlInline handles bookmark/comment-range/move-range markers and inlineSdt content. Previously these silently dropped. - Async substitution now also rewrites HTML `<img>` references, not just markdown image syntax. Images inside HTML-mode table cells now flow through the user's imageHandler. - Empty heading paragraphs render as empty string instead of a bare `#`. - Quote style detection: exact match on `Quote` / `IntenseQuote` styles (no longer accidentally matching `NoQuote` or `BlockQuoteCustom`). - Multi-section header fallback: only fire the package-map fallback when there's exactly one header in the doc. Multi-section docs without proper refs no longer surface the wrong section's header. Simplification: - Drop `RenderTableOptions` interface — never passed; row-range slicing was reserved for a layout-driven mode that never landed. - Drop `manualBase64Encode` + alphabet (~30 lines). Buffer/btoa cover every supported runtime. - Extract shared trailers (footnotes / hyperlink refs / comments sidecar) into trailers.ts. Was duplicated between index.ts and paged.ts. - Extract `newContext` into context.ts. Both entry points now use the same builder. - Drop `footnotes: 'inline'` from public types — documented but never implemented. Pre-1.0 cleanup. - Collapse the 5 wrap functions in annotations.ts to a single helper parameterized by tag/class/mode. - Drop dead `commentById`, `void cellIdx`, `as Run` casts after type-narrowing, `(item as { content?: ParagraphContent[] })` paranoia. - Route all renderer warnings through `pushWarning` so the documented "deduped" contract is actually true. Tests: - All 32 unit tests still pass. - Test for inline comment wrapper updated to be attribute-order- agnostic so future attribute reorderings don't break it.

…ignals Programmatic DOCX files (template tools, freshly generated reports) often ship without `renderedPageBreakBefore`, `<w:br type="page"/>`, section breaks, or `pageBreakBefore` paragraph properties. The heuristic page splitter has nothing to act on and the whole body lands on a single page. That's the correct mathematical answer, but it surprises callers who expect a multi-page result. Detect the case: when the body has more than ~25 paragraphs and zero pagination signals fired, push a warning explaining what happened and suggesting the two fixes (open in Word and resave to bake in the cache, or use toMarkdown for continuous output). Short docs (< 25 paragraphs) don't warn — a 3-line memo really is one page. Docs with any signal don't warn — the splitter worked. 2 new tests cover the warn / don't-warn boundaries.

The heuristic page splitter relies on Word's pre-baked pagination cache (`renderedPageBreakBefore`) plus explicit break signals. Docs that have never been opened in Word — programmatically generated, fresh template output — carry none of these and collapse to a single page. This wires the existing core layout engine as an opt-in fallback so the same Document → ProseDoc → FlowBlock → measureBlocks → layoutDocument pipeline that paginates the live editor in the browser also paginates headless markdown output in Node and Bun. New option on `PagedMarkdownOptions`: `useLayoutEngine: boolean | 'fallback'`. Only honored by the async paged variant (`toMarkdownPagedAsync`) because the layout engine needs a Canvas2D context. In the browser the DOM provides one; in Node we lazy-import `@napi-rs/canvas` as an optional peer dep. Falls back silently to the heuristic on any failure (canvas unavailable, layout error). `useLayoutEngine: 'fallback'` runs the heuristic first and only re-paginates when the heuristic produced a single page on a substantial doc — the recommended production setting. `useLayoutEngine: true` always runs the layout engine. Caveats: - Won't byte-match Word; expect ±1 page on cacheless docs because Word's renderer differs from layout-engine + Skia metrics. - @napi-rs/canvas is a native binary; some serverless runtimes (CF Workers) won't load it. The fallback degrades gracefully. Also: - `imageHandler` is now optional on `toMarkdownAsync` / `toMarkdownPagedAsync` so callers can use the async variants purely for layout-engine pagination without supplying an image handler. - New `renderFromGroups` helper in paged.ts so both the heuristic (groups from splitIntoPages) and the fallback (groups from headlessLayout.computePagedGroups) share rendering code. - Bumped agents tsconfig lib to include `DOM.Iterable` — pre-existing files use iterator syntax on NodeList that now reaches the agents package via the new fallback import chain. Test: 60-paragraph cacheless doc went from 1 page (heuristic) to 4 pages (layout fallback). Cached doc still gives 6 pages because the fallback only triggers when the heuristic indicates no pagination signals fired.

…ss layout The browser editor's layout engine produces Word-matching pagination because Chromium loads Google's Croscore Office-font substitutes (Carlito for Calibri, Caladea for Cambria, Arimo for Arial, Tinos for Times New Roman, Cousine for Courier New) via `<link>` injection at parse time, and `buildFontString`'s CSS cascade resolves to those substitutes when measureText runs. In Node + Bun, none of that happens — substituted fonts aren't available, the cascade falls through to whatever Skia picks as default, and measurements diverge. Even adding `@napi-rs/canvas` doesn't fix it because Skia has no Office fonts and no Croscore substitutes out-of-the-box. New `officeFonts.ts` module: on first `computePagedGroups` call, download five Croscore TTFs from canonical GitHub URLs into a temp-dir cache, then register each TTF under both its real name AND every Office name it substitutes for (Carlito → Calibri, Calibri Light, Aptos, Aptos Display; Arimo → Arial, Helvetica; Tinos → Times New Roman, Georgia; etc.). That makes the headless layout engine produce stable, consistent metrics across runs and machines. Caveats (called out in the JSDoc): - Skia and Blink still compute glyph kerning/bearings slightly differently, so the layout engine in Node can diverge by ~1 page from the same code in the browser on the same doc. The heuristic remains the gold standard when Word has touched the doc (matches Word exactly via renderedPageBreakBefore cache); the layout engine fallback is for cache-less docs where "approximate" beats "one page". - Download fails silently in offline / sandboxed environments; the cascade then falls back to system defaults with a quality hit. Tests: 35/35 pass. Tested end-to-end with the user's 67-paragraph template.docx (heuristic still 6 pages matching Word; layout engine forced produces 7 ±1 with substitutes, consistent across runs).

Spent a session investigating why the layout-engine fallback in Node diverges by 1 page from Word's pagination on a specific cache-less doc. Findings: - Word's cache (renderedPageBreakBefore) gives exact pagination. Heuristic reads it directly: byte-exact for cache-bearing docs. - Browser editor runs the same layout engine and matches Word, but it benefits from Chromium's tightly integrated Skia + HarfBuzz + Blink line-break logic, which approximates Word's text shaper. - Pure @napi-rs/canvas (vanilla Skia) measures line widths slightly differently than Chromium. Per-line deltas of a few percent accumulate over a 60+ paragraph doc and can tip content across page boundaries. On the test doc, 19 blocks measured 1152px in Node vs ~978px in Word/browser — a 15% gap consistent with text-shaping differences. Bridging that gap headlessly requires either: - Headless Chrome (puppeteer): expensive, heavy install. - A from-scratch text shaper matching Word's metrics: out of scope. Updated the JSDoc on `useLayoutEngine` to describe this honestly: caveat names the actual root cause (text shaping, not glyph metrics), recommends 'fallback' as the production setting, and points at the two known paths to exact matching (open-and-resave in Word, or headless Chrome).

Ran code-review, simplifier, and DX agents. Acted on the high-impact findings. Correctness: - headlessLayout.computePagedGroups was misaligned for docs with anchored textboxes (toProseDoc emits multiple PM nodes per source paragraph; the old parallel-walk assumed 1:1 mapping). Rewrote to map by ProseMirror positions: snapshot each source body block's pmStart from toProseDoc's output and bisect for each fragment's pmStart at render time. Handles textboxes, BlockSdt, multi-node-per-source cases. - canvasReady and registerOnce memos no longer poison the process on failure. A transient network glitch during font download is retried on the next call instead of permanently disabling the layout engine. - registerOfficeSubstitutes returns success only when at least one font actually registered. - Canvas shrunk from 2000x2000 to 1x1 — measurement doesn't paint. Simplifications: - Merged input.ts + context.ts + trailers.ts into one internals.ts (each was sub-50 lines, all consumed only by index.ts/paged.ts/ async.ts). Markdown dir goes 16 → 14 files. - Dropped MarkdownOptionsBase from public type exports (consumers use MarkdownOptions / PagedMarkdownOptions; the base interface is an internal building block). - Dropped `footnotes: 'end'` single-value option from the public types. Reserved for future expansion; documenting placement in prose for now. - Named magic numbers: DEFAULT_PAGE_WIDTH_TWIPS, DEFAULT_PAGE_HEIGHT_TWIPS, DEFAULT_MARGIN_TWIPS. DX: - Less aggressive `<` escape: only escapes plausible HTML tag starts (`<\/?[A-Za-z][\\w-]*(?=[\\s/>])`). Prose like "x < y" stays untouched. - badInputError special-cases strings: when a caller passes a filename, the error message points them at fs.readFile(). - Core README leads with DOCX→Markdown (was leading with editor). Added a "Which function do I want?" decision table and a comparison table vs Pandoc/mammoth. - Server-side pagination section names the @napi-rs/canvas peer dep install command up front. - Root README markdown section gets a real code snippet (was just prose). Tests: 35/35 pass.

…rface list CI was failing two exports-map invariants: - typesVersions had no `markdown` key, so consumers on TS <4.7 wouldn't resolve types for `@eigenpal/docx-editor-core/markdown`. - The "curated surface" test pins the full set of approved subpaths; `./markdown` wasn't on the list and tripped the silent-growth guard. Add the typesVersions entry next to `agent` (the closest analog: a non-experimental headless subpath) and add `./markdown` to the approved set in the test fixture.

vercel Bot deployed to Preview May 24, 2026 15:20 View deployment

vercel Bot deployed to Preview May 24, 2026 18:46 View deployment

vercel Bot deployed to Preview May 24, 2026 18:51 View deployment

vercel Bot deployed to Preview May 24, 2026 19:23 View deployment

vercel Bot deployed to Preview May 24, 2026 19:39 View deployment

vercel Bot deployed to Preview May 24, 2026 20:22 View deployment

vercel Bot deployed to Preview May 24, 2026 20:27 View deployment

vercel Bot had a problem deploying to Preview May 25, 2026 05:53 Failure

vercel Bot had a problem deploying to Preview May 25, 2026 06:35 Failure

vercel Bot had a problem deploying to Preview May 25, 2026 07:14 Failure

vercel Bot had a problem deploying to Preview May 25, 2026 07:46 Failure

vercel Bot had a problem deploying to Preview May 25, 2026 08:13 Failure

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(core): DOCX to Markdown converter#595

feat(core): DOCX to Markdown converter#595
jedrazb wants to merge 12 commits into
mainfrom
feat/docx-to-markdown

jedrazb commented May 24, 2026

Uh oh!

vercel Bot commented May 24, 2026 •

edited

Loading

Uh oh!

eigenpal-release-pal Bot commented May 24, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

jedrazb commented May 24, 2026

Summary

Test plan

Uh oh!

vercel Bot commented May 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

eigenpal-release-pal Bot commented May 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

vercel Bot commented May 24, 2026 •

edited

Loading

eigenpal-release-pal Bot commented May 24, 2026 •

edited

Loading