feat(core): DOCX to Markdown converter#595
Open
jedrazb wants to merge 12 commits into
Open
Conversation
…markdown
Four entry functions, sync from a parsed Document, async from raw bytes:
toMarkdown continuous markdown
toMarkdownPaged one entry per Word page + a `combined` string
with `<!-- page N -->` separators
toMarkdownAsync toMarkdown + per-image handler callback
toMarkdownPagedAsync toMarkdownPaged + per-image handler callback
Page boundaries come from Word's pre-baked pagination hints
(renderedPageBreakBefore, explicit `w:br type="page"`, section breaks of
type nextPage/evenPage/oddPage). No canvas dep.
Options: annotations (html | pandoc | strip), trackedChanges (clean |
annotate), comments (strip | inline | sidecar), hyperlinks (inline |
reference), footnotes (inline | end), headerFooter (paged-only: strip |
first-page | all), plus a custom imagePath(info) callback. Images are
returned as a Map<virtualPath, ImageRef> so callers decide what to do
with bytes (data URL, blob upload, vision-model description, drop).
A live drop-a-docx playground ships at examples/markdown-playground:
bun run --filter './examples/markdown-playground' dev
|
The latest updates on your projects. Learn more about Vercel for GitHub.
|
Contributor
|
All contributors have signed the CLA ✍️ ✅ Posted by the CLA bot. |
GFM tables don't support colspan/rowspan, so merged-cell and nested-table DOCX content previously lost structure (one warning per limitation, content squashed into the first cell). Detection now runs per-table: a table with any `gridSpan > 1`, `vMerge`, or nested `<w:tbl>` falls back to inline `<table>` with proper `colspan`/`rowspan`. Simple tables stay GFM. Inline marks inside HTML cells use HTML tags (`<strong>`, `<em>`, `<code>`, `<s>`, `<u>`, `<a>`) because markdown is not parsed inside HTML blocks. Bookmark markers, field chrome, and soft hyphens drop as expected. Hyperlinks resolve `href` or `#anchor` the same way GFM does. Net effect on `screenshots/demo.docx`: 26 -> 0 warnings, all merged-cell and nested-table content now renders correctly on GitHub/Notion/etc.
32 focused bun:test cases under packages/core/src/markdown/markdown.test.ts.
Tests build Document literals directly so each case pins one renderer
decision without depending on the parser or any .docx fixture.
Coverage:
- basics: Document input, paragraph join, empty doc warning, bad-input error
- inline marks: bold/italic/strike, mark nesting, monospace whitelist
- block structure: heading levels from styleId, bullet vs ordered lists
- hyperlinks: inline + reference modes, orphan-link warning
- tracked changes: annotate/clean modes, pandoc annotations
- comments: inline/strip/sidecar modes
- tables: GFM path, HTML fallback for gridSpan, rowspan from vMerge,
nested table recursion
- paged: renderedPageBreakBefore, sectionStart, combined separator,
empty-doc returns zero pages, header/footer emission
- images: media-path dedupe, async substitution, custom imagePath,
handler-error warning
Fix a latent bug found by the nested-table test: a nested table inside
an HTML cell was falling through to the GFM renderer, which doesn't
render inside HTML blocks. Force HTML rendering for nested tables in
HTML cells.
Found in the wild on a 6-page banking template (template (1).docx). 1. Empty marker paragraphs caused double-counting. Word's idiom for an explicit page break is an empty paragraph whose only run is a `<w:br type="page"/>`, plus `renderedPageBreakBefore` on both that paragraph AND the next one. The splitter counted both signals, turning a 6-page doc into 11 pages. Detect "pure break paragraphs" (no visible text, only a page-break run), consume them as transitions, and drop them from the output. 2. Header/footer never emitted. `resolveHeaderFooter` only looked at `Section.headers` (the HeaderFooterType-keyed map), which is empty in practice. The Word- authored layout puts headers in `DocxPackage.headers` keyed by rId, with the section's `headerReferences` mapping HeaderFooterType to rId. Resolve via the reference chain, with the package map as a last-resort fallback. Same for footers. 3. Header/footer images failed to resolve. Headers have their own relationships file (word/_rels/header1.xml.rels) whose rIds do NOT live in `pkg.relationships`. The parser already inlines the image into `image.src` as a data URL for this exact reason. Add a fallback in the drawing renderer: if the rels lookup misses but `image.src` is set, emit `` (or `<img src>` in HTML cells). 4. Markdown inside `<header>`/`<footer>` wasn't parsed. CommonMark and GFM stop parsing markdown at HTML block boundaries, so `<header>**Bold**</header>` rendered as literal `**Bold**`. Use blank lines between the wrapper tag and content to make each side parse as its own HTML block; the inner content then parses as normal markdown. Playground: - Default method is `paged` (the more interesting case on load). - Conversion re-runs on every option change. The parser detaches the ArrayBuffer it consumes, so re-runs were silently reading from a consumed view. Clone a fresh `Uint8Array` per call. - `Rendered`/`Raw` toggle in the markdown pane using `react-markdown` + `remark-gfm` + `rehype-raw`. Styles match a clean Word-doc look.
Lets non-DOM runtimes (Node, Bun, workers) inject a Canvas2D-compatible context (`@napi-rs/canvas`, `canvaskit-wasm`, etc.) into the measurement module. Internal layout primitives unchanged; just adds the injection point + re-exports it. Power-user extension point for plugging the full layout engine into a headless markdown converter. The default heuristic page splitter in toMarkdownPaged stays the production path because Word's pre-baked pagination (renderedPageBreakBefore) is more accurate than re-running layout in Node where the doc's authoring fonts aren't available.
Correctness:
- Break index.ts ↔ paged.ts import cycle by moving isDocument and
badInputError into a new input.ts module.
- Tighten isDocument to reject `package.document === null` (was passing
because `typeof null === 'object'`).
- countRowSpan now aligns vMerge cells by cumulative grid column, not
array index, so vertical merges below a horizontal merge in a prior
row resolve correctly.
- renderHtmlInline handles bookmark/comment-range/move-range markers
and inlineSdt content. Previously these silently dropped.
- Async substitution now also rewrites HTML `<img>` references, not just
markdown image syntax. Images inside HTML-mode table cells now flow
through the user's imageHandler.
- Empty heading paragraphs render as empty string instead of a bare `#`.
- Quote style detection: exact match on `Quote` / `IntenseQuote` styles
(no longer accidentally matching `NoQuote` or `BlockQuoteCustom`).
- Multi-section header fallback: only fire the package-map fallback
when there's exactly one header in the doc. Multi-section docs
without proper refs no longer surface the wrong section's header.
Simplification:
- Drop `RenderTableOptions` interface — never passed; row-range slicing
was reserved for a layout-driven mode that never landed.
- Drop `manualBase64Encode` + alphabet (~30 lines). Buffer/btoa cover
every supported runtime.
- Extract shared trailers (footnotes / hyperlink refs / comments
sidecar) into trailers.ts. Was duplicated between index.ts and
paged.ts.
- Extract `newContext` into context.ts. Both entry points now use the
same builder.
- Drop `footnotes: 'inline'` from public types — documented but never
implemented. Pre-1.0 cleanup.
- Collapse the 5 wrap functions in annotations.ts to a single helper
parameterized by tag/class/mode.
- Drop dead `commentById`, `void cellIdx`, `as Run` casts after
type-narrowing, `(item as { content?: ParagraphContent[] })` paranoia.
- Route all renderer warnings through `pushWarning` so the documented
"deduped" contract is actually true.
Tests:
- All 32 unit tests still pass.
- Test for inline comment wrapper updated to be attribute-order-
agnostic so future attribute reorderings don't break it.
…ignals Programmatic DOCX files (template tools, freshly generated reports) often ship without `renderedPageBreakBefore`, `<w:br type="page"/>`, section breaks, or `pageBreakBefore` paragraph properties. The heuristic page splitter has nothing to act on and the whole body lands on a single page. That's the correct mathematical answer, but it surprises callers who expect a multi-page result. Detect the case: when the body has more than ~25 paragraphs and zero pagination signals fired, push a warning explaining what happened and suggesting the two fixes (open in Word and resave to bake in the cache, or use toMarkdown for continuous output). Short docs (< 25 paragraphs) don't warn — a 3-line memo really is one page. Docs with any signal don't warn — the splitter worked. 2 new tests cover the warn / don't-warn boundaries.
The heuristic page splitter relies on Word's pre-baked pagination cache (`renderedPageBreakBefore`) plus explicit break signals. Docs that have never been opened in Word — programmatically generated, fresh template output — carry none of these and collapse to a single page. This wires the existing core layout engine as an opt-in fallback so the same Document → ProseDoc → FlowBlock → measureBlocks → layoutDocument pipeline that paginates the live editor in the browser also paginates headless markdown output in Node and Bun. New option on `PagedMarkdownOptions`: `useLayoutEngine: boolean | 'fallback'`. Only honored by the async paged variant (`toMarkdownPagedAsync`) because the layout engine needs a Canvas2D context. In the browser the DOM provides one; in Node we lazy-import `@napi-rs/canvas` as an optional peer dep. Falls back silently to the heuristic on any failure (canvas unavailable, layout error). `useLayoutEngine: 'fallback'` runs the heuristic first and only re-paginates when the heuristic produced a single page on a substantial doc — the recommended production setting. `useLayoutEngine: true` always runs the layout engine. Caveats: - Won't byte-match Word; expect ±1 page on cacheless docs because Word's renderer differs from layout-engine + Skia metrics. - @napi-rs/canvas is a native binary; some serverless runtimes (CF Workers) won't load it. The fallback degrades gracefully. Also: - `imageHandler` is now optional on `toMarkdownAsync` / `toMarkdownPagedAsync` so callers can use the async variants purely for layout-engine pagination without supplying an image handler. - New `renderFromGroups` helper in paged.ts so both the heuristic (groups from splitIntoPages) and the fallback (groups from headlessLayout.computePagedGroups) share rendering code. - Bumped agents tsconfig lib to include `DOM.Iterable` — pre-existing files use iterator syntax on NodeList that now reaches the agents package via the new fallback import chain. Test: 60-paragraph cacheless doc went from 1 page (heuristic) to 4 pages (layout fallback). Cached doc still gives 6 pages because the fallback only triggers when the heuristic indicates no pagination signals fired.
…ss layout The browser editor's layout engine produces Word-matching pagination because Chromium loads Google's Croscore Office-font substitutes (Carlito for Calibri, Caladea for Cambria, Arimo for Arial, Tinos for Times New Roman, Cousine for Courier New) via `<link>` injection at parse time, and `buildFontString`'s CSS cascade resolves to those substitutes when measureText runs. In Node + Bun, none of that happens — substituted fonts aren't available, the cascade falls through to whatever Skia picks as default, and measurements diverge. Even adding `@napi-rs/canvas` doesn't fix it because Skia has no Office fonts and no Croscore substitutes out-of-the-box. New `officeFonts.ts` module: on first `computePagedGroups` call, download five Croscore TTFs from canonical GitHub URLs into a temp-dir cache, then register each TTF under both its real name AND every Office name it substitutes for (Carlito → Calibri, Calibri Light, Aptos, Aptos Display; Arimo → Arial, Helvetica; Tinos → Times New Roman, Georgia; etc.). That makes the headless layout engine produce stable, consistent metrics across runs and machines. Caveats (called out in the JSDoc): - Skia and Blink still compute glyph kerning/bearings slightly differently, so the layout engine in Node can diverge by ~1 page from the same code in the browser on the same doc. The heuristic remains the gold standard when Word has touched the doc (matches Word exactly via renderedPageBreakBefore cache); the layout engine fallback is for cache-less docs where "approximate" beats "one page". - Download fails silently in offline / sandboxed environments; the cascade then falls back to system defaults with a quality hit. Tests: 35/35 pass. Tested end-to-end with the user's 67-paragraph template.docx (heuristic still 6 pages matching Word; layout engine forced produces 7 ±1 with substitutes, consistent across runs).
Spent a session investigating why the layout-engine fallback in Node diverges by 1 page from Word's pagination on a specific cache-less doc. Findings: - Word's cache (renderedPageBreakBefore) gives exact pagination. Heuristic reads it directly: byte-exact for cache-bearing docs. - Browser editor runs the same layout engine and matches Word, but it benefits from Chromium's tightly integrated Skia + HarfBuzz + Blink line-break logic, which approximates Word's text shaper. - Pure @napi-rs/canvas (vanilla Skia) measures line widths slightly differently than Chromium. Per-line deltas of a few percent accumulate over a 60+ paragraph doc and can tip content across page boundaries. On the test doc, 19 blocks measured 1152px in Node vs ~978px in Word/browser — a 15% gap consistent with text-shaping differences. Bridging that gap headlessly requires either: - Headless Chrome (puppeteer): expensive, heavy install. - A from-scratch text shaper matching Word's metrics: out of scope. Updated the JSDoc on `useLayoutEngine` to describe this honestly: caveat names the actual root cause (text shaping, not glyph metrics), recommends 'fallback' as the production setting, and points at the two known paths to exact matching (open-and-resave in Word, or headless Chrome).
Ran code-review, simplifier, and DX agents. Acted on the high-impact findings. Correctness: - headlessLayout.computePagedGroups was misaligned for docs with anchored textboxes (toProseDoc emits multiple PM nodes per source paragraph; the old parallel-walk assumed 1:1 mapping). Rewrote to map by ProseMirror positions: snapshot each source body block's pmStart from toProseDoc's output and bisect for each fragment's pmStart at render time. Handles textboxes, BlockSdt, multi-node-per-source cases. - canvasReady and registerOnce memos no longer poison the process on failure. A transient network glitch during font download is retried on the next call instead of permanently disabling the layout engine. - registerOfficeSubstitutes returns success only when at least one font actually registered. - Canvas shrunk from 2000x2000 to 1x1 — measurement doesn't paint. Simplifications: - Merged input.ts + context.ts + trailers.ts into one internals.ts (each was sub-50 lines, all consumed only by index.ts/paged.ts/ async.ts). Markdown dir goes 16 → 14 files. - Dropped MarkdownOptionsBase from public type exports (consumers use MarkdownOptions / PagedMarkdownOptions; the base interface is an internal building block). - Dropped `footnotes: 'end'` single-value option from the public types. Reserved for future expansion; documenting placement in prose for now. - Named magic numbers: DEFAULT_PAGE_WIDTH_TWIPS, DEFAULT_PAGE_HEIGHT_TWIPS, DEFAULT_MARGIN_TWIPS. DX: - Less aggressive `<` escape: only escapes plausible HTML tag starts (`<\/?[A-Za-z][\\w-]*(?=[\\s/>])`). Prose like "x < y" stays untouched. - badInputError special-cases strings: when a caller passes a filename, the error message points them at fs.readFile(). - Core README leads with DOCX→Markdown (was leading with editor). Added a "Which function do I want?" decision table and a comparison table vs Pandoc/mammoth. - Server-side pagination section names the @napi-rs/canvas peer dep install command up front. - Root README markdown section gets a real code snippet (was just prose). Tests: 35/35 pass.
…rface list CI was failing two exports-map invariants: - typesVersions had no `markdown` key, so consumers on TS <4.7 wouldn't resolve types for `@eigenpal/docx-editor-core/markdown`. - The "curated surface" test pins the full set of approved subpaths; `./markdown` wasn't on the list and tripped the silent-growth guard. Add the typesVersions entry next to `agent` (the closest analog: a non-experimental headless subpath) and add `./markdown` to the approved set in the test fixture.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
New subpath
@eigenpal/docx-editor-core/markdownwith four entry functions and a drop-a-docx playground.toMarkdownis one string;toMarkdownPagedreturns one entry per Word page plus acombinedstring with<!-- page N -->separators. Page boundaries come from Word's pre-baked pagination hints (renderedPageBreakBefore, explicitw:br type="page", section breaks of typenextPage/evenPage/oddPage). No canvas dep.toMarkdownAsyncandtoMarkdownPagedAsyncwrap the base functions and await animageHandler(ref, ctx)callback per image. The handler's return string substitutes the defaultreference. Errors per image push a warning, never throw.annotations(html|pandoc|strip) for things markdown can't express (comments, tracked changes, headers/footers).trackedChanges(clean|annotate).comments(strip|inline|sidecar).hyperlinks(inline|reference).footnotes(inline|end).headerFooter(paged only:strip|first-page|all). CustomimagePath(info)for naming. Defaults are fidelity-first.Map<virtualPath, ImageRef>with raw bytes + base64 + data URL + metadata. Callers decide what to do (data URL, blob upload, vision-model description, drop). A media file referenced N times is registered once.examples/markdown-playground: drop a.docx, see rendered Word on the left, markdown on the right, toggle every option live.Test plan
bun run typecheckclean across the workspacebun run formatcleanbun run check:parity-contractpasses (no React/Vue surface touched)bun run api:extractupdated;markdown.api.mdsnapshot committedexamples/shared/sample.docx,screenshots/demo.docx,screenshots/pr320/multi-section.docxfor continuous, paged, async-with-image-handler, and custom-imagePathformattingspecs pass (no regression in the editor pipeline)Run the playground locally: