@pallaprolus

Fixes #68

This PR addresses the issue where partially numbered lists (common in MasterFormat documents, e.g., .1, .2) are extracted as plain text lines indistinguishable from regular paragraphs.

Changes:

  • Adds a lightweight regex post-processing step in _pdf_converter.py that identifies lines starting with a dot followed by a number (e.g., .1) and converts them into Markdown list items (e.g., - .1).
  • This keeps the solution dependency-free and lightweight as requested by maintainers.
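
For readers who want to see the idea in code, here is a minimal sketch of this kind of post-processing step. It is not the code from this PR: the function name `convert_partial_numbering`, the constant `PARTIAL_NUMBER_RE`, and the exact pattern are assumptions made for illustration, and the real change in _pdf_converter.py may differ.

```python
import re

# Hypothetical pattern (an assumption, not taken from the PR):
# optional indentation, a dot, one or more digits, whitespace, then item text.
PARTIAL_NUMBER_RE = re.compile(r"^(\s*)(\.\d+)\s+(\S.*)$")


def convert_partial_numbering(text: str) -> str:
    """Prefix lines such as '.1 Item' with '- ' so they render as Markdown list items."""
    converted = []
    for line in text.split("\n"):
        match = PARTIAL_NUMBER_RE.match(line)
        if match:
            indent, marker, body = match.groups()
            # Keep the original marker (.1, .2, ...) so the numbering stays visible.
            converted.append(f"{indent}- {marker} {body}")
        else:
            # Lines that do not start with a partial number are left unchanged.
            converted.append(line)
    return "\n".join(converted)


if __name__ == "__main__":
    sample = "Part 1 General\n.1 Item\n.2 Another item\nRegular paragraph."
    print(convert_partial_numbering(sample))
```

Requiring whitespace and text after the number keeps the sketch from rewriting lines that merely contain a decimal such as 0.5, or a bare .5 on its own line. Whether the PR makes the same trade-off is not stated here.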

Verification:

  • Verified that lines like .1 Item are now converted to - .1 Item.
  • Ran standard tests to ensure no regressions.

@pallaprolus force-pushed the fix/issue-68-pdf-lists branch from 87d7b54 to 76a674a on December 30, 2025 15:33
Development

Successfully merging this pull request may close these issues.

PDF parsing doesn't support partially numbered lists
