feat: Add DeepSeek-OCR integration #2721
✅ DCO Check Passed Thanks @rafaeltuelho, all your commits are properly signed off. 🎉 |
Merge Protections
Your pull request matches the following merge protections and will not be merged until they are valid.
🟢 Enforce conventional commit
Wonderful, this rule succeeded.
Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/
|
74b3bcd to 076c3ad
Codecov Report❌ Patch coverage is
It seems the reason for the lack of coverage in I had to use Google Colab (T4) to test it manually. What is the recommended approach here? |
076c3ad to 4e93020
|
@rafaeltuelho Thanks for starting the contribution. It is definitely something that was on our radar as well. The key question we would like to assess is whether this model should be exposed as an OCR engine or as a model in the VLM pipeline.
|
@rafaeltuelho not sure if that aligns with what @dolfim-ibm refers to: as a user it would be amazing to be able to integrate DeepSeek-OCR as an external service, i.e., via API calls, instead of as a local model that is part of the regular pipeline.
@simonschoe Untested, but I think you could already use DeepSeekOCR with the markdown prompt for the VLM API Docling settings. https://docling-project.github.io/docling/examples/vlm_pipeline_api_model/ |
@dolfim-ibm That's a good question. As far as I remember, I have only used/tested the VLM for picture description. Is it possible to run the VLM pipeline for OCR-based conversion? I tried to follow the same approach used by the other OCR engines (e.g., EasyOCR) already supported in Docling.
| >>> options = DeepSeekOcrOptions(prompt="<image>\\nConvert to markdown.") | ||
| """ | ||
|
|
||
| kind: ClassVar[Literal["deepseecocr"]] = "deepseecocr" |
| kind: ClassVar[Literal["deepseecocr"]] = "deepseecocr" | |
| kind: ClassVar[Literal["deepseekocr"]] = "deepseekocr" |
Please also adapt all the other occurrences of deepseecocr
| ) | ||
| self.options: DeepSeekOcrOptions | ||
|
|
||
| self.scale = 3 # multiplier for 72 dpi == 216 dpi |
Is this "simply" copied from the other OCR models or is it the preferred value for DeepSeek-OCR?
It is being used in other OCRs as well (EasyOCR and Tesseract)
Yes, but I was wondering whether scale=3 is a good value for DeepSeek-OCR, or whether it works better with a different scaling factor. Usually the other OCR engines carry a value which is "optimal" for them.
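For reference, the arithmetic in the `scale` code comment is a plain multiplication of the 72 dpi page baseline. A minimal sketch follows; the helper name `scale_to_dpi` and the list of candidate factors are hypothetical and not part of Docling. It only shows which resolutions a sweep over scaling factors would cover, not which one suits DeepSeek-OCR best.

```python
BASE_DPI = 72  # baseline resolution that the scale multiplies, per the code comment


def scale_to_dpi(scale: float) -> float:
    """Return the effective rendering resolution for a given page scale."""
    return BASE_DPI * scale


# Hypothetical candidate factors one might compare when tuning for DeepSeek-OCR.
for scale in (1, 2, 3, 4):
    print(f"scale={scale} -> {scale_to_dpi(scale):.0f} dpi")
```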
| 'transformers (>=4.46.0,<5.0.0)', | ||
| 'torch (>=2.0.0)', | ||
| 'einops', | ||
| 'Pillow (>=10.0.0)', | ||
| 'addict', | ||
| 'easydict', | ||
| 'matplotlib', | ||
| ] |
removing what seems not to be used/needed
| 'transformers (>=4.46.0,<5.0.0)', | |
| 'torch (>=2.0.0)', | |
| 'einops', | |
| 'Pillow (>=10.0.0)', | |
| 'addict', | |
| 'easydict', | |
| 'matplotlib', | |
| ] | |
| 'transformers (>=4.46.0,<5.0.0)', | |
| 'torch (>=2.0.0)', | |
| 'Pillow (>=10.0.0)', | |
| ] |
In my tests, DeepSeek-OCR required addict, matplotlib, and easydict to be present in order to process the parser...
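One way to settle whether these packages are really needed is to check which of them are importable in the test environment before loading the model. This is a sketch using only the standard library; the helper name `missing_modules` is hypothetical, and the names checked are import names, which are assumed here to match the package names in the dependency list.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Modules reported as required in the comment above.
REPORTED = ["addict", "matplotlib", "easydict"]
print("missing:", missing_modules(REPORTED))
```

Running the model in an environment where this prints a non-empty list would show whether the load actually fails without them.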
tests/test_e2e_ocr_conversion.py
Outdated
|
|
||
| # DeepSeek OCR - requires GPU (CUDA or MPS) and transformers | ||
| # Only run if explicitly enabled via environment variable | ||
| # Set DOCLING_TEST_DEEPSEECOCR=true to include DeepSeek-OCR tests |
instead of the ENV, could we detect if deepseek-ocr is runnable?
What do you mean by runnable?
For example: if a CUDA GPU or MPS is available, we run the tests; otherwise we skip them.
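A rough sketch of such a check, assuming PyTorch is the only thing that needs probing. The function name `deepseek_ocr_runnable` is hypothetical; it returns False when PyTorch is not installed, so the tests would simply be skipped on CPU-only CI runners.

```python
def deepseek_ocr_runnable() -> bool:
    """Return True if a CUDA GPU or the Apple MPS backend is available."""
    try:
        import torch
    except ImportError:
        return False  # without PyTorch the model cannot run at all
    if torch.cuda.is_available():
        return True
    mps = getattr(torch.backends, "mps", None)  # absent on older PyTorch builds
    return bool(mps is not None and mps.is_available())


print("DeepSeek-OCR runnable:", deepseek_ocr_runnable())
```

In the test module this could replace the environment variable guard, for instance through `pytest.mark.skipif(not deepseek_ocr_runnable(), reason="needs CUDA or MPS")`.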
docling/models/deepseek_ocr_model.py
Outdated
| _log.error( | ||
| "DeepSeek-OCR MPS model incompatibility detected!\n\n" | ||
| "The MPS-compatible model 'Dogacel/DeepSeek-OCR-Metal-MPS' uses deprecated " | ||
| "transformers APIs (DynamicCache.seen_tokens) that are not compatible with " | ||
| "your current transformers version.\n\n" | ||
| "This is a known issue with the community-maintained MPS fork.\n" | ||
| "See: https://github.com/Dogacel/DeepSeek-OCR-Metal-MPS/issues\n\n" | ||
| "Workarounds:\n" | ||
| " 1. Use a different OCR engine that supports MPS:\n" | ||
| " - EasyOcrOptions(lang=['en'])\n" | ||
| " - RapidOcrOptions()\n" | ||
| " 2. Wait for the MPS model to be updated for newer transformers versions\n" | ||
| " 3. Test in an isolated environment with transformers==4.43.4 (not recommended)\n" | ||
| ) |
I think this issue is solved. I updated the upstream to support latest transformers.
@Dogacel there seem to be other issues with your MPS version. Now I get the following:
DeepSeek-OCR inference failed: 'DynamicCache' object has no attribute 'get_max_length'
I also observed other issues like missing get_usable_length.
Unfortunately this will block the usage of this version of the model in Docling.
@dolfim-ibm Sorry for the late response, I've reproduced this issue, applied the fix and re-tested to confirm it is now working.
To ensure behaviour is consistent, I've copied the classes/functions from the older version of transformers into the model itself, otherwise it required some architectural changes to the existing models (i.e. attention layers shouldn't calculate position embeddings themselves).
I've also tested the performance of the model and so far I don't see any obvious issues.
bb7b8ff to a748add
Yes, but then I wouldn't be able to use it in tandem with the "classic" docling pipeline (including the layout and table models), would I? I thought the VLM model is used in place of the "classic" pipeline.
dolfim-ibm
left a comment
@rafaeltuelho I tried to run the PR locally but I'm not sure it actually works. I always get back an empty file. You can also check it via the CLI with
uv run docling --ocr-engine=deepseekocr tests/data_scanned/ocr_test.pdf
@rafaeltuelho here is a short summary of what we would like to do for the DeepSeek-OCR support.
Our plan is to start from your PR and expand on the above points (at least use cases 1 and 2).
For issue number 1, I've reproduced this issue, applied the fix and re-tested to confirm it is now working. Let me know if you face any issues that I might have missed. |
Great. Looking forward to giving it a try! @Dogacel do you think we could use your model fork also for the CUDA environments? I think the official repo is behind on transformers support and the original team doesn't seem to provide any update/fix. This could also be very interesting (once released): huggingface/transformers#41797.
For CUDA support, I haven't tested it, as I didn't have a CUDA device until recently. Most likely it should work, because my patch is a combination of porting old attention code into deepseek (shared between CPU/CUDA/MPS) and updating the cache logic. However, I now have access to an RTX 4080 and I can test it once I have some free time this weekend. I agree that once transformers adds support, we won't really need my patch. But so far I see an idle PR, and I would rather maintain my fork until everyone gets access. I know the llama.cpp folks used it as a reference during their implementation: ggml-org/llama.cpp#16676.
|
Hey! Pablo from |
I see the issue. Will fix it. |
- Add DeepSeekOcrModel with automatic device detection (CUDA → MPS → Error)
- Add DeepSeekOcrOptions for configuring the OCR engine
- Support CUDA with bfloat16 and flash_attention_2 (optimal performance)
- Support MPS (Apple Silicon) with float16 and eager attention (requires PyTorch 2.7.0+)
- Auto-switch to MPS-compatible model (Dogacel/DeepSeek-OCR-Metal-MPS) on Apple Silicon
- Add mock-based unit tests for CI coverage without GPU hardware
- Update E2E tests with DOCLING_TEST_DEEPSEECOCR environment variable guard

Note: MPS support requires PyTorch 2.7.0+ for aten::_upsample_bicubic2d_aa operator.
See: https://github.com/Dogacel/DeepSeek-OCR-Metal-MPS/discussions

Signed-off-by: Rafael T. C. Soares <[email protected]>
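The device selection described in the commit message (CUDA first, then MPS, otherwise an error) can be sketched as a pure function, which also makes it testable without GPU hardware. This is an illustration, not the PR's actual code: `BackendConfig` and `select_backend` are hypothetical names, and the CUDA checkpoint name `deepseek-ai/DeepSeek-OCR` is an assumption, since the thread only names the MPS fork.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackendConfig:
    device: str
    dtype: str       # dtype name, to be mapped to a torch dtype when loading
    attention: str   # attention implementation named in the commit message
    model_repo: str


# Assumption: the default CUDA checkpoint is not stated in this thread.
CUDA_REPO = "deepseek-ai/DeepSeek-OCR"
MPS_REPO = "Dogacel/DeepSeek-OCR-Metal-MPS"


def select_backend(cuda_available: bool, mps_available: bool) -> BackendConfig:
    """Pick device, dtype, attention, and checkpoint in CUDA → MPS → error order."""
    if cuda_available:
        return BackendConfig("cuda", "bfloat16", "flash_attention_2", CUDA_REPO)
    if mps_available:
        return BackendConfig("mps", "float16", "eager", MPS_REPO)
    raise RuntimeError("DeepSeek-OCR requires a CUDA or MPS device")


print(select_backend(cuda_available=False, mps_available=True))
```

Keeping the hardware probing outside the function means a mock-based unit test can cover all three branches by passing booleans.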
a748add to 33c6056
I've tested my fix and it seems to work fine. Just wondering if you are still open to it, considering the transformers team is soon going to publish the model? Maybe I can update my Hugging Face fork with that model to ensure a consistent implementation.

Note:
Resolves #2497
Checklist: