
Conversation

@rafaeltuelho

  • Add DeepSeekOcrModel with automatic device detection (CUDA/MPS)
  • CUDA uses bfloat16 precision and flash_attention_2 (optimal)
  • MPS uses float16 precision and eager attention (Apple Silicon fallback)
  • Auto-switch to MPS-compatible model (Dogacel/DeepSeek-OCR-Metal-MPS)
  • Add PyTorch 2.7.0+ version validation for MPS support
  • Add clear error messages for device/version incompatibilities
  • Update test_e2e_ocr_conversion.py with CUDA/MPS device support
  • Add manual test script for DeepSeek-OCR validation
  • Update documentation with MPS support information
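A minimal sketch of the device-selection behavior described in the bullets above (illustrative only; the actual attribute and method names in the PR may differ):

```python
import torch


def _select_device_and_dtype():
    # Prefer CUDA: bfloat16 + flash_attention_2 for best throughput.
    if torch.cuda.is_available():
        return "cuda", torch.bfloat16, "flash_attention_2"
    # Fall back to Apple Silicon (MPS): float16 + eager attention,
    # requires PyTorch 2.7.0+ and the MPS-compatible model weights.
    if torch.backends.mps.is_available():
        return "mps", torch.float16, "eager"
    raise RuntimeError("DeepSeek-OCR requires a CUDA or MPS device")
```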

Note:

Resolves #2497

Checklist:

  • [y] Documentation has been updated, if necessary.
  • [y] Examples have been added, if necessary.
  • [y] Tests have been added, if necessary.

@github-actions
Contributor

github-actions bot commented Dec 4, 2025

DCO Check Passed

Thanks @rafaeltuelho, all your commits are properly signed off. 🎉

@dosubot

dosubot bot commented Dec 4, 2025

Related Documentation

Checked 4 published document(s) in 1 knowledge base(s). No updates required.


@mergify

mergify bot commented Dec 4, 2025

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:

@rafaeltuelho force-pushed the feature/deepseek-ocr-integration branch from 74b3bcd to 076c3ad on December 4, 2025, 02:02
@codecov

codecov bot commented Dec 4, 2025

Codecov Report

❌ Patch coverage is 60.55556% with 71 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| docling/models/deepseek_ocr_model.py | 56.70% | 71 Missing ⚠️ |


@rafaeltuelho
Author

It seems the reason for the lack of coverage in docling/models/deepseek_ocr_model.py is that the CI tests can't run DeepSeek-OCR tests because no GPU (CUDA/MPS) is available in the CI environment. Also, the DOCLING_TEST_DEEPSEECOCR environment variable is not set.

I had to use Google Colab (T4) to test it manually.

What is the recommended approach here?

@rafaeltuelho force-pushed the feature/deepseek-ocr-integration branch from 076c3ad to 4e93020 on December 4, 2025, 21:16
@dolfim-ibm
Member

@rafaeltuelho Thanks for starting the contribution. It is definitely something that was on our radar as well.

The key question we would like to assess is whether this model should be exposed as an OCR engine or as a model in the VLM pipeline.

@simonschoe

@rafaeltuelho not sure if that aligns with what @dolfim-ibm refers to: as a user, it would be amazing to be able to integrate DeepSeek-OCR as an external service, i.e., via API calls, instead of as a local model as part of the regular pipeline.

@dolfim-ibm
Member

@rafaeltuelho not sure if that aligns with what @dolfim-ibm refers to: as a user, it would be amazing to be able to integrate DeepSeek-OCR as an external service, i.e., via API calls, instead of as a local model as part of the regular pipeline.

@simonschoe Untested, but I think you could already use DeepSeek-OCR with the markdown prompt via the VLM pipeline API settings in Docling. https://docling-project.github.io/docling/examples/vlm_pipeline_api_model/
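Untested sketch of what that could look like, based on the linked example; the endpoint URL and served model name are placeholders, and option/import names may differ across docling versions:

```python
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import VlmPipelineOptions
from docling.datamodel.pipeline_options_vlm_model import ApiVlmOptions, ResponseFormat
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.pipeline.vlm_pipeline import VlmPipeline

# Hypothetical OpenAI-compatible endpoint serving DeepSeek-OCR.
pipeline_options = VlmPipelineOptions(
    enable_remote_services=True,
    vlm_options=ApiVlmOptions(
        url="http://localhost:8000/v1/chat/completions",  # placeholder
        params=dict(model="deepseek-ai/DeepSeek-OCR"),
        prompt="Convert the document to markdown.",
        timeout=120,
        response_format=ResponseFormat.MARKDOWN,
    ),
)

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_cls=VlmPipeline, pipeline_options=pipeline_options
        )
    }
)
doc = converter.convert("tests/data_scanned/ocr_test.pdf").document
```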

@rafaeltuelho
Author

rafaeltuelho commented Dec 5, 2025

The key question we would like to assess is whether this model should be exposed as an OCR engine or as a model in the VLM pipeline.

@dolfim-ibm That's a good question. So far I have only used/tested the VLM for picture description. Is it possible to run the VLM pipeline for OCR-based conversion?

I tried to follow the same approach used by other OCR engines (e.g., EasyOcr) already supported in Docling.

>>> options = DeepSeekOcrOptions(prompt="<image>\\nConvert to markdown.")
"""

kind: ClassVar[Literal["deepseecocr"]] = "deepseecocr"
Member

Suggested change
kind: ClassVar[Literal["deepseecocr"]] = "deepseecocr"
kind: ClassVar[Literal["deepseekocr"]] = "deepseekocr"

Member

Please also adapt all the other occurrences of deepseecocr

)
self.options: DeepSeekOcrOptions

self.scale = 3 # multiplier for 72 dpi == 216 dpi
Member

Is this "simply" copied from the other OCR models or is it the preferred value for DeepSeek-OCR?

Author

It is being used in other OCR engines as well (EasyOCR and Tesseract).

Member

Yes, but I was wondering if scale=3 is good for DeepSeek-OCR, or if maybe it works better with another scaling factor. Usually the other OCR engines carry a value which is "optimal" for them.
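For reference, the scale is just a render-resolution multiplier applied to the page image before it is handed to the OCR engine; a quick illustration of what scale=3 means in pixels (numbers only, not Docling code):

```python
BASE_DPI = 72
scale = 3  # 72 dpi * 3 = 216 dpi

# A US-letter page (8.5 x 11 in) rendered at that resolution:
width_px = round(8.5 * BASE_DPI * scale)   # 1836 px
height_px = round(11 * BASE_DPI * scale)   # 2376 px
```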

Comment on lines +97 to +104
'transformers (>=4.46.0,<5.0.0)',
'torch (>=2.0.0)',
'einops',
'Pillow (>=10.0.0)',
'addict',
'easydict',
'matplotlib',
]
Member

removing what seems not to be used/needed

Suggested change
'transformers (>=4.46.0,<5.0.0)',
'torch (>=2.0.0)',
'einops',
'Pillow (>=10.0.0)',
'addict',
'easydict',
'matplotlib',
]
'transformers (>=4.46.0,<5.0.0)',
'torch (>=2.0.0)',
'Pillow (>=10.0.0)',
]

Author

In my tests, DeepSeek-OCR required addict, matplotlib, and easydict to be present in order to run the parser...


# DeepSeek OCR - requires GPU (CUDA or MPS) and transformers
# Only run if explicitly enabled via environment variable
# Set DOCLING_TEST_DEEPSEECOCR=true to include DeepSeek-OCR tests
Member

Instead of the env var, could we detect if deepseek-ocr is runnable?

Author

What do you mean by runnable?

Member

Like, if a CUDA GPU or MPS is available we run the tests, otherwise we skip them.
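Something along these lines, as a sketch (the test name is illustrative):

```python
import pytest
import torch

# Skip DeepSeek-OCR tests when no supported accelerator is present,
# instead of gating them behind an environment variable.
_HAS_ACCELERATOR = torch.cuda.is_available() or torch.backends.mps.is_available()


@pytest.mark.skipif(
    not _HAS_ACCELERATOR, reason="DeepSeek-OCR requires a CUDA or MPS device"
)
def test_e2e_deepseek_ocr_conversion():
    ...
```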

Comment on lines 368 to 381
_log.error(
"DeepSeek-OCR MPS model incompatibility detected!\n\n"
"The MPS-compatible model 'Dogacel/DeepSeek-OCR-Metal-MPS' uses deprecated "
"transformers APIs (DynamicCache.seen_tokens) that are not compatible with "
"your current transformers version.\n\n"
"This is a known issue with the community-maintained MPS fork.\n"
"See: https://github.com/Dogacel/DeepSeek-OCR-Metal-MPS/issues\n\n"
"Workarounds:\n"
" 1. Use a different OCR engine that supports MPS:\n"
" - EasyOcrOptions(lang=['en'])\n"
" - RapidOcrOptions()\n"
" 2. Wait for the MPS model to be updated for newer transformers versions\n"
" 3. Test in an isolated environment with transformers==4.43.4 (not recommended)\n"
)

I think this issue is solved. I updated the upstream to support the latest transformers.

Member

@Dogacel there seem to be other issues with your MPS version. Now I get the following:

DeepSeek-OCR inference failed: 'DynamicCache' object has no attribute 'get_max_length'

I also observed other issues like missing get_usable_length.

Unfortunately this will block the usage of this version of the model in Docling.


@dolfim-ibm Sorry for the late response, I've reproduced this issue, applied the fix and re-tested to confirm it is now working.

To ensure behaviour is consistent, I've copied the classes/functions from the older version of transformers into the model itself; otherwise it would have required some architectural changes to the existing models (i.e., attention layers shouldn't calculate position embeddings themselves).

I've also tested the performance of the model and so far I don't see any obvious issues.
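For context, a rough sketch of the kind of compatibility shim this implies (not the actual code in the fork): newer transformers releases removed some DynamicCache helpers, e.g. get_max_length was superseded by get_max_cache_shape, so ported code has to bridge both:

```python
def _cache_max_length(cache):
    # Older DeepSeek-OCR code calls cache.get_max_length(); recent
    # transformers releases expose get_max_cache_shape() instead.
    if hasattr(cache, "get_max_length"):
        return cache.get_max_length()
    if hasattr(cache, "get_max_cache_shape"):
        return cache.get_max_cache_shape()
    return None
```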

@rafaeltuelho force-pushed the feature/deepseek-ocr-integration branch 2 times, most recently from bb7b8ff to a748add on December 15, 2025, 15:37
@simonschoe

simonschoe commented Dec 15, 2025


@simonschoe Untested, but I think you could already use DeepSeekOCR with the markdown prompt for the VLM API Docling settings. https://docling-project.github.io/docling/examples/vlm_pipeline_api_model/

Yes, but then I wouldn't be able to use it in tandem with the "classic" Docling pipeline (including the layout and table models), would I? I thought the VLM model is used in place of the "classic" pipeline.
What I would prefer is to have the OCR stage in the regular pipeline call an external service that complements the image regions segmented by the layout model with the contained text/characters, in the same way the current local tesseract, easyocr, and rapidocr engines do, only with higher quality thanks to the capabilities of the DeepSeek backbone.

Member

@dolfim-ibm left a comment

@rafaeltuelho I tried to run the PR locally but I'm not sure it actually works. I always get back an empty file. You can also check it via the CLI with

uv run docling --ocr-engine=deepseekocr tests/data_scanned/ocr_test.pdf

@dolfim-ibm
Member

@rafaeltuelho here is a short summary of what we would like to do for the DeepSeek-OCR support.

  1. Pause the MPS support for the moment (at least until @Dogacel or others can provide a working version with a recent transformers version)
  2. Move the backbone for using DeepSeek-OCR to a different class which is independent of the "usage" (OCR, etc.)
  3. Allow for 3 usages within Docling:
    1. In the VLM pipeline. Here we should use the prompt `<image>\n<|grounding|>Convert the document to markdown.` and parse the special label+location tokens
    2. As a "classic" OCR engine. Here we should enforce only the usage of the prompt `<image>\n<|grounding|>OCR this image.`, which returns the text lines (i.e. we don't need the ocr_rect logic as in this PR). However, we already found that the model is not so stable, i.e. sometimes it returns the OCR text lines, other times it returns the markdown with label+location. As a "classic" OCR we can support only the first version.
    3. As a "post-processing" OCR (new to come) which performs Free OCR on the text snippets identified by the layout model

Our plan is to start from your PR and expand on the above points (at least use cases 1 and 2).
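For clarity, the two prompts from the list above side by side (constant names are illustrative):

```python
# Usage 1: VLM pipeline. Full document conversion; the grounding output
# (label + location tokens) must be parsed afterwards.
PROMPT_MARKDOWN = "<image>\n<|grounding|>Convert the document to markdown."

# Usage 2: "classic" OCR engine. Expects plain text lines back.
PROMPT_OCR = "<image>\n<|grounding|>OCR this image."
```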

@Dogacel

Dogacel commented Dec 19, 2025

1. Pause the MPS support for the moment (at least until @Dogacel or others can provide a working version with a recent transformers version)

For issue number 1,

I've reproduced this issue, applied the fix and re-tested to confirm it is now working.

https://huggingface.co/Dogacel/DeepSeek-OCR-Metal-MPS/commit/8ebdc7bf465194ce57bcffc2a718b1574b2da640

Let me know if you face any issues that I might have missed.

@dolfim-ibm
Member


Great. Looking forward to giving it a try!

@Dogacel do you think we could use your model fork also for the CUDA environments? I think the official repo is behind with transformers support and the original team doesn't seem to provide any update/fix. This could also be very interesting (once released): huggingface/transformers#41797.

@Dogacel

Dogacel commented Dec 19, 2025

@Dogacel do you think we could use your model fork also for the CUDA environments? I think the official repo is behind with transformers support and the original team doesn't seem to provide any update/fix. This could also be very interesting (once released): huggingface/transformers#41797.

For CUDA support, I haven't tested it as I didn't have a CUDA device until recently. Most likely it should work, because my patch is a combination of porting old attention code into deepseek (shared between CPU/CUDA/MPS) + updating cache logic.

However, I now have access to an RTX 4080 and I can test it once I have some free time this weekend.

Once transformers adds support and that PR is merged, I agree that we won't really need my patch. But so far I see an idle PR, so I would rather maintain my fork until everyone gets access. I know the llama.cpp folks used it as a reference during their implementation: ggml-org/llama.cpp#16676.

@molbap

molbap commented Dec 22, 2025

Hey! Pablo from transformers here. The DeepSeek-OCR PR will be part of the v5 release, so it's definitely going to be merged in the coming weeks, for your information!

@rafaeltuelho
Author

@rafaeltuelho I tried to run the PR locally but I'm not sure it actually works. I always get back an empty file. You can also check it via the CLI with

uv run docling --ocr-engine=deepseekocr tests/data_scanned/ocr_test.pdf

I see the issue. Will fix it.

- Add DeepSeekOcrModel with automatic device detection (CUDA → MPS → Error)
- Add DeepSeekOcrOptions for configuring the OCR engine
- Support CUDA with bfloat16 and flash_attention_2 (optimal performance)
- Support MPS (Apple Silicon) with float16 and eager attention (requires PyTorch 2.7.0+)
- Auto-switch to MPS-compatible model (Dogacel/DeepSeek-OCR-Metal-MPS) on Apple Silicon
- Add mock-based unit tests for CI coverage without GPU hardware
- Update E2E tests with DOCLING_TEST_DEEPSEECOCR environment variable guard

Note: MPS support requires PyTorch 2.7.0+ for aten::_upsample_bicubic2d_aa operator.
See: https://github.com/Dogacel/DeepSeek-OCR-Metal-MPS/discussions

Signed-off-by: Rafael T. C. Soares <[email protected]>
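A rough illustration of the mock-based unit-test approach mentioned in the commit message above (the patched attribute names inside docling/models/deepseek_ocr_model.py are assumptions, not the PR's actual code):

```python
from unittest.mock import MagicMock, patch


# Patch the heavy model loading so the option handling in DeepSeekOcrModel
# can be exercised on CPU-only CI runners without downloading weights.
@patch("docling.models.deepseek_ocr_model.AutoModel")
@patch("docling.models.deepseek_ocr_model.AutoTokenizer")
def test_deepseek_ocr_model_init(mock_tokenizer, mock_model):
    mock_model.from_pretrained.return_value = MagicMock()
    mock_tokenizer.from_pretrained.return_value = MagicMock()
    # ...construct DeepSeekOcrModel with DeepSeekOcrOptions and assert on the
    # selected device/dtype without running real inference.
```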
@rafaeltuelho force-pushed the feature/deepseek-ocr-integration branch from a748add to 33c6056 on December 22, 2025, 20:27
@rafaeltuelho
Author

I tested again and was able to convert the old_newspaper.pdf sample

uv run docling --ocr-engine=deepseekocr tests/data_scanned/old_newspaper.png
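For reference, a rough Python equivalent of running the deepseekocr engine through the converter API, assuming the DeepSeekOcrOptions class added in this PR (untested sketch; the import path for DeepSeekOcrOptions is an assumption):

```python
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

# DeepSeekOcrOptions is introduced by this PR; the import path may differ.
from docling.datamodel.pipeline_options import DeepSeekOcrOptions

pipeline_options = PdfPipelineOptions(do_ocr=True, ocr_options=DeepSeekOcrOptions())

converter = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}
)
result = converter.convert("tests/data_scanned/ocr_test.pdf")
print(result.document.export_to_markdown())
```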

@Dogacel

Dogacel commented Dec 24, 2025

@Dogacel do you think we could use your model fork also for the CUDA environments? I think the official repo is behind with transformers support and the original team doesn't seem to provide any update/fix. This could also be very interesting (once released): huggingface/transformers#41797.

I've tested my fix and it seems to work fine.

Just wondering if you are still open to it, considering the transformers team is soon going to publish the model? Maybe I can update my Hugging Face fork with that model to ensure a consistent implementation.
