[Ernie 4.5] Ernie VL models
#39585
Open
vasqu wants to merge 154 commits into huggingface:main from vasqu:ernie_vl
Changes from 141 commits
Commits
339a89c
init
vasqu eb9d6b4
lets tmp disable cache init
vasqu 4260a62
some initial remote code version, for local inference use remote proc…
vasqu 26f06a2
first cleanups
vasqu b3d999a
need to do this slowly
vasqu 1e190e2
more attention cleanup
vasqu b44101d
llama like text attention
vasqu b38e048
generates different text but cos and sin tensors are always close - 1e-8
vasqu fcf3903
another round of rope fixups
vasqu 62206ee
yea, gonna check tomorrow cant cheat w freqs for whatever reason
vasqu 7e7d8e4
NOTE: last time where comp with old rope
vasqu fca8fba
rope cleanup
vasqu db80573
more rope
vasqu e82297b
somewhat clean 3d rope with attn - sin / cos has very small diffs to …
vasqu 8540938
new rope type
vasqu dfe6714
style
vasqu 1153291
attempt at moe, gonna need a deeper look
vasqu 39c77ef
cleanup gate
vasqu aadf423
more cleaning
vasqu 096529d
NOTE remove attempt at moe for now
vasqu 3820cc6
another round of cleanups
vasqu b25a458
whoops
vasqu 04a7882
we back boys, reattempting moe start
vasqu b16737f
moe should be done with this
vasqu 30acfda
cleanup
vasqu 5b6efdd
more cleanup
vasqu 46efff9
nits
vasqu 7303a31
add conversion and adjust code accordingly
vasqu cba549f
fix
vasqu add956e
Merge branch 'main' into ernie_vl
vasqu 01187e2
make moe copyable as far as we can
vasqu d5f7568
cleanup conversion a bit, next config
vasqu 41e6cfc
cleanup config part1
vasqu 5610549
small removal of unused things
vasqu 414fb20
config conversion, rope type doesnt get loaded tho...
vasqu fe3e6d7
fix rope
vasqu 20c2c22
last hardcoded values
vasqu ccea132
remove unnecessary class
vasqu d178a02
starting to make copies available for vision, vision rope refactor to…
vasqu e797a0a
vl rope changes
vasqu 8ff1dea
simplify variable resolution resampler
vasqu f247b64
nit
vasqu 5e2eca3
conversion update
vasqu 73e7c79
more conversions, standardization, and big dtype fix!
vasqu 1d2deac
remove some docs (tmp), focus on code for me
vasqu cfe0b4d
oops
vasqu b643da6
nit
vasqu 6869aa9
fixup embeddings, add todos
vasqu b7363b9
more cleanup
vasqu c53b080
more cleanup, next caching changes
vasqu 60e1073
revert fp16, internally discussed weights are supposed to be bf16
vasqu de04496
fix rope (a bit), prepare cache logic changes
vasqu ba0e2cd
more prep for cache
vasqu e38c511
cache class is used, fixup some flags
vasqu 46cdb54
modular refactor
vasqu b004f0c
partially docstrings, docs, etc
vasqu 777fe1f
cleaner order
vasqu 8cd3bbe
nit
vasqu 2446afa
fix config
vasqu 43c9dfd
remove old artefacts/todos
vasqu 41a919a
Merge branch 'main' into ernie_vl
vasqu 3423440
sync with remote and add some todos for orientation
vasqu 659ae74
remove img process dep on modeling code
vasqu 9d1233e
image processor with a few diffs highlighted to copy from maybe
vasqu e4d0078
fast img processor version
vasqu 76d9a6a
modular image processors
vasqu 79dbeeb
convert tokenizer to have dedicated video placeholder token
vasqu 4a77472
before i forget
vasqu 910e86c
Merge branch 'main' into ernie_vl
vasqu 3744960
a modular bug :/
vasqu a552294
more processor things, some modular adjustments
vasqu 1316c18
remove dependency on token type ids
vasqu 2e23c08
position ids ala qwen vl and modular is bugging
vasqu 5233495
fixup some inheritances + nits
vasqu 0dc7d15
token type ids
vasqu 476bfeb
moe loss, docs, simplify pos ids
vasqu 2284067
align some feature getters
vasqu d24eb0f
docs
vasqu 28181f2
rename conv -> merge aka our naming convention
vasqu bf00c6e
style
vasqu a697dde
fixup tokenizer class in auto
vasqu 0843b21
no more nn sequential
vasqu ff9746d
fix chat template, fix tokenizer conversion, modular bug
vasqu 54d3f97
remove this
vasqu 1f35405
remove old deps (from the remote processor)
vasqu 0db2d94
Merge branch 'main' into ernie_vl
vasqu 4031a4b
whoops
vasqu 542e003
argh
vasqu 3bf78d2
todo, restarting progress tomorrow
vasqu 32ef56a
fast image processor changes output, keeping slow for now
vasqu 1bf1685
NOTE rm debugging code on processor conversion
vasqu 0af0001
first complete conversion script version, todo on whether to use fast…
vasqu 3797816
config docs
vasqu b92195c
image processor tests, only kept to images as videos need different r…
vasqu f944449
Merge branch 'main' into ernie_vl
vasqu ea1b000
processor tests
vasqu e59ecd9
first ish version for video processor, very much WIP tho
vasqu e4bfd31
Merge branch 'main' into ernie_vl
vasqu 6e73385
sync with main and all the changes that happened, fix ernie moe bug i…
vasqu f30bab6
mini style fix
vasqu 0e29474
Merge branch 'main' into ernie_vl
vasqu 33a0319
vid processor is properly separated now
vasqu 7204270
make vid processor its own thing
vasqu 0631ad9
style
vasqu dbc6ca0
video processing and cleanups, img processing done, processing needs …
vasqu 372f680
readd vid patch fn
vasqu 564b1b2
make 4D RoPE possible if manually passed
vasqu dd930d8
simplify the msg on packing, allow external prep but not internal one
vasqu 5af84db
nit
vasqu 42725b2
revert general changes video utils, make it specific to ernie, fixup …
vasqu 21b5820
vid to auto
vasqu bf3568a
left to check: pos ids (rope) + token type ids
vasqu 52c06bc
move token type ids to processor, fix processor to ernie logic
vasqu 685303c
processor fixes, conversion todo for fast img processor
vasqu 0220efa
fix
vasqu a6b2e83
video processor tests, torch compile does not work due to PIL drawing…
vasqu 1b435b8
fix config consistency
vasqu ad26c7b
style
vasqu f1ef664
wip tests
vasqu 3f4ebe5
fix most tests, 2 failing ones remain
vasqu 05c2039
fix last tests
vasqu cd24355
check
vasqu 23029d9
docs consistency
vasqu afc8fed
fix conversion script, more docs
vasqu 97ccaca
optional drawing on frames, style
vasqu 419884d
add error on compile x draw on frames
vasqu 8015964
fix
vasqu 254a021
fix
vasqu 1eea554
change font loading to hub dep with default font
vasqu f85992b
fix config try 2
vasqu 721d740
fix diff resolution, tests (not fast processor, a100)
vasqu f1765e1
fix test
vasqu 81162d3
style
vasqu d535d64
Merge branch 'main' into ernie_vl
vasqu 9a1a27f
torch 2.9 (fa2 untested, video from 2.6)
vasqu 0c54a1e
raushan's review (part 1)
vasqu db1d948
Update docs/source/en/model_doc/ernie4_5_vl.md
vasqu c1119a0
Pablo's review
vasqu 3ebfb39
style
vasqu 662ba2b
fix device/dtype stuff that is no longer needed
vasqu f9a717c
revert vision property rm, necessary for composite sdpa test
vasqu 36c1950
fixup few smaller things + refactor how we load the font entirely (ba…
vasqu 398314d
remove bc min max pixels --> less modular on processor parts but way …
vasqu b70eb4e
fix fps and add fixme to the inefficient conversion stuff
vasqu 1372646
rope
vasqu 1f72e77
style
vasqu aaa5254
copies and last rope stuff i fogot
vasqu 91bdbb0
revert glm4v copies
vasqu 1613601
fix
vasqu 91ba990
simplify temporal slicing and add more descriptions
vasqu ea569f1
that ":" :cry:
vasqu 7c5af18
Merge branch 'main' into ernie_vl
vasqu 285547a
Merge branch 'main' into ernie_vl
vasqu 6c0d473
fixup init
vasqu
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,97 @@ | ||
| <!--Copyright 2025 The Qwen Team and The HuggingFace Inc. team. All rights reserved. | ||
|
|
||
| Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with | ||
| the License. You may obtain a copy of the License at | ||
|
|
||
| http://www.apache.org/licenses/LICENSE-2.0 | ||
|
|
||
| Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on | ||
| an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the | ||
| specific language governing permissions and limitations under the License. | ||
|
|
||
| ⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be | ||
| rendered properly in your Markdown viewer. | ||
|
|
||
| --> | ||
|
|
||
| <div style="float: right;"> | ||
| <div class="flex flex-wrap space-x-1"> | ||
| <img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white"> | ||
| <img alt="FlashAttention" src="https://img.shields.io/badge/%E2%9A%A1%EF%B8%8E%20FlashAttention-eae0c8?style=flat"> | ||
| <img alt="SDPA" src="https://img.shields.io/badge/SDPA-DE3412?style=flat&logo=pytorch&logoColor=white"> </div> | ||
| </div> | ||
|
|
||
| # ernie4_5_vl | ||
|
|
||
| ## Overview | ||
|
|
||
| The ernie4_5_vl model was proposed in [<INSERT PAPER NAME HERE>](<INSERT PAPER LINK HERE>) by <INSERT AUTHORS HERE>. | ||
| <INSERT SHORT SUMMARY HERE> | ||
|
|
||
| The abstract from the paper is the following: | ||
|
|
||
| In this report, we propose PaddleOCR-VL, a SOTA and resource-efficient model tailored for document parsing. Its core component is PaddleOCR-VL-0.9B, a compact yet powerful vision-language model (VLM) that integrates a NaViT-style dynamic resolution visual encoder with the ERNIE-4.5-0.3B language model to enable accurate element recognition. This innovative model efficiently supports 109 languages and excels in recognizing complex elements (e.g., text, tables, formulas, and charts), while maintaining minimal resource consumption. Through comprehensive evaluations on widely used public benchmarks and in-house benchmarks, PaddleOCR-VL achieves SOTA performance in both page-level document parsing and element-level recognition. It significantly outperforms existing solutions, exhibits strong competitiveness against top-tier VLMs, and delivers fast inference speeds. These strengths make it highly suitable for practical deployment in real-world scenarios. | ||
|
|
||
| Tips: | ||
|
|
||
| <INSERT TIPS ABOUT MODEL HERE> | ||
|
|
||
| This model was contributed by [INSERT YOUR HF USERNAME HERE](https://huggingface.co/<INSERT YOUR HF USERNAME HERE>). | ||
vasqu marked this conversation as resolved.
|
||
| The original code can be found [here](<INSERT LINK TO GITHUB REPO HERE>). | ||
|
|
||
|
|
||
| ## Ernie4_5_VLConfig | ||
|
|
||
| [[autodoc]] Ernie4_5_VLConfig | ||
|
|
||
| ## Ernie4_5_VLTextConfig | ||
|
|
||
| [[autodoc]] Ernie4_5_VLTextConfig | ||
|
|
||
| ## Ernie4_5_VLVisionConfig | ||
|
|
||
| [[autodoc]] Ernie4_5_VLVisionConfig | ||
|
|
||
| ## Ernie4_5_VLImageProcessor | ||
|
|
||
| [[autodoc]] Ernie4_5_VLImageProcessor | ||
| - preprocess | ||
|
|
||
| ## Ernie4_5_VLImageProcessorFast | ||
|
|
||
| [[autodoc]] Ernie4_5_VLImageProcessorFast | ||
| - preprocess | ||
|
|
||
| ## Ernie4_5_VLVideoProcessor | ||
|
|
||
| [[autodoc]] Ernie4_5_VLVideoProcessor | ||
| - preprocess | ||
|
|
||
| ## Ernie4_5_VLProcessor | ||
|
|
||
| [[autodoc]] Ernie4_5_VLProcessor | ||
|
|
||
| ## Ernie4_5_VLTextModel | ||
|
|
||
| [[autodoc]] Ernie4_5_VLTextModel | ||
| - forward | ||
|
|
||
| ## Ernie4_5_VLVisionTransformerPretrainedModel | ||
|
|
||
| [[autodoc]] Ernie4_5_VLVisionTransformerPretrainedModel | ||
| - forward | ||
|
|
||
| ## Ernie4_5_VLVariableResolutionResamplerModel | ||
|
|
||
| [[autodoc]] Ernie4_5_VLVariableResolutionResamplerModel | ||
| - forward | ||
|
|
||
| ## Ernie4_5_VLModel | ||
|
|
||
| [[autodoc]] Ernie4_5_VLModel | ||
| - forward | ||
|
|
||
| ## Ernie4_5_VLForConditionalGeneration | ||
|
|
||
| [[autodoc]] Ernie4_5_VLForConditionalGeneration | ||
| - forward | ||
To complete with https://arxiv.org/abs/2510.14528 I suppose
Yea keeping one comment open here, I have a TODO note in the PR description 👍