[Benchmark] Support SArena_MINI Benchmark #1353
Merged
+1,715
−1
This PR adds support for the SArena_MINI benchmark for SVG understanding, editing, and generation.
SArena is a benchmark for evaluating MLLMs on SVG-related tasks (icons, illustrations, chemistry diagrams, etc.) across understanding, editing, Text-to-SVG (T2SVG), and Image-to-SVG (I2SVG).
SArena_MINI is a small subset sampled from SArena-Icon / SArena-Illustration / SArena-Chemistry, designed for quick validation and debugging of SVG-related capabilities while keeping the original task definitions and metrics.
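As a rough illustration of how a mini subset can be drawn while keeping every task represented, here is a minimal sketch of per-group sampling. It is not the procedure used to build SArena_MINI; the record fields (`subset`, `task`, `index`), the function name `sample_mini_subset`, and the group size are all assumptions made for this example.

```python
import random
from collections import defaultdict


def sample_mini_subset(records, per_group, seed=0):
    """Draw up to `per_group` records from every (subset, task) group.

    Hypothetical helper: the real SArena_MINI sampling may differ.
    """
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    groups = defaultdict(list)
    for rec in records:
        groups[(rec["subset"], rec["task"])].append(rec)

    mini = []
    for key in sorted(groups):  # sorted keys keep the output order stable
        items = groups[key]
        mini.extend(rng.sample(items, min(per_group, len(items))))
    return mini


# Toy records mirroring the subset/task layout reported in this PR.
layout = {
    "SArena-Icon": ["Understanding", "Editing", "T2SVG", "I2SVG"],
    "SArena-Illustration": ["T2SVG", "I2SVG"],
    "SArena-Chemistry": ["T2SVG"],
}
records = [
    {"subset": subset, "task": task, "index": i}
    for subset, tasks in layout.items()
    for task in tasks
    for i in range(10)
]

mini = sample_mini_subset(records, per_group=3)
```

Sampling per (subset, task) group rather than from the pooled data keeps small groups such as SArena-Chemistry from disappearing in the mini split.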
What this PR does
Experimental results (InternVL3-8B)
All experiments below use the official SArena metrics.
We compare the original paper numbers (Ori), the official benchmark implementation (Use-Ori-Bench), and this PR’s VLMEvalKit implementation (Ours).
SArena-Icon
Understanding (accuracy)
Editing (rendered-image metrics)
Text-to-SVG (T2SVG)
Image-to-SVG (I2SVG)
SArena-Illustration
Text-to-SVG (T2SVG)
Image-to-SVG (I2SVG)
SArena-Chemistry
Text-to-SVG (T2SVG)
The results from the original paper (Ori), the official benchmark implementation (Use-Ori-Bench), and this PR (Ours) differ only within a reasonable range. This suggests that the SArena_MINI subset and its VLMEvalKit integration provide a reliable and efficient way to validate SVG-related capabilities.
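One way to make "within a reasonable range" checkable is to compute, per metric, the largest gap between the three sources and compare it with a tolerance. The sketch below assumes a simple nested-dict layout; the helper names, the metric names, the tolerance, and the numbers are made-up placeholders, not results from this PR.

```python
def max_gap_per_metric(results):
    """Map each metric to (max - min) of its values across sources."""
    return {
        metric: max(by_source.values()) - min(by_source.values())
        for metric, by_source in results.items()
    }


def within_tolerance(results, tol):
    """True if every metric's cross-source gap is at most `tol`."""
    return all(gap <= tol for gap in max_gap_per_metric(results).values())


# Placeholder values for illustration only (NOT numbers from this PR).
placeholder_results = {
    "metric_a": {"Ori": 50.0, "Use-Ori-Bench": 52.0, "Ours": 51.0},
    "metric_b": {"Ori": 10.0, "Use-Ori-Bench": 10.5, "Ours": 11.0},
}

gaps = max_gap_per_metric(placeholder_results)
ok = within_tolerance(placeholder_results, tol=3.0)
```

A single absolute tolerance is the simplest choice; metrics on very different scales would likely need per-metric or relative tolerances instead.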
References
Version