@JoeLeelyf (Contributor)

This PR adds support for the SArena_MINI benchmark for SVG understanding, editing, and generation.

SArena is a benchmark for evaluating MLLMs on SVG-related tasks (icons, illustrations, chemistry diagrams, etc.) across understanding, editing, Text-to-SVG (T2SVG), and Image-to-SVG (I2SVG).
SArena_MINI is a small subset sampled from SArena-Icon / SArena-Illustration / SArena-Chemistry, designed for quick validation and debugging of SVG-related capabilities while keeping the original task definitions and metrics.

What this PR does

- Adds SArena_MINI benchmark configuration and dataset support to VLMEvalKit.
- Defines SArena_MINI as a sampled subset of:
  - SArena-Icon: understanding, editing, T2SVG, I2SVG.
  - SArena-Illustration: T2SVG, I2SVG.
  - SArena-Chemistry: T2SVG, I2SVG.
- Reuses the official SArena evaluation metrics for SVG tasks (see the illustrative sketch after this list):
  - Understanding: O (overall), C (color), G (geometry), Q (quantity), S (semantic).
  - Editing / I2SVG: DINO, SSIM, LPIPS, PSNR.
  - T2SVG: FID, FID-C, CLIP-T2I, CLIP-I2I, token length.
- Provides example configs and evaluation scripts for quickly running SArena_MINI in VLMEvalKit.
- Verifies the integration by reproducing InternVL3-8B performance on SArena / SArena_MINI.
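
To make the "rendered-image metrics" above concrete: for editing and I2SVG, both the predicted and the reference SVG are rasterized and then compared in image space. The sketch below is not the official SArena metric code (DINO and LPIPS additionally require their model-based scorers); it only illustrates the SSIM/PSNR part, assuming `cairosvg` and `scikit-image` are installed and using hypothetical file names.

```python
# Illustrative only: rasterize two SVGs and score them with SSIM / PSNR.
# Not the official SArena implementation; DINO and LPIPS need their own model-based scorers.
import io

import cairosvg
import numpy as np
from PIL import Image
from skimage.metrics import peak_signal_noise_ratio, structural_similarity


def render_svg(svg_path: str, size: int = 512) -> np.ndarray:
    """Rasterize an SVG file to an RGB uint8 array on a white canvas."""
    png_bytes = cairosvg.svg2png(
        url=svg_path, output_width=size, output_height=size, background_color="white"
    )
    return np.asarray(Image.open(io.BytesIO(png_bytes)).convert("RGB"))


pred = render_svg("prediction.svg")  # hypothetical file names
ref = render_svg("reference.svg")

ssim = structural_similarity(ref, pred, channel_axis=-1, data_range=255)
psnr = peak_signal_noise_ratio(ref, pred, data_range=255)
print(f"SSIM={ssim:.3f}  PSNR={psnr:.3f}")
```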

Experimental results (InternVL3-8B)

All experiments below use the official SArena metrics.
We compare the original paper numbers (Ori), the official benchmark implementation (Use-Ori-Bench), and this PR’s VLMEvalKit implementation (Ours).

SArena-Icon

Understanding (accuracy)

| Setting | O ↑ | C ↑ | G ↑ | Q ↑ | S ↑ |
| --- | --- | --- | --- | --- | --- |
| Ori | 59.5 | 79.1 | 59.3 | 38.2 | 61.3 |
| Use-Ori-Bench | 60.775 | 80.9 | 63.0 | 44.1 | 55.1 |
| Ours | 59.7 | 80.9 | 60.9 | 44.6 | 57.14 |

Editing (rendered-image metrics)

| Setting | DINO ↑ | SSIM ↑ | LPIPS ↓ | PSNR ↑ |
| --- | --- | --- | --- | --- |
| Ori | 0.921 | 0.761 | 0.170 | 29.615 |
| Use-Ori-Bench | 0.862 | 0.637 | 0.219 | 25.550 |
| Ours | 0.902 | 0.702 | 0.196 | 24.790 |

Text-to-SVG (T2SVG)

| Setting | FID ↓ | FID-C ↓ | CLIP-T2I ↑ | CLIP-I2I ↑ |
| --- | --- | --- | --- | --- |
| Ori | 23.061 | 14.303 | 21.897 | 71.45 |
| Use-Ori-Bench | 120.82 | 23.700 | 21.68 | 72.21 |
| Ours | 124.84 | 25.070 | 21.09 | 70.64 |

Image-to-SVG (I2SVG)

| Setting | DINO ↑ | SSIM ↑ | LPIPS ↓ | PSNR ↑ |
| --- | --- | --- | --- | --- |
| Ori | 0.812 | 0.557 | 0.361 | 7.22 |
| Use-Ori-Bench | 0.813 | 0.588 | 0.358 | 7.763 |
| Ours | 0.785 | 0.516 | 0.378 | 6.458 |

SArena-Illustration

Text-to-SVG (T2SVG)

| Setting | FID ↓ | FID-C ↓ | CLIP-T2I ↑ | CLIP-I2I ↑ | Tokens |
| --- | --- | --- | --- | --- | --- |
| Ori | 36.736 | 25.682 | 18.493 | 61.964 | 493 |
| Use-Ori-Bench | 154.02 | 44.850 | 17.79 | 60.9 | 464 |
| Ours | 161.98 | 49.190 | 16.89 | 59.96 | 1003 |

Image-to-SVG (I2SVG)

| Setting | DINO ↑ | SSIM ↑ | LPIPS ↓ | PSNR ↑ | Tokens |
| --- | --- | --- | --- | --- | --- |
| Ori | 0.772 | 0.569 | 0.397 | 8.542 | 716 |
| Use-Ori-Bench | 0.774 | 0.546 | 0.385 | 8.27 | 492.4 |
| Ours | 0.719 | 0.309 | 0.409 | 5.07 | 1863 |

SArena-Chemistry

Text-to-SVG (T2SVG)

| Setting | FID ↓ | FID-C ↓ | CLIP-I2I ↑ | DINO ↑ | SSIM ↑ | LPIPS ↓ | PSNR ↑ |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Ori | 33.613 | 61.675 | 56.856 | 0.865 | 0.783 | 0.203 | 13.84 |
| Use-Ori-Bench | 114.63 | 65.220 | 57.51 | 0.871 | 0.807 | 0.193 | 14.49 |
| Ours | 137.13 | 76.880 | 53.13 | 0.791 | 0.791 | 0.256 | 9.317 |

The differences between the original paper, the official benchmark implementation, and this PR's results are within a reasonable range. This suggests that the SArena_MINI subset and its VLMEvalKit integration provide a reliable and efficient way to validate SVG-related capabilities.

Version

- torch: 2.6.0
- flash_attn: 2.7.4.post1
- transformers: 4.49.0
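
A quick way to confirm that a local environment matches the versions above (a minimal check, assuming all three packages are importable):

```python
# Print installed versions; expected values are the ones listed above.
import torch
import transformers
import flash_attn

print("torch:", torch.__version__)               # expected 2.6.0
print("transformers:", transformers.__version__) # expected 4.49.0
print("flash_attn:", flash_attn.__version__)     # expected 2.7.4.post1
```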

@mzr1996 merged commit 18ce87c into open-compass:main on Dec 11, 2025