Add blog post: Assessing 331 Arabic NLP Datasets #3273

Open
Salah-Sal wants to merge 1 commit into huggingface:main from Salah-Sal:Salah/arabic-dataset-quality

@Salah-Sal

Summary

  • First large-scale quality audit of 331 Arabic NLP datasets from the Masader catalog
  • Automated pipeline using Claude (Sonnet) in isolated Docker containers, inspecting up to 500 samples per dataset
  • Scores across 7 quality dimensions with statistical evidence (duplication, encoding, text lengths)
  • Results: 35 excellent, 200 good, 79 acceptable, 17 poor (mean score 65.3/100)
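
The statistical evidence mentioned above (duplication, encoding, text lengths) could be gathered along these lines. This is a minimal sketch, not the pipeline from this PR: the function names `sample_texts` and `basic_stats` are hypothetical, and both the NFKC-based duplicate check and the use of U+FFFD as an encoding-damage signal are assumptions made for illustration. Only the 500-sample cap comes from the summary.

```python
import random
import statistics
import unicodedata

MAX_SAMPLES = 500  # per-dataset cap stated in the PR summary


def sample_texts(texts, max_samples=MAX_SAMPLES, seed=0):
    """Return at most max_samples texts, sampled reproducibly (hypothetical helper)."""
    texts = list(texts)
    if len(texts) <= max_samples:
        return texts
    return random.Random(seed).sample(texts, max_samples)


def basic_stats(texts):
    """Compute simple duplication, encoding, and length statistics (hypothetical helper)."""
    n = len(texts)
    if n == 0:
        return {"n": 0, "duplicate_rate": 0.0, "encoding_issue_rate": 0.0,
                "mean_chars": 0.0, "median_chars": 0.0}

    # Assumption: two samples count as duplicates if they match after
    # NFKC normalization and whitespace stripping.
    normalized = [unicodedata.normalize("NFKC", t).strip() for t in texts]
    duplicate_rate = 1 - len(set(normalized)) / n

    # Assumption: the Unicode replacement character signals a decoding failure.
    damaged = sum(1 for t in texts if "\ufffd" in t)

    lengths = [len(t) for t in texts]
    return {
        "n": n,
        "duplicate_rate": duplicate_rate,
        "encoding_issue_rate": damaged / n,
        "mean_chars": statistics.mean(lengths),
        "median_chars": statistics.median(lengths),
    }


if __name__ == "__main__":
    demo = ["مرحبا", "مرحبا", "نص \ufffd تالف", "abc"]
    print(basic_stats(sample_texts(demo)))
```

Statistics like these would feed into a quality score as supporting evidence; how the actual pipeline weights them across the 7 dimensions is not described here.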

Files

  • arabic-dataset-quality.md — Blog post (~1,250 words)
  • assets/arabic-dataset-quality/thumbnail.png — Thumbnail (1300x709, 1.3 MB)
  • _blog.yml — Added entry at the end

First large-scale quality audit of Arabic NLP datasets from the
Masader catalog. Includes an automated pipeline using Claude (Sonnet)
in Docker containers, 7 quality dimensions, and statistical analysis
of up to 500 samples per dataset.