ColTurk-VDR

🚧 WIP — v0.1-pre · Day-1 baseline ready · 2026-05-18

ColBERT-style late-interaction visual document retriever for Turkish enterprise documents (e-invoices, KYC, legal, financial). Built on Qwen3-VL-4B-Instruct (Apache-2.0) via colpali_engine.models.ColQwen3 (Qwen3-VL native, verified 2026-05-18).

License: MIT · Base Model: Apache-2.0 · Status: Day-1 baseline ready · Submission: MTEB v2


What is this?

ColTurk-VDR brings ColBERT-style multi-vector late-interaction retrieval to Turkish enterprise documents (e-invoices, KYC, legal, financial). It is trained on a single A100 80GB ($0 marginal cost). Its primary target is the ViDoRe V3 MTEB leaderboard, and it ships with ViDoRe-TR, the first public Turkish split for visual document retrieval, released as a community by-product. The code is MIT-licensed, the base model is Apache-2.0, and a KVKK-compliant synthetic data pipeline is included.

Why this matters

Turkey's e-invoice mandate (GİB Communiqué No. 589, 1 January 2026), KYC pipelines (Akbank Lab, Garanti BBVA, Yapı Kredi, iyzico, Papara), and SME ERP integrations (Logo, Mikro, KoçSistem) all need page-level retrieval over Turkish documents. The classic OCR → text retrieval pipeline loses Turkish characters (ğ, ş, ı, ç, ö, ü), table structure, and layout information. ColTurk-VDR performs late interaction directly over visual patch embeddings, so it is OCR-free, layout-aware, and multilingual.

Target Leaderboards

| Target | Link | Current SOTA | Our goal |
|---|---|---|---|
| ViDoRe V3 MTEB (primary) | HF Space | NVIDIA Nemotron ColEmbed-8B-v2, NDCG@10 63.42 | Top-15 (NDCG@10 52-58) |
| MTEB Multilingual Retrieval (Turkish) | MTEB | Open | #1 open-weight Turkish model |
| ViDoRe-TR community split | community PR | None yet | Creator + maintainer |
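
Since the targets above are stated in NDCG@10, here is a minimal sketch of how that metric is computed for a single query. It is not the MTEB or ViDoRe evaluation code. The function names `dcg_at_k` and `ndcg_at_k` are hypothetical, linear gain is an assumption (some implementations use exponential gain), and the input is assumed to hold the relevance label of every judged page, in the order the retriever ranked them. The leaderboard figures appear to be this value scaled by 100 and averaged over queries.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int = 10) -> float:
    """Discounted cumulative gain over the top-k ranked relevance labels."""
    # rank is 0-based, so the discount for the first result is log2(2) = 1.
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances: Sequence[float], k: int = 10) -> float:
    """DCG of the given ranking divided by the DCG of the ideal ranking."""
    ideal = sorted(ranked_relevances, reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    if ideal_dcg == 0:
        return 0.0  # no relevant page was judged for this query
    return dcg_at_k(ranked_relevances, k) / ideal_dcg


if __name__ == "__main__":
    # The single relevant page was ranked second instead of first.
    print(round(ndcg_at_k([0, 1, 0, 0]), 4))
```

A perfect ranking scores 1.0, and pushing the only relevant page from rank 1 to rank 2 divides the score by log2(3).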

Quick start (planned, post-v0.1)

docker compose up
# → http://localhost:8000 (FastAPI)
# → http://localhost:8501 (Streamlit dashboard)
# → http://localhost:6333 (Qdrant)

Architecture

Image (page) → Qwen3-VL-4B Vision → Patch tokens → 128-dim projection
                                                          ↓
Query (text) → Qwen3-VL-4B Text → Query tokens → 128-dim projection
                                                          ↓
                                MaxSim(q, d) = Σᵢ maxⱼ (qᵢ · dⱼ)
                                                          ↓
                                                NDCG@10, recall@k, mAP
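
The MaxSim step in the diagram can be illustrated with a small pure-Python sketch. This is not the `colpali_engine` API and not the repository's scoring code: `maxsim` and `rank_pages` are hypothetical names, the vectors are toy 2-dimensional embeddings rather than the 128-dimensional projections, and a real implementation would use batched tensor operations.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim(query_tokens: Sequence[Vector], doc_patches: Sequence[Vector]) -> float:
    """MaxSim(q, d) = sum over query tokens i of max over patches j of (q_i . d_j)."""
    if not doc_patches:
        raise ValueError("a page needs at least one patch embedding")
    return sum(max(dot(q, d) for d in doc_patches) for q in query_tokens)


def rank_pages(query_tokens: Sequence[Vector],
               pages: Sequence[Sequence[Vector]]) -> List[int]:
    """Return page indices sorted by descending MaxSim score."""
    scores = [maxsim(query_tokens, page) for page in pages]
    return sorted(range(len(pages)), key=lambda i: scores[i], reverse=True)


if __name__ == "__main__":
    query = [[1.0, 0.0], [0.0, 1.0]]    # two query token embeddings
    page_a = [[1.0, 0.0], [0.0, 1.0]]   # each query token has an exact match
    page_b = [[0.5, 0.5], [0.5, 0.0]]   # only partial matches
    print(maxsim(query, page_a), maxsim(query, page_b))
    print(rank_pages(query, [page_b, page_a]))
```

Each query token picks its single best-matching patch, so a page scores well only when every part of the query finds support somewhere on the page. The ranked list is what metrics such as NDCG@10 are then computed on.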

Status & Roadmap

  • Day-0 setup (2026-05-18) — repo + scaffolding
  • Day 1: ColQwen2.5-3B-multilingual zero-shot baseline on ViDoRe V3
  • Day 2-4: Turkish corpus build (mevzuat.gov.tr + KAP + YÖK + synthetic e-invoices)
  • Day 5-7: Stage-1 training (English-heavy ColPali recipe)
  • Day 8: First ViDoRe V3 baseline submission
  • Day 9-10: Stage-2 fine-tune (Turkish-heavy 5:1)
  • Day 11: Reranker + model merging
  • Day 12: Productization (FastAPI + Qdrant + Streamlit + Docker Compose)
  • Day 13-14: Demo video + LinkedIn launch + ViDoRe-TR community PR

Detailed plan: see docs/.

License

  • Code: MIT — see LICENSE
  • Datasets & base models: see LICENSE-NOTICE.md for the full matrix
  • KVKK: No real PII used. All Turkish data is synthetic (Faker-tr with invalidated TCKN/IBAN) or public domain (mevzuat.gov.tr, KAP, YÖK). See docs/KVKK-compliance.md (post-v0.1).
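
As an illustration of what an "invalidated TCKN" could look like, the sketch below builds an 11-digit identifier that has the right shape but deliberately fails the public TCKN checksum rule, so it cannot correspond to a real person's number. This is an assumption about one possible approach, not the repository's actual pipeline. The helper names are hypothetical, Faker is not used here, and IBAN invalidation is not covered.

```python
import random
from typing import List, Tuple


def tckn_checksums(first_nine: List[int]) -> Tuple[int, int]:
    """Compute the 10th and 11th check digits from the first nine digits."""
    odd = sum(first_nine[0::2])    # digits in positions 1, 3, 5, 7, 9
    even = sum(first_nine[1::2])   # digits in positions 2, 4, 6, 8
    d10 = (odd * 7 - even) % 10
    d11 = (sum(first_nine) + d10) % 10
    return d10, d11


def is_valid_tckn(value: str) -> bool:
    """Check length, leading digit, and both check digits."""
    if len(value) != 11 or not (value.isascii() and value.isdigit()) or value[0] == "0":
        return False
    digits = [int(c) for c in value]
    return (digits[9], digits[10]) == tckn_checksums(digits[:9])


def make_invalid_tckn(rng: random.Random) -> str:
    """Generate a TCKN-shaped string whose last check digit is deliberately wrong."""
    first_nine = [rng.randint(1, 9)] + [rng.randint(0, 9) for _ in range(8)]
    d10, d11 = tckn_checksums(first_nine)
    broken_d11 = (d11 + 1) % 10    # off by one, so the checksum can never pass
    return "".join(str(d) for d in first_nine + [d10, broken_d11])


if __name__ == "__main__":
    sample = make_invalid_tckn(random.Random(0))
    print(sample, is_valid_tckn(sample))
```

Running the validator over every generated identifier before it is rendered into a synthetic document gives a cheap guard against a valid number slipping into the corpus.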

Author

Mert Karatay — AI & Network Security Engineer, İstanbul. Background: ConvNeXt-Tiny + ArcFace + Triplet metric learning (TÜBİTAK 2209-A signature verification, CEDAR 0.9844 AUC). HuggingFace: Verm1ion.

Citation

@misc{karatay2026colturkvdr,
  author = {Karatay, Mert},
  title  = {ColTurk-VDR: A Late-Interaction Visual Document Retriever for Turkish Enterprise Documents},
  year   = {2026},
  url    = {https://github.com/Verm1lion/ColTurk-VDR}
}

Acknowledgments

  • ColPali — late-interaction visual retrieval framework (Faysse et al., ICLR 2025)
  • NVIDIA Nemotron ColEmbed v2 — model merging + hard negative mining recipe
  • Qwen3-VL — Apache-2.0 base vision-language model
  • Trendyol Tech — ConvNeXt+ArcFace ecosystem inspiration (1B image vector search blog)
