Accepted to WACV 2026
Conditional image embeddings are feature representations that focus on specific aspects of an image indicated by a given textual condition (e.g., color, genre). DIOR is a training-free approach that prompts a Large Vision-Language Model (LVLM) to describe an image with a single word related to a given condition. The hidden state vector of the LVLM's last token is then extracted as the conditional image embedding.
DIOR is a versatile solution that applies to any image and condition without additional training or task-specific priors. Experiments on conditional image similarity tasks show that DIOR outperforms existing training-free baselines, including CLIP, and in multiple settings also outperforms methods that require additional training.
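As a rough illustration of the extraction step described above, the sketch below uses toy numbers in place of real LVLM activations. It builds a single-word prompt for a condition, picks the hidden state of the last token in the last layer, and compares two images by cosine similarity. The prompt wording, the function names (`build_prompt`, `select_embedding`, `cosine_similarity`), and the nested-list layout `[layer][token][dim]` are assumptions made for this example, not the repository's actual code.

```python
import math


def build_prompt(condition: str) -> str:
    # Hypothetical prompt wording; the real templates live in the repository.
    return f"Describe the {condition} of this image in one word."


def select_embedding(hidden_states, layer=-1, token=-1):
    """Pick one token's hidden state from a [layer][token][dim] nested list."""
    return list(hidden_states[layer][token])


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy hidden states standing in for LVLM outputs: 2 layers, 3 tokens, 2 dims.
image_a = [
    [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]],
    [[1.0, 0.0], [0.0, 1.0], [3.0, 4.0]],
]
image_b = [
    [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]],
    [[0.0, 1.0], [1.0, 0.0], [6.0, 8.0]],
]

prompt = build_prompt("color")
emb_a = select_embedding(image_a)  # last layer, last token
emb_b = select_embedding(image_b)
similarity = cosine_similarity(emb_a, emb_b)
print(prompt)
print(similarity)
```

With a real model, the nested list would be replaced by the hidden states returned for the prompt and image, and the same selection and comparison would apply.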
| Processor | Description |
|---|---|
| `DIOREmbeddingProcessor` | DIOR: extract embeddings from VLM hidden states, with KV cache support |
| `CLIPEmbeddingProcessor` | Extract embeddings from CLIP models |
| `GenerativeEmbeddingProcessor` | VLM generates text -> text encoder (Sentence-T5) creates embeddings |
| `InDiReCTEmbeddingProcessor` | CLIP + DimRedRecon transformation (baseline method) |
Install the dependencies:

    uv sync

For development tools:

    uv sync --group dev

Download the datasets used in our experiments and place them under `datasets/`. Each dataset directory must contain image files directly (no class subdirectories), because `evaluate.py` recovers labels from filenames / CSV metadata.

    bash scripts/download_datasets.sh
    # or target a custom location:
    bash scripts/download_datasets.sh /path/to/datasets

This automatically downloads and flattens `cub200` and `movie_posters`. For datasets that require authentication or manual terms-of-use acceptance (`cars196`, `deepfashion`), the script prints the steps you need to follow.

- Synthetic Cars (`synthetic_cars`): the original download link (konstantinkobs/DML-analysis) is no longer accessible. You will need to obtain the images from another mirror and place them (flat) under `datasets/synthetic_cars/`.
- Stanford Cars (`cars196`): Cars196 on Kaggle (requires a Kaggle account).
- DeepFashion In-Shop (`deepfashion`): InShop (requires terms acceptance). Download `img.zip` from the In-Shop folder on Google Drive, place it at `datasets/deepfashion/img.zip`, and re-run `scripts/download_datasets.sh`. The script will extract and flatten the images to match `metadata/deepfashon_metadata.csv` (4,000 test images).
- CUB-200-2011 (`cub200`): CUB200 (public, fully automated).
- Movie Posters (`movie_posters`): Movie Posters (public, fully automated).

After setup, the directory structure should look like:
datasets/
├── synthetic_cars/
├── cub200/
├── movie_posters/
├── deepfashion/
└── cars196/
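
Because each dataset directory is expected to hold images directly, a quick check before running inference can catch leftover class subdirectories. The following is a minimal sketch, not part of the repository: the helper name `check_flat_layout` and the set of image extensions are assumptions. The demo builds a temporary directory so it can run anywhere.

```python
import tempfile
from pathlib import Path

# Assumed image extensions; adjust to match the files you actually have.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def check_flat_layout(dataset_dir: Path) -> dict:
    """Count images placed directly in dataset_dir and list any subdirectories."""
    entries = list(dataset_dir.iterdir())
    images = [
        p for p in entries
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    ]
    subdirs = sorted(p.name for p in entries if p.is_dir())
    return {
        "num_images": len(images),
        "subdirs": subdirs,
        "is_flat": bool(images) and not subdirs,
    }


# Demo on a throwaway directory: one flat dataset, one with a class subdirectory.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)

    flat = root / "cub200"
    flat.mkdir()
    (flat / "a.jpg").touch()
    (flat / "b.png").touch()

    nested = root / "cars196"
    nested.mkdir()
    (nested / "class_0").mkdir()
    (nested / "class_0" / "c.jpg").touch()

    flat_report = check_flat_layout(flat)
    nested_report = check_flat_layout(nested)

print(flat_report)
print(nested_report)
```

To check a real setup, call `check_flat_layout(Path("datasets") / name)` for each dataset name in the tree above.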

DIOR (proposed method):

    uv run python inference.py \
      --model_id Qwen/Qwen2.5-VL-7B-Instruct \
      --dataset_name cub200 \
      --prompt_type describe \
      --num_layer -1 \
      --num_token -1

DIOR with KV cache:

    uv run python inference.py \
      --model_id Qwen/Qwen2.5-VL-7B-Instruct \
      --dataset_name cub200 \
      --prompt_type describe \
      --use_cache

CLIP baseline:

    uv run python inference.py \
      --model_id openai/clip-vit-large-patch14 \
      --dataset_name cub200 \
      --prompt_type describe

Generative baseline (the VLM generates text, then a text encoder embeds it):

    uv run python inference.py \
      --model_id Qwen/Qwen2.5-VL-7B-Instruct \
      --dataset_name cub200 \
      --prompt_type describe \
      --generative \
      --text_encoder_id sentence-transformers/sentence-t5-base

InDiReCT is an optional external dependency. Clone the original implementation before running `--indirect`:

    uv run poe indirect-setup

If the upstream repository changes, override the clone source:

    INDIRECT_REPO_URL=https://github.com/<owner>/<repo>.git uv run poe indirect-setup

InDiReCT baseline:

    uv run python inference.py \
      --model_id openai/clip-vit-large-patch14 \
      --dataset_name cub200 \
      --indirect \
      --indirect_num_components 128

Evaluate the extracted embeddings:

    uv run python evaluate.py \
      --embedding_dir ./embeddings \
      --setting_pattern "cub200-*"

Lint and type-check:

    uv run ruff check .
    uv run mypy

Supported datasets and their conditions:

- `synthetic_cars`: Synthetic car images (car_model, car_color, background_color)
- `cars196`: Stanford Cars dataset (car_model only; the `manufacturer` and `car_type` aspects reported in the paper are not evaluable here because the corresponding metadata is not publicly available from prior work)
- `cub200`: CUB-200-2011 bird dataset (bird_species)
- `movie_posters`: Movie poster dataset (genre, country)
- `deepfashion`: DeepFashion dataset (clothing_category, texture, fabric, fit)

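
To show how embeddings can be scored against one condition's labels, here is a small retrieval sketch. The metric is an assumption: this README does not state which metric `evaluate.py` computes, so Recall@1 under cosine similarity is used as a stand-in. The function names, toy embeddings, and toy labels are all made up for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def recall_at_1(embeddings, labels):
    """Fraction of items whose nearest other item shares their label."""
    hits = 0
    for i, query in enumerate(embeddings):
        best_j, best_sim = None, -math.inf
        for j, candidate in enumerate(embeddings):
            if i == j:
                continue  # never retrieve the query itself
            sim = cosine_similarity(query, candidate)
            if sim > best_sim:
                best_j, best_sim = j, sim
        hits += labels[best_j] == labels[i]
    return hits / len(embeddings)


# Toy embeddings that cluster by species (hypothetical values).
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
species = ["sparrow", "sparrow", "finch", "finch"]
color = ["brown", "gray", "brown", "gray"]

species_score = recall_at_1(embeddings, species)
color_score = recall_at_1(embeddings, color)
print(species_score, color_score)
```

The same embeddings score well for one condition and poorly for another, which is why a separate embedding is extracted per condition.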
conditional-image-embeddings/
├── inference.py # Main entry point
├── evaluate.py # Evaluation script
├── processors/
│ ├── base.py # BaseEmbeddingProcessor
│ ├── clip.py # CLIPEmbeddingProcessor
│ ├── dior.py # DIOREmbeddingProcessor (Proposed)
│ ├── generative.py # GenerativeEmbeddingProcessor
│ └── indirect.py # InDiReCTEmbeddingProcessor
├── entities/
│ ├── config.py # InferenceConfig, TextEncoder
│ ├── dataset.py # Dataset configurations
│ ├── prompt.py # Prompt templates
│ ├── embedding.py # EmbeddingOutput
│ └── indirect_texts.py # InDiReCT text descriptions
├── utils/
│ ├── get_models.py # Model loading utilities
│ └── utils.py # Helper functions
├── utils/indirect/ # Optional external InDiReCT clone (gitignored)
│ └── dimensionality_reduction.py # DimRedRecon from InDiReCT
├── metadata/ # Dataset metadata files
├── datasets/ # Image datasets
└── embeddings/ # Output embeddings
If you find this work useful, please cite our paper:
@InProceedings{Kawarada_2026_WACV,
author = {Kawarada, Masayuki and Yamada, Kosuke and Tejero-de-Pablos, Antonio and Inoue, Naoto},
title = {Training-free Conditional Image Embedding Framework Leveraging Large Vision Language Models},
booktitle = {Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
month = {March},
year = {2026},
pages = {7636-7646}
}

- InDiReCT: Indirect Dimensionality Reduction for Conditional Embeddings (paper)
- CLIP: Learning Transferable Visual Models From Natural Language Supervision
- Sentence-T5: Sentence-T5: Scalable Sentence Encoders from Pre-trained Text-to-Text Models
- Qwen2.5-VL: Qwen2.5-VL Technical Report
This project is released under the MIT License.