Training-free Conditional Image Embedding Framework Leveraging Large Vision Language Models

Accepted to WACV 2026

Conditional image embeddings are feature representations that focus on specific aspects of an image indicated by a given textual condition (e.g., color, genre). DIOR is a training-free approach that prompts a Large Vision-Language Model (LVLM) to describe an image with a single word related to a given condition. The hidden state vector of the LVLM's last token is then extracted as the conditional image embedding.
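The idea can be sketched in a few lines. The snippet below is a minimal illustration, not the repository's implementation: the prompt wording and the helper names `build_prompt` and `last_token_embedding` are hypothetical, and a small nested list stands in for the per-layer hidden states an LVLM would return.

```python
from typing import List, Sequence


def build_prompt(condition: str) -> str:
    # Hypothetical wording; the repository keeps its real templates in entities/prompt.py.
    return f"Describe the {condition} of this image in one word:"


def last_token_embedding(
    hidden_states: Sequence[Sequence[Sequence[float]]],
    layer: int = -1,
    token: int = -1,
) -> List[float]:
    """Pick one token's vector from hidden states shaped [layer][token][dim]."""
    return list(hidden_states[layer][token])


# Toy hidden states: 2 layers, 3 tokens, 2 dimensions.
toy_hidden_states = [
    [[0.0, 0.1], [0.2, 0.3], [0.4, 0.5]],
    [[1.0, 1.1], [1.2, 1.3], [1.4, 1.5]],
]

prompt = build_prompt("color")
embedding = last_token_embedding(toy_hidden_states)
print(prompt)
print(embedding)  # [1.4, 1.5]
```

The default indices of -1 select the final layer and the final token, which is the position where the model is about to produce its one-word answer.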

Overview

DIOR can be applied to any image and condition without additional training or task-specific priors. Experiments on conditional image similarity tasks show that DIOR outperforms existing training-free baselines, including CLIP, and that in multiple settings it also outperforms methods that require additional training.

Embedding Processors

| Processor | Description |
| --- | --- |
| `DIOREmbeddingProcessor` | DIOR: extract embeddings from VLM hidden states, with KV cache support |
| `CLIPEmbeddingProcessor` | Extract embeddings from CLIP models |
| `GenerativeEmbeddingProcessor` | VLM generates text -> text encoder (Sentence-T5) creates embeddings |
| `InDiReCTEmbeddingProcessor` | CLIP + DimRedRecon transformation (baseline method) |

Installation

uv sync

For development tools:

uv sync --group dev

Dataset Preparation

Download the datasets used in our experiments and place them under datasets/. Each dataset directory must contain the image files directly (no class subdirectories), because evaluate.py recovers labels from filenames or CSV metadata.
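As a rough illustration of label recovery from CSV metadata, the sketch below maps filenames to labels for one aspect. The helper `load_labels`, the column names, and the example rows are all hypothetical; the real files under metadata/ may be organized differently.

```python
import csv
import io
from typing import Dict


def load_labels(csv_text: str, filename_column: str, label_column: str) -> Dict[str, str]:
    """Map each image filename to its label for one aspect."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row[filename_column]: row[label_column] for row in reader}


# Hypothetical metadata; real column names in metadata/ may differ.
example_csv = (
    "filename,genre,country\n"
    "poster_001.jpg,drama,JP\n"
    "poster_002.jpg,comedy,US\n"
)

genre_labels = load_labels(example_csv, "filename", "genre")
print(genre_labels)
```

Because labels are keyed by filename, the images themselves can sit in a single flat directory.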

One-shot download script

bash scripts/download_datasets.sh
# or target a custom location:
bash scripts/download_datasets.sh /path/to/datasets

This automatically downloads and flattens cub200 and movie_posters. For datasets that require authentication or manual terms-of-use acceptance (cars196, deepfashion), the script prints the steps you need to follow.
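If you obtain a dataset manually and need to flatten it yourself, something along these lines may help. This is a sketch under assumptions, not a port of scripts/download_datasets.sh: the helper `flatten_images` and its naming scheme (folding the relative path into the filename) are hypothetical, so check that the resulting filenames match what the metadata expects.

```python
import shutil
import tempfile
from pathlib import Path
from typing import List

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def flatten_images(source_dir: Path, target_dir: Path) -> List[str]:
    """Copy every image under source_dir into target_dir with no subdirectories.

    The relative path is folded into the filename (a/b.jpg -> a_b.jpg) so that
    files with the same name in different folders do not overwrite each other.
    This naming scheme is an assumption, not necessarily what the script uses.
    """
    target_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for path in sorted(source_dir.rglob("*")):
        if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES:
            flat_name = "_".join(path.relative_to(source_dir).parts)
            shutil.copy2(path, target_dir / flat_name)
            copied.append(flat_name)
    return copied


# Demo on a throwaway directory tree.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "raw"
    (src / "class_a").mkdir(parents=True)
    (src / "class_b").mkdir(parents=True)
    (src / "class_a" / "0001.jpg").write_bytes(b"")
    (src / "class_b" / "0001.jpg").write_bytes(b"")
    (src / "class_b" / "notes.txt").write_text("not an image")
    flat_files = flatten_images(src, Path(tmp) / "flat")

print(flat_files)  # ['class_a_0001.jpg', 'class_b_0001.jpg']
```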

Dataset sources

Synthetic Cars (synthetic_cars) — the original download link (konstantinkobs/DML-analysis) is no longer accessible. You will need to obtain the images from another mirror and place them (flat) under datasets/synthetic_cars/.
Stanford Cars (cars196) — Cars196 on Kaggle (requires a Kaggle account).
DeepFashion In-Shop (deepfashion) — InShop (requires terms acceptance). Download img.zip from the In-Shop folder on Google Drive, place it at datasets/deepfashion/img.zip, and re-run scripts/download_datasets.sh — the script will extract and flatten the images to match metadata/deepfashon_metadata.csv (4,000 test images).
CUB-200-2011 (cub200) — CUB200 (public, fully automated).
Movie Posters (movie_posters) — Movie Posters (public, fully automated).

After setup, the directory structure should look like:

datasets/
├── synthetic_cars/
├── cub200/
├── movie_posters/
├── deepfashion/
└── cars196/
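A quick sanity check of this layout might look like the following. The helper `check_layout` is hypothetical and not part of the repository; it only verifies that each expected directory exists and has no subdirectories.

```python
import tempfile
from pathlib import Path
from typing import Dict

EXPECTED_DATASETS = ["synthetic_cars", "cub200", "movie_posters", "deepfashion", "cars196"]


def check_layout(root: Path) -> Dict[str, str]:
    """Return a problem description for each dataset directory that looks wrong."""
    problems = {}
    for name in EXPECTED_DATASETS:
        dataset_dir = root / name
        if not dataset_dir.is_dir():
            problems[name] = "missing directory"
        elif any(child.is_dir() for child in dataset_dir.iterdir()):
            problems[name] = "contains subdirectories (images must be flat)"
    return problems


# Demo: one correct dataset, one nested dataset, three missing.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "cub200").mkdir()
    (root / "cub200" / "bird_0001.jpg").write_bytes(b"")
    (root / "cars196" / "train").mkdir(parents=True)
    layout_problems = check_layout(root)

print(layout_problems)
```

To check a real setup, call `check_layout(Path("datasets"))` from the repository root.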

Usage

DIOR (Proposed Method)

uv run python inference.py \
    --model_id Qwen/Qwen2.5-VL-7B-Instruct \
    --dataset_name cub200 \
    --prompt_type describe \
    --num_layer -1 \
    --num_token -1

With KV Cache (faster for multiple aspects)

uv run python inference.py \
    --model_id Qwen/Qwen2.5-VL-7B-Instruct \
    --dataset_name cub200 \
    --prompt_type describe \
    --use_cache
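Why caching helps with multiple aspects can be illustrated with a toy memoization example. This is not a real KV cache and not the repository's code: `ToyPrefixCache` and `embed_aspects` are hypothetical, and simple integer arithmetic stands in for model forward passes. The assumption being illustrated is that the expensive part of the input (the image and any shared prompt prefix) is identical across aspects, so it only needs to be processed once per image.

```python
from typing import Dict, List, Tuple


class ToyPrefixCache:
    """Stand-in for a KV cache: remembers the result of processing a shared prefix."""

    def __init__(self) -> None:
        self._states: Dict[str, Tuple[int, ...]] = {}
        self.prefix_passes = 0  # how many times the expensive prefix pass ran

    def prefix_state(self, image_id: str) -> Tuple[int, ...]:
        if image_id not in self._states:
            self.prefix_passes += 1
            # Placeholder for the costly forward pass over the image tokens.
            self._states[image_id] = tuple(ord(ch) for ch in image_id)
        return self._states[image_id]


def embed_aspects(cache: ToyPrefixCache, image_id: str, aspects: List[str]) -> Dict[str, int]:
    results = {}
    for aspect in aspects:
        state = cache.prefix_state(image_id)  # reused after the first aspect
        # Placeholder for the cheap pass over the short aspect-specific suffix.
        results[aspect] = sum(state) + len(aspect)
    return results


cache = ToyPrefixCache()
poster_embeddings = embed_aspects(cache, "poster_001.jpg", ["genre", "country"])
print(cache.prefix_passes)  # 1: the prefix was processed once for two aspects
```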

CLIP Embedding (Baseline)

uv run python inference.py \
    --model_id openai/clip-vit-large-patch14 \
    --dataset_name cub200 \
    --prompt_type describe

Generative Mode (VLM -> Text -> Sentence-T5)

uv run python inference.py \
    --model_id Qwen/Qwen2.5-VL-7B-Instruct \
    --dataset_name cub200 \
    --prompt_type describe \
    --generative \
    --text_encoder_id sentence-transformers/sentence-t5-base
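The two-stage structure of this mode can be sketched with stand-in functions. Everything below is hypothetical: `generative_embedding`, `fake_vlm`, and `fake_text_encoder` are illustrative stubs, not the repository's API, and the letter-count "encoder" merely plays the role that Sentence-T5 plays in the real pipeline.

```python
from typing import Callable, List


def generative_embedding(
    image_id: str,
    condition: str,
    generate_text: Callable[[str, str], str],
    encode_text: Callable[[str], List[float]],
) -> List[float]:
    """Two-stage pipeline: a VLM writes text, then a text encoder embeds it."""
    description = generate_text(image_id, condition)
    return encode_text(description)


# Stubs standing in for the real models (hypothetical outputs).
def fake_vlm(image_id: str, condition: str) -> str:
    answers = {("bird_1.jpg", "color"): "red", ("bird_2.jpg", "color"): "red"}
    return answers[(image_id, condition)]


def fake_text_encoder(text: str) -> List[float]:
    # Letter counts over a tiny alphabet; a real text encoder returns dense learned features.
    return [float(text.count(ch)) for ch in "der"]


emb_1 = generative_embedding("bird_1.jpg", "color", fake_vlm, fake_text_encoder)
emb_2 = generative_embedding("bird_2.jpg", "color", fake_vlm, fake_text_encoder)
print(emb_1 == emb_2)  # True: identical text yields identical embeddings
```

The example highlights a property of this design: once two images produce the same text, the text encoder cannot tell them apart.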

InDiReCT Baseline

InDiReCT is an optional external dependency. Clone the original implementation before running --indirect:

uv run poe indirect-setup

If the upstream repository changes, override the clone source:

INDIRECT_REPO_URL=https://github.com/<owner>/<repo>.git uv run poe indirect-setup

Then run the baseline:

uv run python inference.py \
    --model_id openai/clip-vit-large-patch14 \
    --dataset_name cub200 \
    --indirect \
    --indirect_num_components 128

Evaluation

uv run python evaluate.py \
    --embedding_dir ./embeddings \
    --setting_pattern "cub200-*"

Development

uv run ruff check .
uv run mypy

Supported Datasets

synthetic_cars - Synthetic car images (car_model, car_color, background_color)
cars196 - Stanford Cars dataset (car_model only; the manufacturer and car_type aspects reported in the paper are not evaluable here because the corresponding metadata is not publicly available from prior work)
cub200 - CUB-200-2011 bird dataset (bird_species)
movie_posters - Movie poster dataset (genre, country)
deepfashion - DeepFashion dataset (clothing_category, texture, fabric, fit)
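The dataset and aspect names above can be captured in a small lookup table, for example to validate arguments before a long inference run. The dictionary below is transcribed from the list; `DATASET_ASPECTS` and `validate_aspect` are hypothetical names, and the repository's own configuration in entities/dataset.py may be structured differently.

```python
from typing import Dict, List

# Aspects per dataset, transcribed from the list above.
DATASET_ASPECTS: Dict[str, List[str]] = {
    "synthetic_cars": ["car_model", "car_color", "background_color"],
    "cars196": ["car_model"],
    "cub200": ["bird_species"],
    "movie_posters": ["genre", "country"],
    "deepfashion": ["clothing_category", "texture", "fabric", "fit"],
}


def validate_aspect(dataset_name: str, aspect: str) -> None:
    """Raise ValueError if the dataset or aspect is not in the supported list."""
    if dataset_name not in DATASET_ASPECTS:
        raise ValueError(f"Unknown dataset: {dataset_name}")
    if aspect not in DATASET_ASPECTS[dataset_name]:
        raise ValueError(f"{dataset_name} has no aspect {aspect!r}")


validate_aspect("movie_posters", "genre")  # passes silently
total_settings = sum(len(aspects) for aspects in DATASET_ASPECTS.values())
print(total_settings)  # 11
```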

Project Structure

conditional-image-embeddings/
├── inference.py              # Main entry point
├── evaluate.py               # Evaluation script
├── processors/
│   ├── base.py               # BaseEmbeddingProcessor
│   ├── clip.py               # CLIPEmbeddingProcessor
│   ├── dior.py               # DIOREmbeddingProcessor (Proposed)
│   ├── generative.py         # GenerativeEmbeddingProcessor
│   └── indirect.py           # InDiReCTEmbeddingProcessor
├── entities/
│   ├── config.py             # InferenceConfig, TextEncoder
│   ├── dataset.py            # Dataset configurations
│   ├── prompt.py             # Prompt templates
│   ├── embedding.py          # EmbeddingOutput
│   └── indirect_texts.py     # InDiReCT text descriptions
├── utils/
│   ├── get_models.py         # Model loading utilities
│   └── utils.py              # Helper functions
├── utils/indirect/           # Optional external InDiReCT clone (gitignored)
│   └── dimensionality_reduction.py  # DimRedRecon from InDiReCT
├── metadata/                 # Dataset metadata files
├── datasets/                 # Image datasets
└── embeddings/               # Output embeddings

Citation

If you find this work useful, please cite our paper:

@InProceedings{Kawarada_2026_WACV,
    author    = {Kawarada, Masayuki and Yamada, Kosuke and Tejero-de-Pablos, Antonio and Inoue, Naoto},
    title     = {Training-free Conditional Image Embedding Framework Leveraging Large Vision Language Models},
    booktitle = {Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
    month     = {March},
    year      = {2026},
    pages     = {7636-7646}
}

References

InDiReCT: Indirect Dimensionality Reduction for Conditional Embeddings
CLIP: Learning Transferable Visual Models From Natural Language Supervision
Sentence-T5: Sentence-T5: Scalable Sentence Encoders from Pre-trained Text-to-Text Models
Qwen2.5-VL: Qwen2.5-VL Technical Report

License

This project is released under the MIT License.
