Discussion: Proposals for Real-World Validation of Vector Search in BLite #72

mrdevrobot · Maintainer · 2026-04-24T22:28:00Z

Context

The Vector Search feature is a critical addition to BLite. To ensure its reliability, performance, and accuracy (recall/precision) in production-like scenarios, we need a standardized approach for populating data and validating search results against expected outcomes.

I am looking for community feedback and ideas on how to structure a validation project that aligns with the project's C# ecosystem.

Proposed Strategy

1. Data Population Scenarios

We should identify a few "Standard Datasets" to test different vector dimensions and distributions:

Text Embeddings: Using models like all-MiniLM-L6-v2 (384 dims) or OpenAI text-embedding-3-small (1536 dims).
Synthetic Data: Generating high-dimensional Gaussian clusters to test the indexing logic (HNSW/IVF) against mathematical ground truths.
Image Descriptors: Using CLIP-based vectors for visual similarity.
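As a rough illustration of the synthetic-data scenario, the sketch below generates Gaussian clusters with a fixed seed so that runs are reproducible. It is written in Python with the standard library only for brevity; the function name, parameters, and default spread are assumptions made for this example and are not part of BLite.

```python
import random

def gaussian_clusters(n_clusters, points_per_cluster, dim, spread=0.05, seed=0):
    """Return (vectors, labels) drawn from isotropic Gaussian clusters.

    Hypothetical helper: cluster centers are sampled uniformly in [-1, 1]^dim,
    and each point is the center plus Gaussian noise with std-dev `spread`.
    """
    rng = random.Random(seed)  # fixed seed keeps the dataset reproducible
    vectors, labels = [], []
    for cluster_id in range(n_clusters):
        center = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
        for _ in range(points_per_cluster):
            vectors.append([rng.gauss(mu, spread) for mu in center])
            labels.append(cluster_id)
    return vectors, labels

vectors, labels = gaussian_clusters(n_clusters=4, points_per_cluster=10, dim=8)
```

Because the cluster label of every point is known, a query placed near a center has a predictable neighborhood, which makes index errors easier to spot than with real embeddings.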

2. Testing Framework Architecture

Following Clean Architecture principles, the validation project should be decoupled from the specific embedding provider:

Domain Layer: Define IVectorDocument, DistanceMetric (Cosine, Euclidean), and SearchQuery.
Infrastructure Layer:
- A DataSeeder that handles batch insertion into BLite.
- An EmbeddingProvider (e.g., using Microsoft.Extensions.AI or SmartComponents) to generate vectors on the fly.
Application/Testing Layer: A CLI tool or a suite of Benchmarks (using BenchmarkDotNet) to measure:
- Latency: Time to return Top-K results.
- Recall: How many of the "true" nearest neighbors (calculated via brute force) are found by the index.
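The latency measurement could look roughly like the sketch below. It is a language-neutral illustration in Python, not a substitute for BenchmarkDotNet; `search_fn` is a hypothetical callable standing in for a Top-K query against the index, and the warm-up count and nearest-rank p95 are assumptions of this example.

```python
import statistics
import time

def measure_latency(search_fn, queries, k, warmup=3):
    """Time each Top-K query and summarize latency in milliseconds."""
    for query in queries[:warmup]:      # warm-up calls are not recorded
        search_fn(query, k)
    samples_ms = []
    for query in queries:
        start = time.perf_counter()
        search_fn(query, k)
        samples_ms.append((time.perf_counter() - start) * 1000.0)
    samples_ms.sort()
    # Nearest-rank approximation of the 95th percentile.
    p95_index = min(len(samples_ms) - 1, int(0.95 * len(samples_ms)))
    total_s = sum(samples_ms) / 1000.0
    return {
        "queries": len(samples_ms),
        "mean_ms": statistics.mean(samples_ms),
        "p50_ms": statistics.median(samples_ms),
        "p95_ms": samples_ms[p95_index],
        "qps": len(samples_ms) / total_s if total_s > 0 else float("inf"),
    }
```

Reporting a percentile alongside the mean matters because a few slow queries can hide behind a good average.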

3. Expected Results & Validation (Ground Truth)

To validate the feature, we need a "Ground Truth" generator:

1. Load a dataset.
2. Calculate exact nearest neighbors using a simple linear scan (Brute Force).
3. Store these IDs as the expected result for a specific query vector.
4. Execute the same query using the BLite Vector Index.
5. Compare the results and calculate the Recall@K score.
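The ground-truth steps above can be sketched as follows. This is a minimal Python illustration assuming cosine distance and non-zero vectors; the function names are hypothetical, and `actual_ids` stands for whatever IDs the index under test returns.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def brute_force_top_k(dataset, query, k, distance=cosine_distance):
    """Exact nearest neighbors via a linear scan; returns row indices."""
    ranked = sorted(range(len(dataset)), key=lambda i: distance(dataset[i], query))
    return ranked[:k]

def recall_at_k(expected_ids, actual_ids, k):
    """Fraction of the true top-k neighbors that the index also returned."""
    expected = set(expected_ids[:k])
    return len(expected & set(actual_ids[:k])) / len(expected)

dataset = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
expected = brute_force_top_k(dataset, [1.0, 0.1], k=2)
```

Recall@K defined this way ignores the order of results within the top K, which is the usual convention for approximate nearest-neighbor evaluation.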

Questions for the Community

What datasets would you find most representative for your use cases?
Should we provide a built-in utility in the BLite repository for generating these test benchmarks?
Are there specific distance metrics (besides Cosine and L2) that are mandatory for your real-world applications?

LeoYang06 · 2026-04-26T05:18:23Z

Hi @mrdevrobot,

Thanks for the mention and the continuous great work on BLite! I really appreciate how fast the project is evolving.

I would be more than happy to help validate the vector database module, specifically the HNSW + Cosine/IP capabilities. For context, I am currently developing an AI-powered smart photo management application. In my current architecture:

BLite: Stores foundational data and photo metadata.
DBreeze: Handles raw thumbnail blob storage.
USearch: Manages SigLIP semantic vectors for image-text alignment.

Once the BLite vector module is ready, I plan to extract a minimal benchmarking tool based on my real-world data flow. My idea is to pre-calculate a large set of SigLIP vectors and save them to a common data file to completely isolate the model inference overhead. Then, I can do a strict comparative benchmark focusing entirely on vector ingestion and query performance among BLite, USearch, and DBreeze (which also has a vector module). This will perfectly validate BLite's performance in a real-world .NET ecosystem!

Later on, when I introduce face recognition, it will be a great opportunity to validate the L2sq (Squared Euclidean) metric as well.

Regarding your questions:

What datasets would you find most representative for your use cases?
For my scenario, it's primarily dense vectors for cross-modal language and image similarity search (e.g., 768-dimensional SigLIP/CLIP embeddings). Standard multimodal datasets (like the MS-COCO embeddings) would be highly representative.
Should we provide a built-in utility in the BLite repository for generating these test benchmarks?
Yes, definitely. A built-in benchmarking utility ensures the community can easily reproduce performance metrics and helps track potential regressions in future updates. If the utility could support loading industry-standard dataset formats (such as the HDF5 format used in ann-benchmarks), it would greatly boost credibility and make evaluating BLite against other engines much easier.
Are there specific distance metrics that are mandatory?
Not at the moment. Cosine, Inner Product (IP), and L2sq (Squared Euclidean) cover all my current and planned application requirements.
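For reference, the three metrics can be written out as below. This is a small Python sketch with hypothetical function names; it also checks a standard identity that is useful when validating an index: for unit-length vectors, squared Euclidean distance equals 2 - 2·IP, which is twice the cosine distance, so all three metrics rank neighbors identically on normalized embeddings.

```python
import math

def inner_product(a, b):
    return sum(x * y for x, y in zip(a, b))

def l2sq(a, b):
    """Squared Euclidean distance (no square root taken)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(inner_product(v, v))
    return [x / norm for x in v]

def cosine_distance(a, b):
    """1 - cosine similarity; assumes non-zero vectors."""
    return 1.0 - inner_product(normalize(a), normalize(b))

a = normalize([3.0, 4.0])
b = normalize([1.0, 0.0])
# On unit vectors: l2sq == 2 - 2 * IP == 2 * cosine distance.
assert math.isclose(l2sq(a, b), 2.0 - 2.0 * inner_product(a, b))
assert math.isclose(l2sq(a, b), 2.0 * cosine_distance(a, b))
```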

Looking forward to the upcoming updates and trying it out!

mrdevrobot · Maintainer, Author · 2026-04-30T14:50:59Z

Hi @LeoYang06,

I’ve been investigating the best approach for this. Please review my findings below to see if they make sense to you:

1. Application: We will develop a Console or Avalonia cross-platform application to allow users to upload their datasets in HDF5 format.
2. Data Ingestion: We will populate the BLite database using the vectors extracted from the dataset.
3. Execution: We will run the queries provided within the dataset itself.
4. Validation: We will compare the expected results (calculated via brute force) with the results obtained from the HNSW query.
5. Reporting: The process will conclude with a report detailing the accuracy/compliance of the HNSW query results against the expected benchmarks.
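One possible shape for the ingestion, execution, validation, and reporting steps is sketched below in Python. Everything here is an assumption for illustration: `BenchmarkDataset` mirrors the train/test/neighbors split mentioned later in this thread, HDF5 loading is left out, and `LinearScanIndex` is a hypothetical in-memory stand-in with `add` and `search` methods, not the BLite API.

```python
from dataclasses import dataclass

@dataclass
class BenchmarkDataset:
    train: list      # vectors to ingest
    test: list       # query vectors
    neighbors: list  # per-query expected ids from a brute-force scan

class LinearScanIndex:
    """Hypothetical stand-in for the index under test; always exact."""
    def __init__(self):
        self._vectors = []

    def add(self, vector):
        self._vectors.append(vector)

    def search(self, query, k):
        def l2sq(v):
            return sum((a - b) ** 2 for a, b in zip(v, query))
        order = sorted(range(len(self._vectors)), key=lambda i: l2sq(self._vectors[i]))
        return order[:k]

def run_validation(dataset, index, k):
    """Ingest, query, compare against ground truth, and build a report."""
    for vector in dataset.train:                       # data ingestion
        index.add(vector)
    recalls = []
    for query, expected in zip(dataset.test, dataset.neighbors):
        found = index.search(query, k)                 # execution
        truth = set(expected[:k])
        recalls.append(len(truth & set(found)) / len(truth))  # validation
    return {                                           # reporting
        "k": k,
        "queries": len(recalls),
        "mean_recall": sum(recalls) / len(recalls),
        "min_recall": min(recalls),
    }
```

Keeping the index behind two small methods means the same loop could drive any engine that can ingest a vector and answer a Top-K query.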

1 reply

LeoYang06 May 6, 2026

Hi @mrdevrobot,

Thanks for detailing the 5-step workflow. It looks very well thought out and aligns perfectly with how the industry-standard ann-benchmarks operates.

Seeing that you plan to build the utility around the standard HDF5 format actually gave me a very practical idea for my own use case. Instead of writing a custom data structure for benchmarking, I can export my real-world SigLIP vectors directly into this standard HDF5 format (including train, test, and the brute-force calculated neighbors).

This way, I can seamlessly plug my business dataset into your benchmarking tool. It would allow us to generate the classic Recall vs. QPS curves using real-world data, providing a very fair and objective baseline to compare BLite, USearch, and DBreeze side-by-side in a .NET environment.

Starting with a Console application is also a very pragmatic choice. It makes it straightforward to integrate into GitHub Actions/CI for automated performance regression testing. An Avalonia visualizer would definitely be a nice addition further down the line.
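A CI regression gate along these lines could be as small as the sketch below. The report keys, the function name, and the tolerance values are all assumptions chosen for illustration; a real pipeline would pick thresholds suited to its hardware noise.

```python
def check_regression(baseline, current, max_recall_drop=0.01, max_latency_increase=0.10):
    """Compare two benchmark reports and return a list of failure messages.

    Both reports are assumed to be dicts with "mean_recall" and "p95_ms" keys.
    """
    failures = []
    if current["mean_recall"] < baseline["mean_recall"] - max_recall_drop:
        failures.append(
            f"recall dropped from {baseline['mean_recall']:.3f} "
            f"to {current['mean_recall']:.3f}"
        )
    if current["p95_ms"] > baseline["p95_ms"] * (1.0 + max_latency_increase):
        failures.append(
            f"p95 latency rose from {baseline['p95_ms']:.2f} ms "
            f"to {current['p95_ms']:.2f} ms"
        )
    return failures

baseline = {"mean_recall": 0.98, "p95_ms": 2.0}
current = {"mean_recall": 0.90, "p95_ms": 2.1}
for message in check_regression(baseline, current):
    print("REGRESSION:", message)
```

In a CI job, the script would exit with a non-zero status whenever the returned list is non-empty, which is enough to fail the workflow step.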

Answer selected by mrdevrobot
