Discussion: Proposals for Real-World Validation of Vector Search in BLite #72
Context
The Vector Search feature is a critical addition to BLite. To ensure its reliability, performance, and accuracy (recall/precision) in production-like scenarios, we need a standardized approach for populating data and validating search results against expected outcomes. I am looking for community feedback and ideas on how to structure a validation project that aligns with the project's C# ecosystem.
Proposed Strategy
1. Data Population Scenarios
We should identify a few "Standard Datasets" to test different vector dimensions and distributions:
2. Testing Framework Architecture
Following Clean Architecture principles, the validation project should be decoupled from the specific embedding provider:
3. Expected Results & Validation (Ground Truth)
To validate the feature, we need a "Ground Truth" generator:
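As a rough illustration of what such a ground-truth generator might compute, here is a minimal sketch in plain Python (chosen only for brevity; the same logic ports directly to C#). It assumes cosine similarity as the metric and an exhaustive scan as the exact baseline. The function names `brute_force_top_k` and `recall_at_k` are hypothetical and not part of BLite.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def brute_force_top_k(query: Vector, corpus: Sequence[Vector], k: int) -> List[int]:
    """Exact nearest neighbours: score every corpus vector, keep the k best indices."""
    ranked = sorted(
        range(len(corpus)),
        key=lambda i: cosine_similarity(query, corpus[i]),
        reverse=True,
    )
    return ranked[:k]


def recall_at_k(approx_ids: Sequence[int], exact_ids: Sequence[int], k: int) -> float:
    """Fraction of the exact top-k that the approximate index also returned."""
    if k <= 0:
        raise ValueError("k must be positive")
    expected = set(exact_ids[:k])
    if not expected:
        return 0.0
    found = set(approx_ids[:k]) & expected
    return len(found) / len(expected)
```

In use, `exact_ids` would come from `brute_force_top_k` and `approx_ids` from the index under test; averaging `recall_at_k` over all test queries gives a single recall figure per configuration. The exhaustive scan is O(n) per query, so it is only meant for generating reference answers offline.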
Questions for the Community
Hi @mrdevrobot,
Thanks for the mention and the continuous great work on BLite! I really appreciate how fast the project is evolving. I would be more than happy to help validate the vector database module, specifically the HNSW + Cosine/IP capabilities.
For context, I am currently developing an AI-powered smart photo management application. In my current architecture:
Once the BLite vector module is ready, I plan to extract a minimal benchmarking tool based on my real-world data flow. My idea is to pre-calculate a large set of SigLIP vectors and save them to a common data file, so that model inference overhead is completely excluded from the measurements. I can then run a strict comparative benchmark focused entirely on vector ingestion and query performance across BLite, USearch, and DBreeze (which also has a vector module). This should validate BLite's performance in a real-world .NET setting.
Later on, when I introduce face recognition, it will be a good opportunity to validate the L2sq (Squared Euclidean) metric as well.
Regarding your questions:
Looking forward to the upcoming updates and trying it out!
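For readers less familiar with the three metrics named in this comment (Cosine, IP, L2sq), the sketch below shows one common way to define them, again in plain Python for brevity. The exact conventions used by BLite, USearch, or DBreeze (for example, whether inner product is reported as a similarity or converted to a distance) are not stated in this thread, so treat these definitions as assumptions.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def inner_product(a: Vector, b: Vector) -> float:
    """Inner (dot) product; larger means more similar."""
    return sum(x * y for x, y in zip(a, b))


def l2_squared(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance; smaller means more similar."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def normalize(v: Vector) -> List[float]:
    """Scale v to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        return list(v)
    return [x / norm for x in v]


def cosine_distance(a: Vector, b: Vector) -> float:
    """1 - cosine similarity (assumed convention); smaller means more similar."""
    return 1.0 - inner_product(normalize(a), normalize(b))
```

For unit-length vectors the three metrics rank neighbours identically, because the squared L2 distance equals 2 - 2 * (inner product). A benchmark that stores pre-normalized embeddings can therefore reuse one ground-truth file across all three metrics, while unnormalized data needs a separate ground truth per metric.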
Hi @LeoYang06, I’ve been investigating the best approach for this. Please review my findings below to see if they make sense to you:
Hi @mrdevrobot,
Thanks for detailing the 5-step workflow. It looks very well thought out and aligns perfectly with how the industry-standard ann-benchmarks operates.
Seeing that you plan to build the utility around the standard HDF5 format actually gave me a very practical idea for my own use case. Instead of writing a custom data structure for benchmarking, I can export my real-world SigLIP vectors directly into this standard HDF5 format (including train, test, and the brute-force calculated neighbors).
This way, I can seamlessly plug my business dataset into your benchmarking tool. It would allow us to generate the classic Recall vs. QPS curves using real-world data, providing a very fa…