Conversation
Force-pushed from fcce175 to 9931d0e
Pull Request Overview
This PR implements a comprehensive migration from the existing LLaVA multimodal architecture to a new MTMD (Multi-Modal Text+Data) implementation. The change introduces a more unified approach to handling multimodal inputs (images, audio, video) by replacing specialized LLaVA components with generic MTMD helpers that support multiple media types through a consistent tokenization and evaluation pipeline.
- Migration from LLaVA-specific classes to generic MTMD wrapper classes
- Introduction of new native API surface for MTMD tokenization and chunk-based evaluation
- Updated executors to use MTMD tokenization instead of direct image embedding evaluation
- Comprehensive test coverage for the new MTMD functionality
Reviewed Changes
Copilot reviewed 41 out of 41 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| SafeMtmdWeights.cs | New wrapper class for MTMD multimodal weights replacing LLavaWeights |
| NativeApi.Mtmd.cs | Native P/Invoke surface for MTMD helper functions |
| SafeMtmdModelHandle.cs | Native handle management for MTMD models with tokenization and evaluation |
| SafeMtmdInputChunks.cs | Managed wrapper for native chunk collections returned by tokenizer |
| SafeMtmdInputChunk.cs | Individual chunk wrapper with metadata access and token span views |
| SafeMtmdEmbed.cs | Media embedding wrapper supporting images, audio, and raw data buffers |
| LLamaInteractExecutor.cs | Updated interactive executor to use MTMD tokenization workflow |
| LLamaInstructExecutor.cs | Updated instruct executor with MTMD preprocessing logic |
| BatchedExecutor.cs | Added MTMD batch evaluation support for batched inference |
| Conversation.cs | Extended conversation class with multimodal prompting and media queueing |
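Putting the new classes together, here is a minimal sketch of the intended flow. Only `MtmdWeights.LoadFromFileAsync` and the class names in the table above come from this PR; the `MtmdParameters` type name and the `LoadMedia`, `Tokenize`, and `EvaluateChunksAsync` calls are illustrative placeholders, not the actual API surface.

```csharp
using LLama;
using LLama.Common;
using LLama.Native;

// Sketch only: anything marked "hypothetical" is an assumption for illustration.
var modelParams = new ModelParams("model.gguf");
using var model = await LLamaWeights.LoadFromFileAsync(modelParams);
using var context = model.CreateContext(modelParams);

var mtmdParameters = new MtmdParameters();   // hypothetical type/constructor name

// MtmdWeights replaces LLavaWeights as the wrapper around the multimodal projector.
using var mtmd = await MtmdWeights.LoadFromFileAsync("mmproj.gguf", model, mtmdParameters);

// The prompt carries a media marker where the image/audio should be spliced in.
var mediaMarker = mtmdParameters.MediaMarker ?? NativeApi.MtmdDefaultMarker() ?? "<media>";
var prompt = $"{mediaMarker}\nDescribe what you see.";

// Text + media are tokenized into chunks, and the chunks are evaluated in the context,
// instead of evaluating image embeddings directly as the LLaVA path did.
using SafeMtmdEmbed image = mtmd.LoadMedia("cat.jpg");              // hypothetical helper
using SafeMtmdInputChunks chunks = mtmd.Tokenize(prompt, image);    // hypothetical signature
await mtmd.EvaluateChunksAsync(context, chunks);                    // hypothetical signature
```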
martindevans left a comment
Thanks for all the hard work putting this together! Lots of small review nitpicks, but overall this looks really solid 👍
Version 25.0 just breaks multi-modal capabilities. Qwen2.5-VL-3B won't work at all. How do we load the weights from other multimodal models?

System.DllNotFoundException: 'Unable to load DLL 'llava_shared' or one of its dependencies: The specified module could not be found. (0x8007007E)'
Is it just me, or is the llama.cpp module not updated to the new binary version?
I think it's updated? It seems to point to
Changing single braces to double braces solves the problem if the prompt is in JSON format.
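(For anyone hitting this: a minimal illustration of the brace issue, assuming the prompt text is run through a .NET composite-format API such as string.Format at some point; the exact code path is not shown here.)

```csharp
// Single braces in literal JSON are parsed as format placeholders and throw.
var bad  = "Reply as JSON: { \"answer\": \"...\" }";
// Doubling the braces escapes them, so the literal { and } survive formatting.
var good = "Reply as JSON: {{ \"answer\": \"...\" }}";

// string.Format(bad)  -> FormatException
// string.Format(good) -> Reply as JSON: { "answer": "..." }
Console.WriteLine(string.Format(good));
```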
Now it seems to be working with these llama.cpp binaries: https://github.com/ggml-org/llama.cpp/releases/tag/b7679 I tested audio and video (image) models:
llama.cpp b7703 solves the issue
martindevans left a comment
LGTM! This is a huge amount of work, thanks very much @SignalRT ❤️
@SignalRT Thank you very much, huge work! ❤️

@SignalRT & martindevans, could you please explain the consequences? We all have llava_shared.dll linked and are using it with the former version of the library. Did you implement this as a breaking change, or as an extra option? What are the implications? Does this only change the internal workings, as in InteractiveExecutor, or also the interface to the outside world? Thank you!
In recent updates, llama.cpp removed the LLaVA libraries, and the interface in the mtmd libraries has been reworked. The good news is that the required changes are relatively small and mostly mechanical, even though the interface looks different after the refactor. You can see the adjustments reflected in:

So even though the interface changed enough to break compatibility, the precise code changes needed to adapt are limited, and the examples + the web UI PR provide a reference for the new flow and features.
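For a concrete sense of how mechanical the change is, here is a rough before/after sketch. The old call assumes the previous LLavaWeights.LoadFromFileAsync overload; the new call mirrors the example quoted in the next comment, with mtmdParameters being whatever MTMD parameter object you configure.

```csharp
// Before (LLaVA-era API):
using var clipModel = await LLavaWeights.LoadFromFileAsync(multiModalProj);

// After (MTMD API, as used in the updated examples):
using var clipModel = await MtmdWeights.LoadFromFileAsync(multiModalProj, model, mtmdParameters);
```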
Thank you, SignalRT! I looked at Examples/MtmdInteractiveModeExecute.cs and it seems that the only important changes we need are the following (please correct me if I'm wrong):

`using var clipModel = await MtmdWeights.LoadFromFileAsync(multiModalProj, model, mtmdParameters);`

and

`var mediaMarker = mtmdParameters.MediaMarker ?? NativeApi.MtmdDefaultMarker() ?? "<media>";`

martindevans, I think we need some kind of platform where breaking updates and major additions are explained by the people who introduce them, ideally alongside the PRs. Without this, users will install a new update, their code may break, or they may miss important new features.
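As a rough illustration of how those two lines could be combined (the prompt layout here is only an example, not taken from this PR):

```csharp
using var clipModel = await MtmdWeights.LoadFromFileAsync(multiModalProj, model, mtmdParameters);

// Insert the marker wherever the media should appear in the prompt text;
// the media itself is queued separately, following the updated example.
var mediaMarker = mtmdParameters.MediaMarker ?? NativeApi.MtmdDefaultMarker() ?? "<media>";
var prompt = $"{mediaMarker}\nUSER: What is shown in this image?\nASSISTANT:";
```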
Prototype implementation: