Conversation
Force-pushed from fcce175 to 9931d0e
Pull Request Overview
This PR implements a comprehensive migration from the existing LLaVA multimodal architecture to a new MTMD (Multi-Modal Text+Data) implementation. The change introduces a more unified approach to handling multimodal inputs (images, audio, video) by replacing specialized LLaVA components with generic MTMD helpers that support multiple media types through a consistent tokenization and evaluation pipeline.
- Migration from LLaVA-specific classes to generic MTMD wrapper classes
- Introduction of new native API surface for MTMD tokenization and chunk-based evaluation
- Updated executors to use MTMD tokenization instead of direct image embedding evaluation
- Comprehensive test coverage for the new MTMD functionality
Reviewed Changes
Copilot reviewed 41 out of 41 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| SafeMtmdWeights.cs | New wrapper class for MTMD multimodal weights replacing LLavaWeights |
| NativeApi.Mtmd.cs | Native P/Invoke surface for MTMD helper functions |
| SafeMtmdModelHandle.cs | Native handle management for MTMD models with tokenization and evaluation |
| SafeMtmdInputChunks.cs | Managed wrapper for native chunk collections returned by tokenizer |
| SafeMtmdInputChunk.cs | Individual chunk wrapper with metadata access and token span views |
| SafeMtmdEmbed.cs | Media embedding wrapper supporting images, audio, and raw data buffers |
| LLamaInteractExecutor.cs | Updated interactive executor to use MTMD tokenization workflow |
| LLamaInstructExecutor.cs | Updated instruct executor with MTMD preprocessing logic |
| BatchedExecutor.cs | Added MTMD batch evaluation support for batched inference |
| Conversation.cs | Extended conversation class with multimodal prompting and media queueing |
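Putting the new classes together, here is a minimal sketch of the intended flow. Only `MtmdWeights.LoadFromFileAsync` and the class names in the table above come from this PR; the `MtmdParameters` type name and the `LoadMedia`, `Tokenize`, and `EvaluateChunksAsync` calls are illustrative placeholders, not the actual API surface.

```csharp
using LLama;
using LLama.Common;
using LLama.Native;

// Sketch only: anything marked "hypothetical" is an assumption for illustration.
var modelParams = new ModelParams("model.gguf");
using var model = await LLamaWeights.LoadFromFileAsync(modelParams);
using var context = model.CreateContext(modelParams);

var mtmdParameters = new MtmdParameters();   // hypothetical type/constructor name

// MtmdWeights replaces LLavaWeights as the wrapper around the multimodal projector.
using var mtmd = await MtmdWeights.LoadFromFileAsync("mmproj.gguf", model, mtmdParameters);

// The prompt carries a media marker where the image/audio should be spliced in.
var mediaMarker = mtmdParameters.MediaMarker ?? NativeApi.MtmdDefaultMarker() ?? "<media>";
var prompt = $"{mediaMarker}\nDescribe what you see.";

// Text + media are tokenized into chunks, and the chunks are evaluated in the context,
// instead of evaluating image embeddings directly as the LLaVA path did.
using SafeMtmdEmbed image = mtmd.LoadMedia("cat.jpg");              // hypothetical helper
using SafeMtmdInputChunks chunks = mtmd.Tokenize(prompt, image);    // hypothetical signature
await mtmd.EvaluateChunksAsync(context, chunks);                    // hypothetical signature
```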
martindevans left a comment
Thanks for all the hard work putting this together! Lots of small review nitpicks, but overall this looks really solid 👍
Version 25.0 just breaks multi-modal capabilities. Qwen2.5-VL-3B won't work at all. How do we load the weights from other multimodal models?

System.DllNotFoundException: 'Unable to load DLL 'llava_shared' or one of its dependencies: The specified module could not be found. (0x8007007E)'
Is it just me, or is the llama.cpp module not updated to the new binary version?
I think it's updated? It seems to point to
Changing single braces to double braces solves the problem if the prompt is in JSON format.
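(For anyone hitting this: a minimal illustration of the brace issue, assuming the prompt text is run through a .NET composite-format API such as string.Format at some point; the exact code path is not shown here.)

```csharp
// Single braces in literal JSON are parsed as format placeholders and throw.
var bad  = "Reply as JSON: { \"answer\": \"...\" }";
// Doubling the braces escapes them, so the literal { and } survive formatting.
var good = "Reply as JSON: {{ \"answer\": \"...\" }}";

// string.Format(bad)  -> FormatException
// string.Format(good) -> Reply as JSON: { "answer": "..." }
Console.WriteLine(string.Format(good));
```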
Now it seems to be working with these llama.cpp binaries: https://github.com/ggml-org/llama.cpp/releases/tag/b7679 I tested audio and video (image) models:
llama.cpp b7703 solves the issue
martindevans left a comment
LGTM! This is a huge amount of work, thanks very much @SignalRT ❤️
@SignalRT Thank you very much, huge work! ❤️

@SignalRT & martindevans, could you please explain the consequences? We all have llava_shared.dll linked and are using it with the former version of the library. Did you implement this as a breaking change, or as an extra option? What are the implications? Does this only change the internal workings, as in InteractiveExecutor, or also the interface to the outside world? Thank you!
In recent updates, llama.cpp removed the LLaVA libraries, and the interface in the mtmd libraries has been reworked. The good news is that the required changes are relatively small and mostly mechanical, even though the interface looks different after the refactor. You can see the adjustments reflected in:

So even though the interface changed enough to break compatibility, the precise code changes needed to adapt are limited, and the examples + the web UI PR provide a reference for the new flow and features.
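For a concrete sense of how mechanical the change is, here is a rough before/after sketch. The old call assumes the previous LLavaWeights.LoadFromFileAsync overload; the new call mirrors the example quoted in the next comment, with mtmdParameters being whatever MTMD parameter object you configure.

```csharp
// Before (LLaVA-era API):
using var clipModel = await LLavaWeights.LoadFromFileAsync(multiModalProj);

// After (MTMD API, as used in the updated examples):
using var clipModel = await MtmdWeights.LoadFromFileAsync(multiModalProj, model, mtmdParameters);
```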
Thank you, SignalRT! I looked at Examples/MtmdInteractiveModeExecute.cs and it seems that the only important changes we need are the following (please correct me if I'm wrong):

`using var clipModel = await MtmdWeights.LoadFromFileAsync(multiModalProj, model, mtmdParameters);`

and

`var mediaMarker = mtmdParameters.MediaMarker ?? NativeApi.MtmdDefaultMarker() ?? "<media>";`

martindevans, I think we need some kind of platform where breaking updates and major additions are explained by the people who introduce them, ideally alongside the PRs. Without this, users will install a new update, their code may break, or they may miss important new features.
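As a rough illustration of how those two lines could be combined (the prompt layout here is only an example, not taken from this PR):

```csharp
using var clipModel = await MtmdWeights.LoadFromFileAsync(multiModalProj, model, mtmdParameters);

// Insert the marker wherever the media should appear in the prompt text;
// the media itself is queued separately, following the updated example.
var mediaMarker = mtmdParameters.MediaMarker ?? NativeApi.MtmdDefaultMarker() ?? "<media>";
var prompt = $"{mediaMarker}\nUSER: What is shown in this image?\nASSISTANT:";
```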
Prototype implementation: