New llama-run #17554
Conversation
(force-pushed 94a98cf → 500a90c)
@angt @ggerganov @CISC @ngxson PTAL
Considering these points:

So, I think it's better to repurpose one of the 3 binaries mentioned in my first point instead.
(force-pushed f839cfb → 5b0c817)
I think what's better to do at this point is to improve the usability of
(force-pushed 5b0c817 → fddbeed)
I'm on board with the idea, but can we push ahead with this PR and plan the refactor for a later stage? To be honest, I think you're the best person to handle the refactoring. You know exactly how you want the architecture to look, and it's hard for others to guess those specifics. If someone else tries, we risk getting stuck in a loop of corrections, so it's probably more efficient if you drive that part.
For the most part, the server code is already refactored to be reused in other tools/examples. I'm not sure what kind of refactoring you're talking about.
Moving the relevant code to llama-cli, llama-run, etc. or some other place.
I think your

This way, you simply make

In the future, we can make server a static (or maybe shared) lib target and reuse it in both server and run.
(force-pushed 0e3f326 → f7a7775)
(force-pushed f347cb1 → 2619c11)
People should give this a shot, the UX is quite neat: |
(force-pushed 2619c11 → f9ae221)
I've received mixed feedback in the past regarding granular file separation versus consolidation, so I am unsure of the preferred direction here. run-chat.cpp and run.cpp seem like a reasonable split between chat-focused activities and the other code required to get the tool running.
(force-pushed f9ae221 → a449065)
If somebody could restart the failing builds I'd appreciate it; I don't have any sort of maintainer access anymore, so I'm as limited as a random contributor.
The failed builds are not related to the server, we can ignore them anyway. Besides, I usually open a mirror PR on my fork to skip the long CI queue on the main repo; you can give that a try.
@ericcurtin I'm about to merge a refactoring that will greatly reduce the complexity of using server code as a CLI. Here is an example of a very simple CLI implementation via the OAI-compat schema (no HTTP server, the schema is sent directly to server_context): https://gist.github.com/ngxson/f3e18888e88d87184f785bf0d4458bda

I think I can go ahead and iterate from the draft version above (so it will supersede your PR). I think it would be useful to bring

So I'm wondering if it's OK with you if I firstly release a CLI without

Otherwise, you can also continue the current PR with the new server API (again, based on my gist above).
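The gist itself isn't reproduced here, but the core idea is easy to sketch. Below is a minimal, hypothetical illustration of the approach described above, where `server_context` and `handle_chat_completion` are stand-in names rather than the actual refactored API: the OAI-compat JSON schema is built in-process and handed straight to the server machinery, with no socket or HTTP parsing involved.

```cpp
#include <nlohmann/json.hpp> // vendored in llama.cpp
#include <cstdio>
#include <string>

using json = nlohmann::ordered_json;

// stand-in for the refactored server API from the gist (hypothetical)
struct server_context {
    json handle_chat_completion(const json & request) {
        (void) request;
        // a real implementation would tokenize, decode and sample here;
        // this stub just returns a canned OAI-compat response
        return {
            {"choices", json::array({
                {{"message", {{"role", "assistant"}, {"content", "hi!"}}}},
            })},
        };
    }
};

// one chat turn: the same JSON schema a client would POST to
// /v1/chat/completions, but passed directly to the server context
std::string run_one_turn(server_context & ctx, const std::string & user_msg) {
    json request = {
        {"messages", json::array({
            {{"role", "user"}, {"content", user_msg}},
        })},
        {"stream", false},
    };

    json response = ctx.handle_chat_completion(request);
    return response["choices"][0]["message"]["content"].get<std::string>();
}

int main() {
    server_context ctx;
    printf("%s\n", run_one_turn(ctx, "hello").c_str());
}
```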
@ngxson I'll try and apply your gist here, might make the build green at least... |
(force-pushed d522e11 → cded4e3)
@ngxson changes made PTAL |
ngxson left a comment:
On a high level this looks good (it is mergeable if CI is green).

The issue regarding logs should be addressed if you can, but it's not very important; otherwise I will address it when I move the llama-run code to llama-cli. There will be some additional work to allow features like multimodal or speculative decoding to work without having to go through the OAI API (which is inefficient for large inputs).
tools/run/run.cpp (outdated):

```cpp
    return 1;
    // Set default parallel processing for better performance
    // This is the same logic as in server.cpp
    if (params.n_parallel == 1 && params.kv_unified == false && !params.has_speculative()) {
```
We don't need n_parallel here, this should be removed.
tools/run/run.cpp (outdated):

```cpp
    // Set minimal output by default for llama-run (suppress INFO/WARN logs)
    // This must be set before parsing arguments because Docker model resolution
    // triggers logging during the parse phase
    common_log_set_verbosity_thold(-1);
```
This seems to suppress all error logs, including the logs shown when there is an invalid argument or the -h option is used.
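One possible direction, sketched under assumptions: lower the verbosity threshold only after argument parsing succeeds, so parse errors and -h output still print. common_params_parse, common_log_set_verbosity_thold and LLAMA_EXAMPLE_SERVER are existing llama.cpp helpers, but the placement is illustrative, and there is a trade-off: this would reintroduce the INFO logs emitted during Docker model resolution in the parse phase, which would then need separate handling.

```cpp
#include "arg.h"     // common_params_parse
#include "common.h"  // common_params
#include "log.h"     // common_log_set_verbosity_thold

int main(int argc, char ** argv) {
    common_params params;

    // parse first: invalid arguments and -h/--help print at normal verbosity
    if (!common_params_parse(argc, argv, params, LLAMA_EXAMPLE_SERVER)) {
        return 1;
    }

    // only now silence INFO/WARN for the interactive session
    common_log_set_verbosity_thold(-1);

    // ... rest of llama-run setup ...
    return 0;
}
```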
tools/run/CMakeLists.txt (outdated):

```cmake
if (NOT LLAMA_HTTPLIB)
    message(FATAL_ERROR "LLAMA_HTTPLIB is OFF, cannot build llama-run. Hint: to skip building run, set -DLLAMA_BUILD_RUN=OFF")
endif()
```
unused
tools/run/CMakeLists.txt (outdated):

```cmake
${SERVER_DIR}/server-http.cpp
${SERVER_DIR}/server-http.h
```
unused source files (server-http)
If readline.cpp is an external dependency, it's probably better to explicitly include it in scripts/sync_vendor.py and place it under the vendor directory instead.
- Added readline.cpp include
- Created run_chat_mode():
  - Initializes readline with command history
  - Maintains conversation history
  - Applies chat templates to format messages
  - Submits completion tasks to the server queue
  - Displays assistant responses interactively

Signed-off-by: Eric Curtin <[email protected]>
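A rough sketch of that run_chat_mode() flow, with every name hypothetical — the read_line, apply_chat_template, and submit_completion stand-ins below are not the readline.cpp or server APIs; only the loop structure follows the commit message:

```cpp
#include <iostream>
#include <string>
#include <vector>

struct chat_msg { std::string role, content; };

// stand-in for readline.cpp: a bare getline, no editing or history.
// the real library would add arrow-key editing and command history.
static bool read_line(const char * prompt, std::string & out) {
    std::cout << prompt << std::flush;
    return static_cast<bool>(std::getline(std::cin, out));
}

// stand-in chat template: a ChatML-like rendering of the conversation.
// the real code would apply the model's own template.
static std::string apply_chat_template(const std::vector<chat_msg> & history) {
    std::string prompt;
    for (const auto & msg : history) {
        prompt += "<|im_start|>" + msg.role + "\n" + msg.content + "<|im_end|>\n";
    }
    prompt += "<|im_start|>assistant\n";
    return prompt;
}

// stand-in for submitting a completion task to the server queue and
// blocking until the response arrives; here it just echoes a summary.
static std::string submit_completion(const std::string & prompt) {
    return "(model reply to " + std::to_string(prompt.size()) + "-char prompt)";
}

int main() {
    std::vector<chat_msg> history; // conversation history

    std::string line;
    while (read_line("> ", line)) {
        history.push_back({"user", line});

        const std::string prompt = apply_chat_template(history);
        const std::string reply  = submit_completion(prompt);

        std::cout << reply << "\n"; // display the assistant response
        history.push_back({"assistant", reply});
    }
}
```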
(force-pushed cded4e3 → 79069a0)
FYI, I recently have less time to work on the CLI (because of the Mistral release), but I will have a look and try to merge this later this week.
```python
# readline.cpp: multi-file library for interactive line editing
# sync manually - no upstream repository yet
# located in vendor/readline.cpp/
```
> no upstream repository yet

And in the README:

> - [readline.cpp](https://github.com/ericcurtin/readline.cpp) - C++ library that provides readline-like line editing capabilities, used by `llama-run` - MIT License

Seems contradictory.
I think

Btw @ggerganov, just want a confirmation that you are OK with adding
I wonder what the reasons are not to stick with the vanilla https://github.com/antirez/linenoise? It seems like it is the most mature alternative and I assume it is the most battle-tested. Why don't we stick with it?
@ggerganov Yes, the best way is to use the original repo. But the main issue is that the vanilla repo doesn't yet support unicode: antirez/linenoise#25. The author did mention that he is working on the feature, but there is no ETA.

Edit: readline.cpp doesn't support unicode either: ericcurtin/readline.cpp#5
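For context on why unicode is the sticking point for these line editors: cursor movement and redraw logic must count codepoints (or, more precisely, display cells), not bytes. A minimal, self-contained illustration of the mismatch — this is not code from either library:

```cpp
#include <cstdio>
#include <string>

// count UTF-8 codepoints: every byte that is not a 10xxxxxx
// continuation byte starts a new codepoint
static size_t utf8_codepoints(const std::string & s) {
    size_t n = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) {
            n++;
        }
    }
    return n;
}

int main() {
    std::string s = "héllo"; // 6 bytes, but only 5 codepoints
    printf("%zu bytes, %zu codepoints\n", s.size(), utf8_codepoints(s));
    // a byte-counting editor would misplace the cursor after the 'é'
}
```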
It's Unix-specific and written in C (too low-level for this use case IMO; we are not writing kernel drivers). The memory management is messy in the C version, with plenty of issues. @yhirose has experience with a few of these though, and might offer an opinion here.
readline.cpp works on Windows.
Do you mean this in general about C programming, or specifically that linenoise has many issues with memory management? From the README, it seems that it is being used by various prominent software; I think this is good enough assurance that it is stable. Either way, my opinion is that we should keep it simple and either stick to linenoise, or completely remove the line-editing functionality from the CLI tool(s) - it does not seem to bring much value, or at least not as much value as to warrant handling this extra dependency.
Specifically linenoise. The recent CVE was no surprise. Its memory management is just messy (I haven't done this recently, but if you ran it through valgrind etc. with various workflows, I'm sure you'd see some interesting things; I did in the past). It works well from the user standpoint; I had to fix up only one or two small things to make it usable here at the time. It's just one of those codebases that's easier to write from scratch. And it was written without Windows in mind (hard to add now, as the codebase assumes Unix everywhere), so it was easier to write something from scratch with cleaner memory management and Windows support. At the time linenoise.cpp was added here, linenoise wasn't actively maintained; that's changed because the talented author got rehired by Redis, so it's worth his while again 😊
Of these two options I would go with the latter. I'm surprised when linenoise is called the simpler implementation, but maybe that's just me. Lines of code is a terrible metric, but readline.cpp is about half the lines of code.