New llama-run #17554
Conversation
(force-pushed 94a98cf → 500a90c)
@angt @ggerganov @CISC @ngxson PTAL
Considering these points:

So, I think it's better to repurpose one of the 3 binaries mentioned in my first point instead.
(force-pushed f839cfb → 5b0c817)
I think what's better to do at this point is to improve the usability of
(force-pushed 5b0c817 → fddbeed)
I'm on board with the idea, but can we push ahead with this PR and plan the refactor for a later stage? To be honest, I think you're the best person to handle the refactoring. You know exactly how you want the architecture to look, and it's hard for others to guess those specifics. If someone else tries, we risk getting stuck in a loop of corrections, so it's probably more efficient if you drive that part.
For the most part, the server code is already refactored to be reused in other tools/examples. I'm not sure what kind of refactoring you're talking about.
Moving the relevant code to llama-cli, llama-run, etc. or some other place.
I think your

This way, you simply make

In the future, we can make server a static (or maybe shared) lib target and reuse it in both server and run.
(force-pushed 0e3f326 → f7a7775)
(force-pushed f347cb1 → 2619c11)
People should give this a shot, the UX is quite neat: |
(force-pushed 2619c11 → f9ae221)
I've received mixed feedback in the past regarding granular file separation versus consolidation, so I am unsure of the preferred direction here. run-chat.cpp and run.cpp seem like a reasonable split between chat-focused activities and the other code required to get the tool running.
(force-pushed f9ae221 → a449065)
If somebody could restart the failing builds I'd appreciate it; I don't have any sort of maintainer access anymore, so I'm as limited as a random contributor.
The failed builds are not related to the server, we can ignore them anyway. Besides, I usually open a mirror PR on my fork to skip the long CI queue on the main repo; you can give that a try.
@ericcurtin I'm about to merge a refactoring that will greatly reduce the complexity of using server code as a CLI. Here is an example of a very simple CLI implementation via the OAI-compat schema (no HTTP server, the schema is sent directly to server_context): https://gist.github.com/ngxson/f3e18888e88d87184f785bf0d4458bda

I think I can go ahead and iterate from the draft version above (so it will supersede your PR). I think it would be useful to bring

So I'm wondering if it's OK with you if I firstly release a CLI without

Otherwise, you can also continue the current PR with the new server API (again, based on my gist above).
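The gist itself isn't reproduced here, but the core idea is easy to sketch. Below is a minimal, hypothetical illustration of the approach described above, where `server_context` and `handle_chat_completion` are stand-in names rather than the actual refactored API: the OAI-compat JSON schema is built in-process and handed straight to the server machinery, with no socket or HTTP parsing involved.

```cpp
#include <nlohmann/json.hpp> // vendored in llama.cpp
#include <cstdio>
#include <string>

using json = nlohmann::ordered_json;

// stand-in for the refactored server API from the gist (hypothetical)
struct server_context {
    json handle_chat_completion(const json & request) {
        (void) request;
        // a real implementation would tokenize, decode and sample here;
        // this stub just returns a canned OAI-compat response
        return {
            {"choices", json::array({
                {{"message", {{"role", "assistant"}, {"content", "hi!"}}}},
            })},
        };
    }
};

// one chat turn: the same JSON schema a client would POST to
// /v1/chat/completions, but passed directly to the server context
std::string run_one_turn(server_context & ctx, const std::string & user_msg) {
    json request = {
        {"messages", json::array({
            {{"role", "user"}, {"content", user_msg}},
        })},
        {"stream", false},
    };

    json response = ctx.handle_chat_completion(request);
    return response["choices"][0]["message"]["content"].get<std::string>();
}

int main() {
    server_context ctx;
    printf("%s\n", run_one_turn(ctx, "hello").c_str());
}
```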
@ngxson I'll try and apply your gist here, might make the build green at least... |
(force-pushed d522e11 → cded4e3)
@ngxson changes made PTAL |
ngxson left a comment:
On a high level this looks good (it is mergeable if CI is green).

The issue regarding logs should be addressed if you can, but it's not very important; otherwise I will address it when I move the llama-run code to llama-cli. There will be some additional work to allow features like multimodal or speculative decoding to work without having to go through the OAI API (which is inefficient for large inputs).
tools/run/run.cpp (outdated):

```cpp
    return 1;
    // Set default parallel processing for better performance
    // This is the same logic as in server.cpp
    if (params.n_parallel == 1 && params.kv_unified == false && !params.has_speculative()) {
```
We don't need n_parallel here, this should be removed.
tools/run/run.cpp (outdated):

```cpp
    // Set minimal output by default for llama-run (suppress INFO/WARN logs)
    // This must be set before parsing arguments because Docker model resolution
    // triggers logging during the parse phase
    common_log_set_verbosity_thold(-1);
```
This seems to suppress all error logs, including the logs shown when there is an invalid argument or the -h option is used.
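One possible direction, sketched under assumptions: lower the verbosity threshold only after argument parsing succeeds, so parse errors and -h output still print. common_params_parse, common_log_set_verbosity_thold and LLAMA_EXAMPLE_SERVER are existing llama.cpp helpers, but the placement is illustrative, and there is a trade-off: this would reintroduce the INFO logs emitted during Docker model resolution in the parse phase, which would then need separate handling.

```cpp
#include "arg.h"     // common_params_parse
#include "common.h"  // common_params
#include "log.h"     // common_log_set_verbosity_thold

int main(int argc, char ** argv) {
    common_params params;

    // parse first: invalid arguments and -h/--help print at normal verbosity
    if (!common_params_parse(argc, argv, params, LLAMA_EXAMPLE_SERVER)) {
        return 1;
    }

    // only now silence INFO/WARN for the interactive session
    common_log_set_verbosity_thold(-1);

    // ... rest of llama-run setup ...
    return 0;
}
```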
tools/run/CMakeLists.txt (outdated):

```cmake
if (NOT LLAMA_HTTPLIB)
    message(FATAL_ERROR "LLAMA_HTTPLIB is OFF, cannot build llama-run. Hint: to skip building run, set -DLLAMA_BUILD_RUN=OFF")
endif()
```
unused
tools/run/CMakeLists.txt (outdated):

```cmake
${SERVER_DIR}/server-http.cpp
${SERVER_DIR}/server-http.h
```
unused source files (server-http)
If readline.cpp is an external dependency, it's probably better to explicitly include it in scripts/sync_vendor.py and place it under the vendor directory instead.
- Added readline.cpp include
- Created run_chat_mode():
  - Initializes readline with command history
  - Maintains conversation history
  - Applies chat templates to format messages
  - Submits completion tasks to the server queue
  - Displays assistant responses interactively

Signed-off-by: Eric Curtin <[email protected]>
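A rough sketch of that run_chat_mode() flow, with every name hypothetical — the read_line, apply_chat_template, and submit_completion stand-ins below are not the readline.cpp or server APIs; only the loop structure follows the commit message:

```cpp
#include <iostream>
#include <string>
#include <vector>

struct chat_msg { std::string role, content; };

// stand-in for readline.cpp: a bare getline, no editing or history.
// the real library would add arrow-key editing and command history.
static bool read_line(const char * prompt, std::string & out) {
    std::cout << prompt << std::flush;
    return static_cast<bool>(std::getline(std::cin, out));
}

// stand-in chat template: a ChatML-like rendering of the conversation.
// the real code would apply the model's own template.
static std::string apply_chat_template(const std::vector<chat_msg> & history) {
    std::string prompt;
    for (const auto & msg : history) {
        prompt += "<|im_start|>" + msg.role + "\n" + msg.content + "<|im_end|>\n";
    }
    prompt += "<|im_start|>assistant\n";
    return prompt;
}

// stand-in for submitting a completion task to the server queue and
// blocking until the response arrives; here it just echoes a summary.
static std::string submit_completion(const std::string & prompt) {
    return "(model reply to " + std::to_string(prompt.size()) + "-char prompt)";
}

int main() {
    std::vector<chat_msg> history; // conversation history

    std::string line;
    while (read_line("> ", line)) {
        history.push_back({"user", line});

        const std::string prompt = apply_chat_template(history);
        const std::string reply  = submit_completion(prompt);

        std::cout << reply << "\n"; // display the assistant response
        history.push_back({"assistant", reply});
    }
}
```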
(force-pushed cded4e3 → 79069a0)
FYI, I recently have less time to work on the CLI (because of the Mistral release), but I will have a look and try to merge this later this week.
```python
# readline.cpp: multi-file library for interactive line editing
# sync manually - no upstream repository yet
# located in vendor/readline.cpp/
```
> no upstream repository yet

And in the README:

> - [readline.cpp](https://github.com/ericcurtin/readline.cpp) - C++ library that provides readline-like line editing capabilities, used by `llama-run` - MIT License

Seems contradictory.
I think

Btw @ggerganov, just want a confirmation that you are OK with adding
I wonder what the reasons are not to stick with the vanilla https://github.com/antirez/linenoise? It seems like it is the most mature alternative and I assume it is the most battle-tested. Why don't we stick with it?
@ggerganov Yes, the best way is to use the original repo. But the main issue is that the vanilla repo doesn't yet support unicode: antirez/linenoise#25. The author did mention that he is working on the feature, but there is no ETA.

Edit: readline.cpp doesn't support unicode either: ericcurtin/readline.cpp#5
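For context on why unicode is the sticking point for these line editors: cursor movement and redraw logic must count codepoints (or, more precisely, display cells), not bytes. A minimal, self-contained illustration of the mismatch — this is not code from either library:

```cpp
#include <cstdio>
#include <string>

// count UTF-8 codepoints: every byte that is not a 10xxxxxx
// continuation byte starts a new codepoint
static size_t utf8_codepoints(const std::string & s) {
    size_t n = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) {
            n++;
        }
    }
    return n;
}

int main() {
    std::string s = "héllo"; // 6 bytes, but only 5 codepoints
    printf("%zu bytes, %zu codepoints\n", s.size(), utf8_codepoints(s));
    // a byte-counting editor would misplace the cursor after the 'é'
}
```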
It's Unix-specific and written in C (too low-level for this use case IMO; we are not writing kernel drivers). The memory management is messy in the C version, with plenty of issues. @yhirose has experience with a few of these though, and might offer an opinion here.
readline.cpp works on Windows.
Do you mean this in general about C programming, or specifically that linenoise has many issues with memory management? From the README, it seems that it is being used by various prominent software; I think this is good enough assurance that it is stable. Either way, my opinion is that we should keep it simple and either stick to linenoise, or completely remove the line-editing functionality from the CLI tool(s) - it does not seem to bring much value, or at least not as much value as to warrant handling this extra dependency.
Specifically linenoise. The recent CVE was no surprise. Its memory management is just messy (I haven't done this recently, but if you ran it through valgrind etc. with various workflows, I'm sure you'd see some interesting things; I did in the past). It works well from the user standpoint; I had to fix up only one or two small things to make it usable here at the time. It's just one of those codebases that's easier to write from scratch. And it was written without Windows in mind (hard to add now, as the codebase assumes Unix everywhere), so it was easier to write something from scratch with cleaner memory management and Windows support. At the time linenoise.cpp was added here, linenoise wasn't actively maintained; that's changed because the talented author got rehired by Redis, so it's worth his while again 😊
Of these two options I would go with the latter. I'm surprised when linenoise is called the simpler implementation, but maybe that's just me. Lines of code is a terrible metric, but readline.cpp is about half the lines of code.