This is a demo of how to use vLLM to serve LLMs. That means you send requests containing your prompt to an LLM that you host on your own system.
- Load the Python module (if you're on a cluster like athena): `module load python/3.11`
- Install uv: `curl -LsSf https://astral.sh/uv/install.sh | sh`
- Create a virtualenv: `uv venv --python 3.11 --seed`
- Activate the virtualenv: `source .venv/bin/activate`
- Install requests: `uv pip install requests`
- Install the necessary components: `export UV_HTTP_TIMEOUT=600` (just in case), then `uv pip install vllm --torch-backend=auto`
- Serve your desired language model. If you're on athena, you can use the llm_serve.sh SBATCH script: run `sbatch llm_serve.sh`, changing `Qwen/Qwen3-30B-A3B-Instruct-2507` to your preferred LLM. Pay attention to the output: it should say "Application startup complete." when the server is ready to accept prompts (there is a readiness-check sketch after this list). Also note the node and port the model is hosted on; both are printed in the log. On athena, check which node your llm_serve job is running on. Your URI (on athena or another cluster) is `http://node:8000/v1/completions`, where `node` is something like `athena534`. If you're running on a regular server or locally, it's `http://localhost:8000/v1/completions`.
- Now you can prompt: run `python main.py <uri> <model>`, where `<uri>` is the URI from the serving step and `<model>` is whatever you changed `Qwen/Qwen3-30B-A3B-Instruct-2507` to in that step. (A sketch of roughly what such a request looks like under the hood follows this list.)
- It will ask you to input a prompt; type your prompt.
- Behold, an answer from the LLM!
- There are a LOT of other parameters you can use with vLLM besides temperature and max tokens. See the link in the comment of llm_request.py (a couple of them appear in the example payload after this list).
- This is a basic implementation, but you can do much more with vLLM; I highly recommend checking the documentation. (One example, the chat endpoint, is sketched after this list.)
- Remember to cancel serve jobs when you aren't using them!!! I will not be responsible if you leave one running.
- Multi-GPU inference might be possible (vLLM has a `--tensor-parallel-size` option, but I haven't tested it); it would make things fast.
- When in doubt about the main.py CLI, `python main.py -h` should help.
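
Instead of watching the log for "Application startup complete.", you can poll the server from Python until it answers. This is a minimal sketch, not part of this repo: it assumes the default port 8000 and uses the `/v1/models` endpoint of vLLM's OpenAI-compatible server; the helper name, script name, and polling interval are my own choices.

```python
# Minimal readiness check (a sketch, not part of this repo).
# Assumes the default port 8000 and the OpenAI-compatible /v1/models endpoint.
import sys
import time

import requests


def wait_until_ready(base_url: str, timeout_s: int = 600) -> bool:
    """Return True once GET {base_url}/v1/models answers with HTTP 200."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            if requests.get(f"{base_url}/v1/models", timeout=5).status_code == 200:
                return True
        except requests.ConnectionError:
            pass  # server still starting up
        time.sleep(10)
    return False


if __name__ == "__main__":
    # e.g. python check_ready.py http://athena534:8000
    base = sys.argv[1] if len(sys.argv) > 1 else "http://localhost:8000"
    print("ready" if wait_until_ready(base) else "timed out")
```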
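
For reference, here is roughly what a completion request against the `/v1/completions` endpoint looks like with the requests library. This is a hedged sketch of the API call, not necessarily how main.py or llm_request.py are written; the prompt and sampling values are example choices.

```python
# Sketch of a raw completion request with the requests library.
# Adjust the URI and model to match your deployment.
import requests

uri = "http://localhost:8000/v1/completions"   # or http://athena534:8000/v1/completions on the cluster
model = "Qwen/Qwen3-30B-A3B-Instruct-2507"     # whatever model you served

payload = {
    "model": model,
    "prompt": "Explain what vLLM is in one sentence.",
    "max_tokens": 128,     # cap on generated tokens
    "temperature": 0.7,    # sampling temperature
    "top_p": 0.95,         # one of the many other sampling parameters the endpoint accepts
    "stop": ["\n\n"],      # optional stop sequences
}

response = requests.post(uri, json=payload, timeout=600)
response.raise_for_status()
print(response.json()["choices"][0]["text"])
```

`temperature` and `max_tokens` here are the same knobs referred to above as temp and max tokens; the full list of accepted parameters is in the vLLM documentation linked from llm_request.py.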
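
vLLM's OpenAI-compatible server also exposes a chat-style endpoint at `/v1/chat/completions`, which is often more natural for instruct models. Again a sketch with example values only, not something this repo's scripts necessarily use.

```python
# Sketch of a chat-style request to /v1/chat/completions, which vLLM also serves.
import requests

uri = "http://localhost:8000/v1/chat/completions"
model = "Qwen/Qwen3-30B-A3B-Instruct-2507"

payload = {
    "model": model,
    "messages": [
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Give me one tip for serving LLMs on a cluster."},
    ],
    "max_tokens": 128,
    "temperature": 0.7,
}

response = requests.post(uri, json=payload, timeout=600)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])
```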
For questions, comments, or suggestions, email me at [email protected]
Happy coding!