SYCL: Anyone using TrueNas Apps to launch a compose version of any of the images? #17849
Replies: 11 comments
-
Adding some logs. The prompt was working and then started outputting gibberish.
prompt + response:
prompt:
It went on, but I deleted over 80% of the text after this point.
logs:
-
@the-bort-the
FP32 has less accuracy impact than FP16, but in fact both have similar performance.
Thank you!
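For reference, in llama.cpp's SYCL Docker builds the FP16 path is a build-time option and FP32 is the default. A minimal compose sketch of how that toggle is commonly passed when building locally, assuming the upstream .devops/intel.Dockerfile and its GGML_SYCL_F16 build arg (names follow docs/backend/SYCL.md and may change between versions):

```yaml
services:
  llama-server:                       # hypothetical service name
    build:
      context: ./llama.cpp            # assumes a local checkout of ggml-org/llama.cpp
      dockerfile: .devops/intel.Dockerfile
      target: server                  # the llama-server image target
      args:
        GGML_SYCL_F16: "OFF"          # "OFF" (the default) keeps compute in FP32; "ON" enables FP16
```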
-
I can see
-
@the-bort-the
-
@NeoZhangJianyu I made the following change and built the image locally anyway. I'm now pulling this image in via the TrueNAS docker-compose UI, and it spins up with the logs below. Is there any other way to confirm that FP32 is truly enabled, or anything else I can do? I'm seeing GPU use, but also heavy CPU use.
Logs:
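Not an authoritative diagnosis, but two things that often matter for the GPU vs. CPU split in this kind of setup are whether the render node is passed through to the container and how many layers llama-server is asked to offload (-ngl / --n-gpu-layers); layers left on the CPU show up as heavy CPU use. A sketch of the relevant compose fragment, where the service name, image tag, and host paths are all assumptions:

```yaml
services:
  llama-server:                        # hypothetical service name
    image: llama-cpp-sycl-f32:local    # hypothetical tag for the locally built image
    devices:
      - /dev/dri:/dev/dri              # the render nodes must be visible inside the container for SYCL
    group_add:
      - render                         # or the numeric GID of the host's render group
    volumes:
      - /mnt/tank/models:/models       # hypothetical TrueNAS dataset holding the GGUF files
    # Assuming the image's entrypoint is llama-server, these are passed as its arguments.
    command: >
      -m /models/model.gguf
      --n-gpu-layers 99
      --host 0.0.0.0
      --port 8080
```

The server's startup log should also report how many layers were offloaded to the GPU, which is another quick sanity check.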
-
@the-bort-the I'm not familiar with llama-server. Thank you!
-
@NeoZhangJianyu - Hi! The other models seem to tax only the CPU; it reaches 35-50%. This happens both with the image I built and with the image I originally obtained from the repo. I have tried the models listed below, obtained from Hugging Face. Attached are CLI results and metrics: llama_cli_chicago_prompts.txt
Also, I've since disabled this in docker-compose:
Current docker-compose:
-
@the-bort-the
Thank you!
-
@NeoZhangJianyu I think I'm just trying to understand how best to use my Intel GPU with ggml, because there isn't official support through Ollama. ollama/ollama#11160 (comment)
-
@the-bort-the The Ollama build enhanced by the SYCL backend is still using an old version of llama.cpp, since there are no longer many users. Anyway, try the new LLMs with llama.cpp first. Thank you!
-
You mean continue using this project because it's actively being developed, yes? I feel this is the best shot at getting things working and making use of the hardware I currently have. Currently the version of Ollama in the official Ollama project is stuck on 0.9.3, which is pretty old. Perhaps I should just look at AMD or NVIDIA and ditch Intel hardware 😄
-
Currently running ghcr.io/ggml-org/llama.cpp:server-intel with some success within TrueNAS Apps. Wondering if anyone is doing the same and has any progress to share (compose settings, models to use, etc.). I have tried several models from Hugging Face and they seem to work for 1-8 prompts before needing a new chat session or a restart. I've had more success with the packaged UI than with trying to force it to use Open WebUI. I'm not sure whether I have a bug or not, but I'm just not seeing things live before needing to deploy the container once more.
intel-gpu-top shows Blitter activity during prompts.
Current image sha: sha256:c32e17454cc730656a7c245a67c8eb06dc65ce8e39f83969df1753150fff3cb4
GPU: Intel Arc A750 Graphics
CPU: AMD Ryzen 5 5600XT 6-Core Processor
OS: TrueNAS v25.04.0 Community Edition
Models tried:
Compose.yml:
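The compose file attached to the original post isn't reproduced in this capture. Purely as an illustration of the setup being described, a minimal service definition for the server-intel image might look like the sketch below; the service name, port, model path, and device mapping are assumptions, not the poster's actual configuration:

```yaml
services:
  llama-cpp:                                    # hypothetical service name
    image: ghcr.io/ggml-org/llama.cpp:server-intel
    restart: unless-stopped
    ports:
      - "8080:8080"                             # llama-server's API and built-in web UI share this port
    devices:
      - /dev/dri:/dev/dri                       # pass the Arc's render node through to the container
    volumes:
      - /mnt/tank/models:/models                # hypothetical TrueNAS dataset holding GGUF files
    # Assuming the image's entrypoint is llama-server, these are passed as its arguments.
    command: >
      -m /models/model.gguf
      --n-gpu-layers 99
      --host 0.0.0.0
      --port 8080
```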