Conversation

@joeldushouyu

Support GELU operation for ggml-hexagon.

@github-actions github-actions bot added the ggml changes relating to the ggml tensor library for machine learning label Dec 11, 2025
@joeldushouyu
Author

Both commit 2a787a6 and commit 83412e0 are fully functional GELU implementations that pass the official ggml test when run with the command below, but there is a significant performance difference between the two that may be worth a discussion.

```
HB=0  ./scripts/snapdragon/adb/run-tool.sh test-backend-ops -b HTP0 -o GELU
```

The GELU code from commit 2a787a6 is a straightforward sigmoid-based GELU implementation, whereas commit 83412e0 is a 7th-order polynomial approximation of GELU generated by the qhcg tool from the Hexagon SDK, with some modifications.
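For reference, here is a scalar sketch of the two formulations being compared. This is illustration only, not the HVX kernels from either commit; the polynomial coefficients are placeholders, and the real qhcg-generated kernel also handles range reduction/saturation.

```c
#include <math.h>

// Sigmoid formulation: the tanh-based GELU rewritten via tanh(z) = 2*sigmoid(2z) - 1,
// so only one exp() per element is needed.
static float gelu_sigmoid_ref(float x) {
    const float k = 0.7978845608f;                       // sqrt(2/pi)
    float z = 2.0f * k * (x + 0.044715f * x * x * x);
    return x / (1.0f + expf(-z));                        // x * sigmoid(z)
}

// Polynomial formulation: a fixed 7th-order polynomial p(x) fitted to GELU on a
// bounded range; Horner evaluation maps well to HVX multiply-accumulate.
static float gelu_poly_ref(float x, const float c[8]) {  // c[] = fitted coefficients (placeholder)
    float p = c[7];
    for (int i = 6; i >= 0; i--) {
        p = p * x + c[i];
    }
    return p;
}
```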

When running on an input of size [4096x4304] (the non-linear input size for the gemma3 vision model), I saw a significant performance difference between the two implementations.

| Data size | GELU-sigmoid | GELU polynomial approximation |
| --- | --- | --- |
| 4096×4304 (unaligned address) | 11772 µs | 6531 µs |
| 4096×4096 (aligned address) | 4988 µs | 4799 µs |

The data above were collected on a Samsung Galaxy S25 Ultra using the test repo I wrote, currently on the refactor-dev branch.

For the timings above, I recorded the longest duration among the 6 threads, as printed to the FARF log.

In addition, when plotting the polynomial approximation with the plot script in my test repo, I did not see much, if any, error between the polynomial-approximation version and the CPU reference.
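For anyone who wants to reproduce the accuracy comparison without the plot script, a minimal scalar check along these lines is enough. The function and buffer names below are placeholders, not code from the test repo.

```c
#include <math.h>

// Compare a dumped HTP output buffer against the erf-based GELU reference and
// return the largest absolute error. `input`/`htp_out` stand in for the tensors
// saved from the test run.
static float gelu_max_abs_err(const float *input, const float *htp_out, int n) {
    float max_err = 0.0f;
    for (int i = 0; i < n; i++) {
        float ref = 0.5f * input[i] * (1.0f + erff(input[i] * 0.70710678f)); // exact GELU
        float err = fabsf(htp_out[i] - ref);
        if (err > max_err) {
            max_err = err;
        }
    }
    return max_err;
}
```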


@joeldushouyu
Author

joeldushouyu commented Dec 11, 2025

After revisiting the polynomial-approximation implementation, I noticed there was a block-prefetch operation in the code which I have commented out in commit 7233999 for a fair comparison. With that in mind, here are the updated results:

| Data size | GELU-sigmoid | GELU polynomial approximation |
| --- | --- | --- |
| 4096×4304 (unaligned) | 11833 µs | 8680 µs |
| 4096×4096 (aligned) | 5006 µs | 7990 µs |

From the new testbench runs, a few things stand out:

  1. The 7th-order polynomial path is actually slower computationally than the sigmoid-based GELU.
  2. The current L2-prefetch logic used in the regular SiLU, GELU, and SwiGLU kernels appears to be either underutilized or triggered too late. This likely explains why the polynomial approximation, with its more aggressive L2 prefetching, outperforms the sigmoid-GELU in the unaligned case (see the prefetch sketch after this list).
  3. The unaligned version of the sigmoid-GELU could potentially benefit from adopting the same unaligned-load strategy used in the polynomial-approximation path.
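To make point (2) concrete, here is the row-ahead prefetch pattern I have in mind, modeled on the l2fetch helper that appears in Hexagon SDK / hexagon_nn examples. This is a sketch, not the existing ggml-hexagon helper; the `gelu_rows` wrapper and the one-row-ahead distance are assumptions worth tuning.

```c
#include <stdint.h>
#include <stddef.h>
#include <hexagon_protos.h>

// l2fetch(Rs, Rtt): Rtt packs {direction, stride, width, height}. This matches the
// helper shape used in Hexagon SDK / hexagon_nn examples; double-check the field
// layout against the Hexagon reference manual for the target architecture.
static inline void l2fetch_block(const void *p, uint32_t stride, uint32_t width, uint32_t height) {
    uint64_t ctl = Q6_P_combine_RR(Q6_R_combine_RlRl(1, stride),      // direction = 1, row stride in bytes
                                   Q6_R_combine_RlRl(width, height)); // row width in bytes, number of rows
    __asm__ __volatile__(" l2fetch(%0,%1) " : : "r"(p), "r"(ctl));
}

// Hypothetical usage in an element-wise kernel: kick off the prefetch for block r+1
// before processing block r, so the L2 fill overlaps with the HVX work instead of
// arriving after the loads have already missed.
static void gelu_rows(float *dst, const float *src, int rows, uint32_t row_bytes) {
    for (int r = 0; r < rows; r++) {
        if (r + 1 < rows) {
            l2fetch_block((const uint8_t *) src + (size_t)(r + 1) * row_bytes,
                          row_bytes, row_bytes, 1);
        }
        // ... run the HVX GELU kernel on row r of src into row r of dst ...
        (void) dst;
    }
}
```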

For the remainder of this PR, I plan to:

  • refine the L2-prefetching strategy in sigmoid-GELU, and
  • apply the polynomial-approximation’s unaligned-load approach to the sigmoid-GELU path

to see whether we can push sigmoid-GELU performance ahead of the polynomial method.

May I kindly ask for your thoughts and suggestions @max-krasnyansky ?

@joeldushouyu
Author

For commit fc2289d, I noticed a significant performance gap between the two sigmoid implementations that handle unaligned input vectors.

The first version treats x as an aligned HVX_Vector pointer and uses Q6_V_valign_VVR to handle any address misalignment, while the second version treats the input as an unaligned HVX_UVector pointer.

On my local device, when used inside the GELU kernel, the first implementation runs in about 6400–6500 µs, whereas the unaligned version takes around 7300–7400 µs for an input of size 4096 × 4304. This seems consistent with point (3) above. It would be good to have someone else verify these numbers, but if they hold, it might be worth applying the same approach to other functions such as hvx_mul_f32 and hvx_add_f32.
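For context, the two load strategies look roughly like this. This is a sketch using a plain copy loop rather than the GELU kernel; VLEN, the function names, and the loop shape are assumptions, and the real kernels handle the leftover tail separately.

```c
#include <stdint.h>
#include <hexagon_types.h>
#include <hexagon_protos.h>

#define VLEN 128                                  // HVX vector length in bytes

// Variant A: round the source down to a vector boundary, do aligned loads, and use
// Q6_V_valign_VVR to splice the valid bytes back into place. Note that this reads up
// to one full vector past the end of src; the tail must be handled separately.
static void copy_valign(float *dst, const float *src, int nvec) {
    const uint8_t *s    = (const uint8_t *) src;
    uint32_t       off  = (uint32_t)(uintptr_t) s & (VLEN - 1);  // misalignment in bytes
    const HVX_Vector *vs = (const HVX_Vector *)(s - off);        // aligned base pointer
    HVX_UVector      *vd = (HVX_UVector *) dst;                  // dst may be unaligned

    HVX_Vector prev = *vs++;
    for (int i = 0; i < nvec; i++) {
        HVX_Vector curr = *vs++;
        vd[i] = Q6_V_valign_VVR(curr, prev, off);                // splice the two aligned loads
        prev = curr;
    }
}

// Variant B: let the hardware deal with misalignment by loading through HVX_UVector.
static void copy_unaligned(float *dst, const float *src, int nvec) {
    const HVX_UVector *vs = (const HVX_UVector *) src;
    HVX_UVector       *vd = (HVX_UVector *) dst;
    for (int i = 0; i < nvec; i++) {
        vd[i] = vs[i];
    }
}
```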
