hexagon: add HTP kernel for GGML_OP_GATED_DELTA_NET #5

Open
wyanzhao wants to merge 16 commits into master from hexagon-gated-delta-net

Conversation

@wyanzhao
Member

wyanzhao commented May 8, 2026

Overview

Add a high-performance HVX kernel for GGML_OP_GATED_DELTA_NET on Hexagon HTP, enabling Gated Delta Net models (e.g. Qwen3.5) to run the recurrence entirely on-device instead of falling back to CPU.

Key optimizations:

  • Fused multi-row kernels (4-row for PP, 8-row for TG): reduces K/Q/gate vector reload overhead by 2–4×
  • Split PP/TG thread functions: prevents 8-row code from polluting PP I-cache
  • VTCM state scratchpad (TG only): DMA state into VTCM for single-cycle access
  • Vectorized gate exp via hvx_exp_f32(): 32 floats/vector vs scalar expf()
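
As a sketch of the last bullet (see also the implementation details below), the gate exponential is applied one full HVX vector (32 f32) at a time instead of per-element expf(). The exact signature of hvx_exp_f32 is assumed here (HVX_Vector in, HVX_Vector out), and tail handling is elided:

    #include <hexagon_types.h>  // HVX_Vector, HVX_UVector

    HVX_Vector hvx_exp_f32(HVX_Vector x);  // project helper; signature assumed

    // Map exp over n gate values, 32 f32 per 128-byte vector (tail elided).
    static void gate_exp(float *g, int n) {
        for (int i = 0; i + 32 <= n; i += 32) {
            HVX_UVector *p = (HVX_UVector *) (g + i);  // unaligned vector access
            *p = hvx_exp_f32(*p);
        }
    }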

Performance vs HTP without this kernel (op falls back to CPU), measured against master @ d77599234:

Model               Device   PP128    TG64
Qwen3.5-0.8B Q4_0   V81      +71.1%   +62.2%
Qwen3.5-0.8B Q8_0   V81      +92.5%   +64.3%
Qwen3.5-4B Q4_0     V81      +52.2%   +51.9%
Qwen3.5-0.8B Q4_0   V75      +50.0%   +34.8%
Qwen3.5-0.8B Q8_0   V75      +45.2%   +21.1%
Qwen3.5-4B Q4_0     V75      +44.5%   +24.0%

Measured on Snapdragon SM8850 (HTP V81) and SM8650 (HTP V75).

Additional information

Implementation details

Gated Delta Net maintains a per-head state matrix S of size S_v × S_v (64×64 for 0.8B, 128×128 for 4B). For each token, every state row S[j] is updated as:

S[j] = gate[j] * S[j]                     # decay
delta[j] = (v[j] - dot(S[j], k)) * beta   # compute delta
S[j] += delta[j] * k                       # rank-1 update
attn[j] = dot(S[j], q) * scale             # query
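
For reference, a minimal scalar C sketch of this per-head update (the function name and signature are illustrative, not the kernel's actual API; the HVX kernels fuse these loops as described below):

    // Scalar reference: one token's update of one head's state.
    // S: S_v x S_v row-major state; k, q, v, gate: length-S_v vectors;
    // beta, scale: per-token scalars; attn: length-S_v output.
    static void gdn_ref_update(float *S, const float *k, const float *v,
                               const float *q, const float *gate,
                               float beta, float scale,
                               float *attn, int S_v) {
        for (int j = 0; j < S_v; j++) {
            float *Sj = S + j * S_v;
            float dot_k = 0.0f;
            for (int i = 0; i < S_v; i++) {
                Sj[i] *= gate[j];        // decay
                dot_k += Sj[i] * k[i];   // dot(S[j], k)
            }
            const float delta = (v[j] - dot_k) * beta;  // compute delta
            float dot_q = 0.0f;
            for (int i = 0; i < S_v; i++) {
                Sj[i] += delta * k[i];   // rank-1 update
                dot_q += Sj[i] * q[i];   // query
            }
            attn[j] = dot_q * scale;
        }
    }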

The kernel processes heads in parallel across HVX threads. PP (n_tokens > 1) uses 4-row fused kernels (gdn_mul_dot4_f32, gdn_mul_scalar_dot4_f32, gdn_add_scaled_dot4_f32); TG (n_tokens == 1) uses 8-row fused kernels (gdn_mul_dot8_f32, …). Each fused kernel performs the gate-multiply, K-dot, rank-1 update, and Q-dot across multiple state rows in a single vector pass, amortizing the K/Q/gate load cost.

Inner loops use full-vector unaligned stores (hvx_vmemu(dst + i*epv) = out) on the hot path, with a masked partial store for any trailing < 128-byte chunk; this avoids the vlalign rotation and dual-predicate path of hvx_vec_store_u(..., 128, …). The supported range is S_v ≤ HTP_GDN_MAX_SV (= 128); larger states fall back to CPU.
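
A sketch of that store pattern (epv = 32 f32 elements per 128-byte HVX vector; hvx_vmemu is the project helper quoted above, while compute_row_chunk and hvx_store_partial are hypothetical stand-ins for the fused per-chunk math and the masked tail store):

    #include <hexagon_types.h>  // HVX_Vector; hvx_vmemu comes from the backend's HVX utils

    HVX_Vector compute_row_chunk(int i);                        // hypothetical
    void hvx_store_partial(float *p, HVX_Vector v, int nbytes); // hypothetical

    void gdn_store_row(float *dst, int n) {
        const int epv  = 128 / (int) sizeof(float);  // 32 f32 per vector
        const int full = n / epv;
        for (int i = 0; i < full; i++) {
            // hot path: direct full-vector unaligned store
            hvx_vmemu(dst + i * epv) = compute_row_chunk(i);
        }
        const int rem = n - full * epv;
        if (rem) {
            // masked partial store for the trailing < 128-byte chunk,
            // instead of the vlalign + dual-predicate path
            hvx_store_partial(dst + full * epv, compute_row_chunk(full),
                              rem * (int) sizeof(float));
        }
    }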

For TG mode, the state matrix is staged into VTCM (8MB on V81) at the start of each chunk and copied back at the end:

  • S_v=64: 16KB/head, 8 threads × 16KB = 128KB fits
  • S_v=128: 64KB/head, 8 threads × 64KB = 512KB fits
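
The per-head sizes follow directly from S_v² f32 elements; a minimal helper for the staging budget (illustrative, not the kernel's API):

    #include <stddef.h>

    // VTCM bytes needed to stage TG state: one S_v x S_v f32 matrix
    // per thread, with the 8-thread split described above.
    static size_t gdn_vtcm_stage_bytes(int S_v, int n_threads) {
        size_t per_head = (size_t) S_v * (size_t) S_v * sizeof(float);
        return (size_t) n_threads * per_head;
        // S_v=64: 8 * 16 KB = 128 KB;  S_v=128: 8 * 64 KB = 512 KB
    }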

Benchmark results

All results measured against master @ d77599234. The PR build includes the full-vector store and vectorized-tail optimizations applied after review feedback (hvx_vmemu direct full-vector unaligned stores; mask + partial store for tail elements).

Snapdragon 8 Elite Gen 5 (SM8850, HTP V81)

Config: --device HTP0 --mmap 0 --poll 1000 -t 6 --cpu-mask 0xfc --cpu-strict 1 --ubatch-size 256 -fa 1 -ngl 99

Qwen3.5-0.8B Q4_0 (S_v=64, NDEV=1)

Test    master (t/s)   This PR (t/s)   Improvement
pp64    195.65         299.12          +52.9%
pp128   250.38         428.38          +71.1%
pp256   286.72         526.02          +83.5%
pp512   274.38         491.94          +79.3%
tg64    11.88          19.26           +62.2%
tg128   11.83          19.16           +61.9%

Qwen3.5-0.8B Q8_0 (S_v=64, NDEV=1)

Test    master (t/s)   This PR (t/s)   Improvement
pp64    171.75         294.99          +71.8%
pp128   229.61         442.10          +92.5%
pp256   266.20         559.85          +110.3%
pp512   259.10         529.17          +104.2%
tg64    10.94          17.98           +64.3%
tg128   11.27          17.94           +59.2%

Qwen3.5-4B Q4_0 (S_v=128, NDEV=1)

Test    master (t/s)   This PR (t/s)   Improvement
pp64    48.58          66.49           +36.9%
pp128   69.22          105.38          +52.2%
pp256   74.53          112.86          +51.4%
pp512   68.37          107.41          +57.1%
tg64    4.51           6.85            +51.9%
tg128   4.50           6.85            +52.0%

Snapdragon 8 Gen 3 (SM8650, HTP V75)

Config: --device HTP0 --mmap 0 --poll 1000 -t 6 --cpu-mask 0xfc --cpu-strict 1 --ubatch-size 256 -fa 1 -ngl 99

Qwen3.5-0.8B Q4_0 (S_v=64, NDEV=1)

Test    master (t/s)   This PR (t/s)   Improvement
pp64    67.00          95.22           +42.1%
pp128   80.37          120.60          +50.0%
pp256   92.76          134.14          +44.6%
pp512   91.09          133.03          +46.1%
tg64    7.58           10.22           +34.8%
tg128   8.60           10.26           +19.3%

Qwen3.5-0.8B Q8_0 (S_v=64, NDEV=1)

Test    master (t/s)   This PR (t/s)   Improvement
pp64    69.38          99.66           +43.7%
pp128   83.22          120.83          +45.2%
pp256   95.01          154.31          +62.4%
pp512   93.33          150.58          +61.4%
tg64    8.21           9.94            +21.1%
tg128   8.34           10.02           +20.2%

Qwen3.5-4B Q4_0 (S_v=128, NDEV=1)

Test    master (t/s)   This PR (t/s)   Improvement
pp64    21.54          29.00           +34.6%
pp128   27.59          39.87           +44.5%
pp256   28.44          40.15           +41.2%
pp512   28.23          39.85           +41.2%
tg64    2.82           3.50            +24.0%
tg128   3.04           3.65            +19.9%

Correctness test

Verified with test-backend-ops on SM8850 (HTP V81), comparing HTP output against CPU reference (NMSE threshold 1e-7):

adb shell "cd /data/local/tmp/llama.cpp && \
  LD_LIBRARY_PATH=./lib ADSP_LIBRARY_PATH=./lib \
  ./bin/test-backend-ops test -b HTP0 -o GATED_DELTA_NET"
Backend 1/2: HTP0
  Device description: Hexagon
  Device memory: 0 MB (0 MB free)

  GATED_DELTA_NET(type=f32,head_count=32,head_size=128,n_seq_tokens=1,n_seqs=1,v_repeat=1,permuted=0,kda=0): OK
  GATED_DELTA_NET(type=f32,head_count=32,head_size=16,n_seq_tokens=1,n_seqs=1,v_repeat=1,permuted=0,kda=0): OK
  GATED_DELTA_NET(type=f32,head_count=32,head_size=16,n_seq_tokens=1,n_seqs=1,v_repeat=1,permuted=1,kda=1): OK
  GATED_DELTA_NET(type=f32,head_count=32,head_size=16,n_seq_tokens=1,n_seqs=1,v_repeat=1,permuted=0,kda=1): OK
  GATED_DELTA_NET(type=f32,head_count=16,head_size=64,n_seq_tokens=1,n_seqs=2,v_repeat=1,permuted=0,kda=0): OK
  GATED_DELTA_NET(type=f32,head_count=4,head_size=64,n_seq_tokens=4,n_seqs=1,v_repeat=1,permuted=0,kda=0): OK
  GATED_DELTA_NET(type=f32,head_count=4,head_size=64,n_seq_tokens=4,n_seqs=2,v_repeat=1,permuted=0,kda=0): OK
  GATED_DELTA_NET(type=f32,head_count=8,head_size=32,n_seq_tokens=4,n_seqs=2,v_repeat=2,permuted=0,kda=0): OK
  GATED_DELTA_NET(type=f32,head_count=4,head_size=64,n_seq_tokens=4,n_seqs=2,v_repeat=1,permuted=1,kda=0): OK
  GATED_DELTA_NET(type=f32,head_count=4,head_size=64,n_seq_tokens=4,n_seqs=1,v_repeat=1,permuted=1,kda=0): OK
  GATED_DELTA_NET(type=f32,head_count=4,head_size=64,n_seq_tokens=1,n_seqs=1,v_repeat=1,permuted=0,kda=1): OK
  GATED_DELTA_NET(type=f32,head_count=4,head_size=64,n_seq_tokens=1,n_seqs=2,v_repeat=1,permuted=0,kda=1): OK
  GATED_DELTA_NET(type=f32,head_count=4,head_size=16,n_seq_tokens=1,n_seqs=2,v_repeat=1,permuted=0,kda=1): OK
  GATED_DELTA_NET(type=f32,head_count=4,head_size=32,n_seq_tokens=4,n_seqs=1,v_repeat=1,permuted=0,kda=1): OK
  GATED_DELTA_NET(type=f32,head_count=4,head_size=64,n_seq_tokens=4,n_seqs=2,v_repeat=1,permuted=0,kda=1): OK
  GATED_DELTA_NET(type=f32,head_count=8,head_size=32,n_seq_tokens=4,n_seqs=2,v_repeat=2,permuted=0,kda=1): OK
  GATED_DELTA_NET(type=f32,head_count=4,head_size=64,n_seq_tokens=4,n_seqs=2,v_repeat=1,permuted=1,kda=1): OK
  GATED_DELTA_NET(type=f32,head_count=4,head_size=16,n_seq_tokens=4,n_seqs=2,v_repeat=1,permuted=1,kda=1): OK
  18/18 tests passed
  Backend HTP0: OK
2/2 backends passed
OK

How to reproduce

  1. Build:

    cmake --preset arm64-android-snapdragon-release -B build-snapdragon
    cmake --build build-snapdragon -j$(nproc)
    cmake --install build-snapdragon --prefix pkg-snapdragon/llama.cpp
  2. Push and run:

    adb push pkg-snapdragon/llama.cpp /data/local/tmp/
    adb push <path-to-gguf>/Qwen3.5-0.8B-Q8_0.gguf /data/local/tmp/gguf/
    
    adb shell "
      cd /data/local/tmp/llama.cpp;
      LD_LIBRARY_PATH=./lib
      ADSP_LIBRARY_PATH=./lib
        ./bin/llama-bench --device HTP0 --mmap 0 \
            -m /data/local/tmp/gguf/Qwen3.5-0.8B-Q8_0.gguf \
            --poll 1000 -t 6 --cpu-mask 0xfc --cpu-strict 1 \
            --ubatch-size 256 -fa 1 -ngl 99 \
            -p 128 -n 128
    "

    For 4B Q8_0 (requires NDEV=2 to fit within the per-session 3.5 GB limit):

    adb shell "
      cd /data/local/tmp/llama.cpp;
      LD_LIBRARY_PATH=./lib
      ADSP_LIBRARY_PATH=./lib
      GGML_HEXAGON_NDEV=2
        ./bin/llama-bench --device HTP0,HTP1 --mmap 0 \
            -m /data/local/tmp/gguf/Qwen3.5-4B-Q8_0.gguf \
            --poll 1000 -t 6 --cpu-mask 0xfc --cpu-strict 1 \
            --ubatch-size 256 -fa 1 -ngl 99 \
            -p 128 -n 128
    "

    Note: with NDEV=2 and 4B Q8_0, tg128 runs at ~6 t/s and takes ~2 minutes to complete. Allow adequate time.

Key env vars: GGML_HEXAGON_NDEV=2 (for >3.5GB models).

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: Yes, used Claude to generate the initial version, then reviewed, tested, and further optimized manually.

@wyanzhao wyanzhao requested a review from max-krasnyansky May 8, 2026 11:19
@wyanzhao wyanzhao self-assigned this May 8, 2026