hexagon: add HTP kernel for GGML_OP_GATED_DELTA_NET #5

Open
wyanzhao wants to merge 16 commits into master from hexagon-gated-delta-net

Conversation

@wyanzhao
Member

wyanzhao commented May 8, 2026

Overview

Add a high-performance HVX kernel for GGML_OP_GATED_DELTA_NET on Hexagon HTP, enabling Gated Delta Net models (e.g. Qwen3.5) to run the recurrence entirely on-device instead of falling back to CPU.

Key optimizations:

  • Fused multi-row kernels (4-row for PP, 8-row for TG): reduces K/Q/gate vector reload overhead by 2–4×
  • Split PP/TG thread functions: prevents 8-row code from polluting PP I-cache
  • VTCM state scratchpad (TG only): DMA state into VTCM for single-cycle access
  • Vectorized gate exp via hvx_exp_f32(): 32 floats/vector vs scalar expf()
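
As a sketch of the last bullet (see also the implementation details below), the gate exponential is applied one full HVX vector (32 f32) at a time instead of per-element expf(). The exact signature of hvx_exp_f32 is assumed here (HVX_Vector in, HVX_Vector out), and tail handling is elided:

    #include <hexagon_types.h>  // HVX_Vector, HVX_UVector

    HVX_Vector hvx_exp_f32(HVX_Vector x);  // project helper; signature assumed

    // Map exp over n gate values, 32 f32 per 128-byte vector (tail elided).
    static void gate_exp(float *g, int n) {
        for (int i = 0; i + 32 <= n; i += 32) {
            HVX_UVector *p = (HVX_UVector *) (g + i);  // unaligned vector access
            *p = hvx_exp_f32(*p);
        }
    }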

Performance vs HTP without this kernel (op falls back to CPU), measured against master @ d77599234:

Model               Device   PP128    TG64
Qwen3.5-0.8B Q4_0   V81      +71.1%   +62.2%
Qwen3.5-0.8B Q8_0   V81      +92.5%   +64.3%
Qwen3.5-4B Q4_0     V81      +52.2%   +51.9%
Qwen3.5-0.8B Q4_0   V75      +50.0%   +34.8%
Qwen3.5-0.8B Q8_0   V75      +45.2%   +21.1%
Qwen3.5-4B Q4_0     V75      +44.5%   +24.0%

Measured on Snapdragon SM8850 (HTP V81) and SM8650 (HTP V75).

Additional information

Implementation details

Gated Delta Net maintains a per-head state matrix S of size S_v × S_v (64×64 for 0.8B, 128×128 for 4B). For each token, every state row S[j] is updated as:

S[j] = gate[j] * S[j]                     # decay
delta[j] = (v[j] - dot(S[j], k)) * beta   # compute delta
S[j] += delta[j] * k                       # rank-1 update
attn[j] = dot(S[j], q) * scale             # query
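
For reference, a minimal scalar C sketch of this per-head update (the function name and signature are illustrative, not the kernel's actual API; the HVX kernels fuse these loops as described below):

    // Scalar reference: one token's update of one head's state.
    // S: S_v x S_v row-major state; k, q, v, gate: length-S_v vectors;
    // beta, scale: per-token scalars; attn: length-S_v output.
    static void gdn_ref_update(float *S, const float *k, const float *v,
                               const float *q, const float *gate,
                               float beta, float scale,
                               float *attn, int S_v) {
        for (int j = 0; j < S_v; j++) {
            float *Sj = S + j * S_v;
            float dot_k = 0.0f;
            for (int i = 0; i < S_v; i++) {
                Sj[i] *= gate[j];        // decay
                dot_k += Sj[i] * k[i];   // dot(S[j], k)
            }
            const float delta = (v[j] - dot_k) * beta;  // compute delta
            float dot_q = 0.0f;
            for (int i = 0; i < S_v; i++) {
                Sj[i] += delta * k[i];   // rank-1 update
                dot_q += Sj[i] * q[i];   // query
            }
            attn[j] = dot_q * scale;
        }
    }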

The kernel processes heads in parallel across HVX threads. PP (n_tokens > 1) uses 4-row fused kernels (gdn_mul_dot4_f32, gdn_mul_scalar_dot4_f32, gdn_add_scaled_dot4_f32); TG (n_tokens == 1) uses 8-row fused kernels (gdn_mul_dot8_f32, …). Each fused kernel performs the gate-multiply, K-dot, rank-1 update, and Q-dot across multiple state rows in a single vector pass, amortizing the K/Q/gate load cost.

Inner loops use full-vector unaligned stores (hvx_vmemu(dst + i*epv) = out) on the hot path, with a masked partial store for any trailing < 128-byte chunk; this avoids the vlalign rotation and dual-predicate path of hvx_vec_store_u(..., 128, …). The supported range is S_v ≤ HTP_GDN_MAX_SV (= 128); larger states fall back to CPU.
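
A sketch of that store pattern (epv = 32 f32 elements per 128-byte HVX vector; hvx_vmemu is the project helper quoted above, while compute_row_chunk and hvx_store_partial are hypothetical stand-ins for the fused per-chunk math and the masked tail store):

    #include <hexagon_types.h>  // HVX_Vector; hvx_vmemu comes from the backend's HVX utils

    HVX_Vector compute_row_chunk(int i);                        // hypothetical
    void hvx_store_partial(float *p, HVX_Vector v, int nbytes); // hypothetical

    void gdn_store_row(float *dst, int n) {
        const int epv  = 128 / (int) sizeof(float);  // 32 f32 per vector
        const int full = n / epv;
        for (int i = 0; i < full; i++) {
            // hot path: direct full-vector unaligned store
            hvx_vmemu(dst + i * epv) = compute_row_chunk(i);
        }
        const int rem = n - full * epv;
        if (rem) {
            // masked partial store for the trailing < 128-byte chunk,
            // instead of the vlalign + dual-predicate path
            hvx_store_partial(dst + full * epv, compute_row_chunk(full),
                              rem * (int) sizeof(float));
        }
    }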

For TG mode, the state matrix is staged into VTCM (8MB on V81) at the start of each chunk and copied back at the end:

  • S_v=64: 16KB/head, 8 threads × 16KB = 128KB fits
  • S_v=128: 64KB/head, 8 threads × 64KB = 512KB fits
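
The per-head sizes follow directly from S_v² f32 elements; a minimal helper for the staging budget (illustrative, not the kernel's API):

    #include <stddef.h>

    // VTCM bytes needed to stage TG state: one S_v x S_v f32 matrix
    // per thread, with the 8-thread split described above.
    static size_t gdn_vtcm_stage_bytes(int S_v, int n_threads) {
        size_t per_head = (size_t) S_v * (size_t) S_v * sizeof(float);
        return (size_t) n_threads * per_head;
        // S_v=64: 8 * 16 KB = 128 KB;  S_v=128: 8 * 64 KB = 512 KB
    }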

Benchmark results

All results measured against master @ d77599234. The PR build includes the full-vector store and vectorized-tail optimizations applied after review feedback (hvx_vmemu direct full-vector unaligned stores; mask + partial store for tail elements).

Snapdragon 8 Elite Gen 5 (SM8850, HTP V81)

Config: --device HTP0 --mmap 0 --poll 1000 -t 6 --cpu-mask 0xfc --cpu-strict 1 --ubatch-size 256 -fa 1 -ngl 99

Qwen3.5-0.8B Q4_0 (S_v=64, NDEV=1)

Test    master (t/s)   This PR (t/s)   Improvement
pp64    195.65         299.12          +52.9%
pp128   250.38         428.38          +71.1%
pp256   286.72         526.02          +83.5%
pp512   274.38         491.94          +79.3%
tg64    11.88          19.26           +62.2%
tg128   11.83          19.16           +61.9%

Qwen3.5-0.8B Q8_0 (S_v=64, NDEV=1)

Test    master (t/s)   This PR (t/s)   Improvement
pp64    171.75         294.99          +71.8%
pp128   229.61         442.10          +92.5%
pp256   266.20         559.85          +110.3%
pp512   259.10         529.17          +104.2%
tg64    10.94          17.98           +64.3%
tg128   11.27          17.94           +59.2%

Qwen3.5-4B Q4_0 (S_v=128, NDEV=1)

Test    master (t/s)   This PR (t/s)   Improvement
pp64    48.58          66.49           +36.9%
pp128   69.22          105.38          +52.2%
pp256   74.53          112.86          +51.4%
pp512   68.37          107.41          +57.1%
tg64    4.51           6.85            +51.9%
tg128   4.50           6.85            +52.0%

Snapdragon 8 Gen 3 (SM8650, HTP V75)

Config: --device HTP0 --mmap 0 --poll 1000 -t 6 --cpu-mask 0xfc --cpu-strict 1 --ubatch-size 256 -fa 1 -ngl 99

Qwen3.5-0.8B Q4_0 (S_v=64, NDEV=1)

Test    master (t/s)   This PR (t/s)   Improvement
pp64    67.00          95.22           +42.1%
pp128   80.37          120.60          +50.0%
pp256   92.76          134.14          +44.6%
pp512   91.09          133.03          +46.1%
tg64    7.58           10.22           +34.8%
tg128   8.60           10.26           +19.3%

Qwen3.5-0.8B Q8_0 (S_v=64, NDEV=1)

Test    master (t/s)   This PR (t/s)   Improvement
pp64    69.38          99.66           +43.7%
pp128   83.22          120.83          +45.2%
pp256   95.01          154.31          +62.4%
pp512   93.33          150.58          +61.4%
tg64    8.21           9.94            +21.1%
tg128   8.34           10.02           +20.2%

Qwen3.5-4B Q4_0 (S_v=128, NDEV=1)

Test    master (t/s)   This PR (t/s)   Improvement
pp64    21.54          29.00           +34.6%
pp128   27.59          39.87           +44.5%
pp256   28.44          40.15           +41.2%
pp512   28.23          39.85           +41.2%
tg64    2.82           3.50            +24.0%
tg128   3.04           3.65            +19.9%

Correctness test

Verified with test-backend-ops on SM8850 (HTP V81), comparing HTP output against CPU reference (NMSE threshold 1e-7):

adb shell "cd /data/local/tmp/llama.cpp && \
  LD_LIBRARY_PATH=./lib ADSP_LIBRARY_PATH=./lib \
  ./bin/test-backend-ops test -b HTP0 -o GATED_DELTA_NET"
Backend 1/2: HTP0
  Device description: Hexagon
  Device memory: 0 MB (0 MB free)

  GATED_DELTA_NET(type=f32,head_count=32,head_size=128,n_seq_tokens=1,n_seqs=1,v_repeat=1,permuted=0,kda=0): OK
  GATED_DELTA_NET(type=f32,head_count=32,head_size=16,n_seq_tokens=1,n_seqs=1,v_repeat=1,permuted=0,kda=0): OK
  GATED_DELTA_NET(type=f32,head_count=32,head_size=16,n_seq_tokens=1,n_seqs=1,v_repeat=1,permuted=1,kda=1): OK
  GATED_DELTA_NET(type=f32,head_count=32,head_size=16,n_seq_tokens=1,n_seqs=1,v_repeat=1,permuted=0,kda=1): OK
  GATED_DELTA_NET(type=f32,head_count=16,head_size=64,n_seq_tokens=1,n_seqs=2,v_repeat=1,permuted=0,kda=0): OK
  GATED_DELTA_NET(type=f32,head_count=4,head_size=64,n_seq_tokens=4,n_seqs=1,v_repeat=1,permuted=0,kda=0): OK
  GATED_DELTA_NET(type=f32,head_count=4,head_size=64,n_seq_tokens=4,n_seqs=2,v_repeat=1,permuted=0,kda=0): OK
  GATED_DELTA_NET(type=f32,head_count=8,head_size=32,n_seq_tokens=4,n_seqs=2,v_repeat=2,permuted=0,kda=0): OK
  GATED_DELTA_NET(type=f32,head_count=4,head_size=64,n_seq_tokens=4,n_seqs=2,v_repeat=1,permuted=1,kda=0): OK
  GATED_DELTA_NET(type=f32,head_count=4,head_size=64,n_seq_tokens=4,n_seqs=1,v_repeat=1,permuted=1,kda=0): OK
  GATED_DELTA_NET(type=f32,head_count=4,head_size=64,n_seq_tokens=1,n_seqs=1,v_repeat=1,permuted=0,kda=1): OK
  GATED_DELTA_NET(type=f32,head_count=4,head_size=64,n_seq_tokens=1,n_seqs=2,v_repeat=1,permuted=0,kda=1): OK
  GATED_DELTA_NET(type=f32,head_count=4,head_size=16,n_seq_tokens=1,n_seqs=2,v_repeat=1,permuted=0,kda=1): OK
  GATED_DELTA_NET(type=f32,head_count=4,head_size=32,n_seq_tokens=4,n_seqs=1,v_repeat=1,permuted=0,kda=1): OK
  GATED_DELTA_NET(type=f32,head_count=4,head_size=64,n_seq_tokens=4,n_seqs=2,v_repeat=1,permuted=0,kda=1): OK
  GATED_DELTA_NET(type=f32,head_count=8,head_size=32,n_seq_tokens=4,n_seqs=2,v_repeat=2,permuted=0,kda=1): OK
  GATED_DELTA_NET(type=f32,head_count=4,head_size=64,n_seq_tokens=4,n_seqs=2,v_repeat=1,permuted=1,kda=1): OK
  GATED_DELTA_NET(type=f32,head_count=4,head_size=16,n_seq_tokens=4,n_seqs=2,v_repeat=1,permuted=1,kda=1): OK
  18/18 tests passed
  Backend HTP0: OK
2/2 backends passed
OK

How to reproduce

  1. Build:

    cmake --preset arm64-android-snapdragon-release -B build-snapdragon
    cmake --build build-snapdragon -j$(nproc)
    cmake --install build-snapdragon --prefix pkg-snapdragon/llama.cpp
  2. Push and run:

    adb push pkg-snapdragon/llama.cpp /data/local/tmp/
    adb push <path-to-gguf>/Qwen3.5-0.8B-Q8_0.gguf /data/local/tmp/gguf/
    
    adb shell "
      cd /data/local/tmp/llama.cpp;
      LD_LIBRARY_PATH=./lib
      ADSP_LIBRARY_PATH=./lib
        ./bin/llama-bench --device HTP0 --mmap 0 \
            -m /data/local/tmp/gguf/Qwen3.5-0.8B-Q8_0.gguf \
            --poll 1000 -t 6 --cpu-mask 0xfc --cpu-strict 1 \
            --ubatch-size 256 -fa 1 -ngl 99 \
            -p 128 -n 128
    "

    For 4B Q8_0 (requires NDEV=2 to fit within the per-session 3.5 GB limit):

    adb shell "
      cd /data/local/tmp/llama.cpp;
      LD_LIBRARY_PATH=./lib
      ADSP_LIBRARY_PATH=./lib
      GGML_HEXAGON_NDEV=2
        ./bin/llama-bench --device HTP0,HTP1 --mmap 0 \
            -m /data/local/tmp/gguf/Qwen3.5-4B-Q8_0.gguf \
            --poll 1000 -t 6 --cpu-mask 0xfc --cpu-strict 1 \
            --ubatch-size 256 -fa 1 -ngl 99 \
            -p 128 -n 128
    "

    Note: with NDEV=2 and 4B Q8_0, tg128 runs at ~6 t/s and takes ~2 minutes to complete. Allow adequate time.

Key env vars: GGML_HEXAGON_NDEV=2 (for >3.5GB models).

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: Yes, used Claude to generate the initial version, then reviewed, tested, and further optimized manually.

@wyanzhao wyanzhao requested a review from max-krasnyansky May 8, 2026 11:19
@wyanzhao wyanzhao self-assigned this May 8, 2026