hexagon: add HTP kernel for GGML_OP_GATED_DELTA_NET#5
Implement the Gated Delta Net recurrence on HVX with:

- 4-row fused kernels for the PP (prompt processing) path
- 8-row fused kernels for the TG (token generation) path, reducing K/Q/gate vector reload overhead by 2x
- Separate PP/TG thread functions for I-cache isolation
- VTCM state scratchpad with DMA in/out for single-cycle TG access
- Vectorized gate exp via `hvx_exp_f32`
Overview
Add a high-performance HVX kernel for `GGML_OP_GATED_DELTA_NET` on Hexagon HTP, enabling Gated Delta Net models (e.g. Qwen3.5) to run the recurrence entirely on-device instead of falling back to CPU.

Key optimizations:

- `hvx_exp_f32()`: 32 floats/vector vs scalar `expf()`

Performance vs HTP without this kernel (op falls back to CPU), measured against master @ d77599234. Measured on Snapdragon SM8850 (HTP V81) and SM8650 (HTP V75).
Additional information
Implementation details
Gated Delta Net maintains a per-head state matrix `S` of size `S_v × S_v` (64×64 for 0.8B, 128×128 for 4B). Each token decays `S` by the gate, applies a rank-1 delta-rule correction from `k`/`v`/`beta`, and reads the output out against `q`.

The kernel processes heads in parallel across HVX threads. PP (`n_tokens > 1`) uses 4-row fused kernels (`gdn_mul_dot4_f32`, `gdn_mul_scalar_dot4_f32`, `gdn_add_scaled_dot4_f32`); TG (`n_tokens == 1`) uses 8-row fused kernels (`gdn_mul_dot8_f32`, …). Each fused kernel performs gate-multiply, K-dot, rank-1 update, and Q-dot across multiple state rows in a single vector pass, amortizing the K/Q/gate load cost.

Inner loops use full-vector unaligned stores (`hvx_vmemu(dst + i*epv) = out`) on the hot path, with a masked partial store for any trailing < 128-byte chunk; this avoids the `vlalign` rotation and dual-predicate path of `hvx_vec_store_u(..., 128, …)`.

The supported range is `S_v ≤ HTP_GDN_MAX_SV` (= 128); larger states fall back to CPU.

For TG mode, the state matrix is staged into VTCM (8 MB on V81) at the start of each chunk and copied back at the end.
Benchmark results
All results measured against master @ d77599234. The PR build includes the full-vector store + vectorized tail optimizations applied after PR review feedback (`hvx_vmemu` direct full-vector unaligned stores; mask + partial store for tail elements).

Snapdragon 8 Elite Gen 5 (SM8850, HTP V81)
Config: `--device HTP0 --mmap 0 --poll 1000 -t 6 --cpu-mask 0xfc --cpu-strict 1 --ubatch-size 256 -fa 1 -ngl 99`

Qwen3.5-0.8B Q4_0 (S_v=64, NDEV=1)
Qwen3.5-0.8B Q8_0 (S_v=64, NDEV=1)
Qwen3.5-4B Q4_0 (S_v=128, NDEV=1)
Snapdragon 8 Gen 3 (SM8650, HTP V75)
Config: `--device HTP0 --mmap 0 --poll 1000 -t 6 --cpu-mask 0xfc --cpu-strict 1 --ubatch-size 256 -fa 1 -ngl 99`

Qwen3.5-0.8B Q4_0 (S_v=64, NDEV=1)
Qwen3.5-0.8B Q8_0 (S_v=64, NDEV=1)
Qwen3.5-4B Q4_0 (S_v=128, NDEV=1)
Correctness test
Verified with `test-backend-ops` on SM8850 (HTP V81), comparing HTP output against the CPU reference (NMSE threshold 1e-7).

How to reproduce
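For reference, NMSE normalizes the squared error by the reference signal's energy, so the 1e-7 threshold is scale-independent. A minimal sketch of the metric (my own sketch, not the actual `test-backend-ops` code; it assumes a nonzero reference):

```c
#include <stddef.h>

/* Normalized mean squared error between a backend result `a` and a
   reference `ref`: sum((a - b)^2) / sum(b^2). */
static double nmse(const float *a, const float *ref, size_t n) {
    double err = 0.0, ref2 = 0.0;
    for (size_t i = 0; i < n; i++) {
        double d = (double)a[i] - (double)ref[i];
        err  += d * d;
        ref2 += (double)ref[i] * (double)ref[i];
    }
    return err / ref2;   /* assumes ref is not all zeros */
}
```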
Build:

```sh
cmake --preset arm64-android-snapdragon-release -B build-snapdragon
cmake --build build-snapdragon -j$(nproc)
cmake --install build-snapdragon --prefix pkg-snapdragon/llama.cpp
```

Push and run:
For 4B Q8_0 (requires NDEV=2 to fit within the per-session 3.5 GB limit):
Note: with NDEV=2 and 4B Q8_0, tg128 runs at ~6 t/s and takes ~2 minutes to complete. Allow adequate time.
Key env vars:

- `GGML_HEXAGON_NDEV=2` (for >3.5 GB models)

Requirements