Running Llama4 quantized on 2xH100 80GB #17628
Replies: 2 comments
From my point of view, the gap between rough weight-size math and actual runtime memory is probably the core issue. Mixture-of-experts routing, KV cache growth, allocator fragmentation, and parallelism strategy can easily consume the margin that made int8 look sufficient on paper. I would expect a useful answer here to talk not only about quant format, but also tensor-parallel layout, max context, and how much cache budget remains after the model is loaded.
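The cache-budget point can be made concrete with a rough back-of-envelope estimate. All architecture numbers below (layer count, KV heads, head dim, context, batch) are illustrative placeholders, not Llama4's real config:

```python
# Rough VRAM budget: weights + KV cache. Model numbers are placeholders.
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch, dtype_bytes=2):
    # K and V each store (batch, seq_len, n_kv_heads, head_dim) per layer
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * dtype_bytes

total_vram = 2 * 80e9  # 2x H100 80GB
weights = 110e9        # the int8 estimate from this thread
cache = kv_cache_bytes(n_layers=80, n_kv_heads=8, head_dim=128,
                       seq_len=32768, batch=4)
print(f"weights: {weights/1e9:.0f} GB, kv cache: {cache/1e9:.1f} GB, "
      f"headroom: {(total_vram - weights - cache)/1e9:.1f} GB")
```

With those placeholder numbers the KV cache alone eats ~43 GB, which is how an "it fits on paper" int8 load still OOMs once serving starts.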
We've run into similar constraints deploying large Llama models on H100s, even with aggressive quantization. int8 quantization does cut parameter size roughly in half, but inference memory also depends on activations and the attention KV cache, especially with vLLM, which reserves its KV cache up front. For Llama4, even if the parameters fit, activations during inference can spike VRAM usage. fp8 isn't widely supported for Llama yet; most quant libraries, like bitsandbytes or auto-gptq, focus on int8/4/3. For int8 we've had the best results with auto-gptq, but even then vLLM sometimes hits CUDA OOM because it preloads weights and allocates the KV cache up front. Try tuning vLLM's settings:

```yaml
tensor_parallel_size: 2
max_num_seqs: 16   # lower this if you're batch serving
block_size: 32     # smaller block sizes help with memory spikes
```

Also, double-check the quantized checkpoint: some formats include extra metadata or fp16 weights that vLLM loads by default. We've had success stripping those out with:

```python
from transformers import AutoModelForCausalLM

# Round-trip the checkpoint to drop stray metadata and extra weights.
model = AutoModelForCausalLM.from_pretrained(
    'your_quantized_checkpoint', device_map='auto', torch_dtype='auto')
model.save_pretrained('cleaned_quant')
```

If you still hit OOM, try the most aggressive quant (int4 with GPTQ). On 2xH100s we've managed up to Llama2-70B int4, but Llama4's size might push the limits even further. Let me know what quant method and checkpoint format you're using; sometimes the culprit is actually hidden extra fp16 weights.
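For scale, here is the width arithmetic behind the "most aggressive quant" suggestion. This is a sketch: the parameter count is back-derived from the thread's ~110 GB int8 figure, not an official Llama4 number, and the overhead factor for quant scales and unquantized layers is a guess:

```python
# Rough weight footprint at different quant widths. n_params is
# back-derived from the thread's ~110 GB int8 estimate (hypothetical);
# overhead covers quant scales, embeddings, and norm layers (guess).
def weight_gb(n_params, bits_per_weight, overhead=1.05):
    return n_params * bits_per_weight / 8 / 1e9 * overhead

n_params = 110e9
for bits in (16, 8, 4):
    print(f"{bits:>2}-bit: {weight_gb(n_params, bits):6.1f} GB")
```

Under these assumptions int8 lands around 115 GB and int4 around 58 GB, which is why int4 leaves meaningful room for cache and activations while int8 barely fits the weights.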
Hey everyone,
Has anyone had success running Llama4 with something like `fp8` or `experts_int8` quantization? Right now 4-bit is a bit too much, and 3-bit is impossible due to Llama's architecture, so I've been experimenting with different quant types to fit on 2xH100 (160 GB total) GPUs, but with no success: I keep running into CUDA out-of-memory errors.
I thought int8 would be sufficient, since it should roughly halve the parameter size (i.e. about 110 GB of VRAM required), but it didn't work.
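The halving math checks out for the weights alone; what disappears at runtime is the leftover budget. A minimal sketch of the remaining headroom (the per-GPU overhead figure is a rough guess, not a measurement):

```python
# Budget left on 2x H100 80GB after int8 weights, per the estimate above.
total_gb = 2 * 80.0            # combined VRAM
weights_gb = 110.0             # the rough "halved fp16" int8 estimate
runtime_overhead_gb = 2 * 1.5  # CUDA context + framework buffers per GPU (guess)
remaining_gb = total_gb - weights_gb - runtime_overhead_gb
print(f"{remaining_gb:.0f} GB left for KV cache, activations, and fragmentation")
```

Everything the server allocates after load (KV cache, activation spikes, fragmentation) has to fit in that remainder, which is where the OOM likely comes from.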