Running Llama4 quantized on 2xH100 80GB #17628
Replies: 2 comments
From my point of view, the gap between rough weight-size math and actual runtime memory is probably the core issue. Mixture-of-experts routing, KV cache growth, allocator fragmentation, and parallelism strategy can easily consume the margin that made int8 look sufficient on paper. I would expect a useful answer here to talk not only about quant format, but also tensor-parallel layout, max context, and how much cache budget remains after the model is loaded.
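The cache-budget point can be made concrete with a rough back-of-envelope estimate. All architecture numbers below (layer count, KV heads, head dim, context, batch) are illustrative placeholders, not Llama4's real config:

```python
# Rough VRAM budget: weights + KV cache. Model numbers are placeholders.
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch, dtype_bytes=2):
    # K and V each store (batch, seq_len, n_kv_heads, head_dim) per layer
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * dtype_bytes

total_vram = 2 * 80e9  # 2x H100 80GB
weights = 110e9        # the int8 estimate from this thread
cache = kv_cache_bytes(n_layers=80, n_kv_heads=8, head_dim=128,
                       seq_len=32768, batch=4)
print(f"weights: {weights/1e9:.0f} GB, kv cache: {cache/1e9:.1f} GB, "
      f"headroom: {(total_vram - weights - cache)/1e9:.1f} GB")
```

With those placeholder numbers the KV cache alone eats ~43 GB, which is how an "it fits on paper" int8 load still OOMs once serving starts.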
We've run into similar constraints deploying large Llama models on H100s, even with aggressive quantization. int8 quantization does cut parameter size roughly in half, but inference memory also depends on activations and the attention KV cache, especially with vLLM, which reserves its KV cache up front. For Llama4, even if the parameters fit, activations during inference can spike VRAM usage. fp8 isn't widely supported for Llama yet; most quant libraries, like bitsandbytes or auto-gptq, focus on int8/4/3. For int8 we've had the best results with auto-gptq, but even then vLLM sometimes hits CUDA OOM because it preloads weights and allocates the KV cache up front. Try tuning vLLM's settings:

```yaml
tensor_parallel_size: 2
max_num_seqs: 16   # lower this if you're batch serving
block_size: 32     # smaller block sizes help with memory spikes
```

Also, double-check the quantized checkpoint: some formats include extra metadata or fp16 weights that vLLM loads by default. We've had success stripping those out with:

```python
from transformers import AutoModelForCausalLM

# Round-trip the checkpoint to drop stray metadata and extra weights.
model = AutoModelForCausalLM.from_pretrained(
    'your_quantized_checkpoint', device_map='auto', torch_dtype='auto')
model.save_pretrained('cleaned_quant')
```

If you still hit OOM, try the most aggressive quant (int4 with GPTQ). On 2xH100s we've managed up to Llama2-70B int4, but Llama4's size might push the limits even further. Let me know what quant method and checkpoint format you're using; sometimes the culprit is actually hidden extra fp16 weights.
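For scale, here is the width arithmetic behind the "most aggressive quant" suggestion. This is a sketch: the parameter count is back-derived from the thread's ~110 GB int8 figure, not an official Llama4 number, and the overhead factor for quant scales and unquantized layers is a guess:

```python
# Rough weight footprint at different quant widths. n_params is
# back-derived from the thread's ~110 GB int8 estimate (hypothetical);
# overhead covers quant scales, embeddings, and norm layers (guess).
def weight_gb(n_params, bits_per_weight, overhead=1.05):
    return n_params * bits_per_weight / 8 / 1e9 * overhead

n_params = 110e9
for bits in (16, 8, 4):
    print(f"{bits:>2}-bit: {weight_gb(n_params, bits):6.1f} GB")
```

Under these assumptions int8 lands around 115 GB and int4 around 58 GB, which is why int4 leaves meaningful room for cache and activations while int8 barely fits the weights.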
Hey everyone,
Has anyone had success running Llama4 with something like `fp8` or `experts_int8` quantization? Right now 4-bit is a bit too much, and 3-bit is impossible due to Llama's architecture, so I've been experimenting with different quant types to fit on 2xH100 (160 GB total) GPUs, but with no success: I keep running into CUDA out-of-memory errors.
I thought int8 would be sufficient, since it should roughly halve the parameter size (i.e. about 110 GB of VRAM required), but it didn't work.
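The halving math checks out for the weights alone; what disappears at runtime is the leftover budget. A minimal sketch of the remaining headroom (the per-GPU overhead figure is a rough guess, not a measurement):

```python
# Budget left on 2x H100 80GB after int8 weights, per the estimate above.
total_gb = 2 * 80.0            # combined VRAM
weights_gb = 110.0             # the rough "halved fp16" int8 estimate
runtime_overhead_gb = 2 * 1.5  # CUDA context + framework buffers per GPU (guess)
remaining_gb = total_gb - weights_gb - runtime_overhead_gb
print(f"{remaining_gb:.0f} GB left for KV cache, activations, and fragmentation")
```

Everything the server allocates after load (KV cache, activation spikes, fragmentation) has to fit in that remainder, which is where the OOM likely comes from.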