- Start from a Hugging Face pretrained model checkpoint with on-the-fly conversion.
- Support all kinds of model parallelism (TP, EP, ETP, PP).
- Export a unified checkpoint ready for TensorRT-LLM, vLLM, and SGLang deployment.
| Model | Quantization | EAGLE3 | Q-LoRA | Pruning (PP only) | Distillation |
|---|---|---|---|---|---|
| moonshotai/Kimi-K2-Instruct | ✅ | Online | | | |
| Qwen/Qwen3-{30B-A3B, 235B-A22B} | WAR | Online | | | |
| Qwen/Qwen3-{0.6B, 8B} | ✅ | Online | ✅ | ✅ | |
| deepseek-ai/DeepSeek-R1 | ✅ | Online | | | |
| meta-llama/Llama-{3.1-8B, 3.1-405B, 3.2-1B}-Instruct | ✅ | Online | ✅ | ✅ | |
Given that only megatron.core can be pip-installed, the examples are containerized with
Megatron-LM and TensorRT Model Optimizer pre-installed. Use the following command to build the container.
```sh
docker build --no-cache --network=host --rm -t nvidia-modelopt-megatron:latest .
```

📙 NOTE: If you plan to use `slurm` for multi-node execution, push the image to a container registry.
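For example, a minimal sketch of tagging and pushing the built image, assuming a hypothetical registry at `registry.example.com/<team>`:

```sh
# Hypothetical registry path; replace with your own registry and namespace.
docker tag nvidia-modelopt-megatron:latest registry.example.com/<team>/nvidia-modelopt-megatron:latest
docker push registry.example.com/<team>/nvidia-modelopt-megatron:latest
```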
For local execution, a READ/WRITE scratch space needs to be mounted. Mount additional volumes for checkpoints, datasets, and other artifacts.
```sh
USER_FSW=<path_to_scratch_space> bash interactive.sh
```

📙 NOTE: The current directory will be mounted to `/workspace/nmm-sandbox` (`$(pwd):/workspace/nmm-sandbox`) and the scratch space will be mounted to `/workspace/scratch`.
Provide the pretrained checkpoint path through variable ${HF_MODEL_CKPT}:
```sh
\
TP=1 \
HF_MODEL_CKPT=<pretrained_model_name_or_path> \
MLM_MODEL_SAVE=/tmp/Llama-3.2-1B-Instruct-FP8 \
bash megatron-lm/examples/post_training/modelopt/quantize.sh meta-llama/Llama-3.2-1B-Instruct fp8
```
```sh
\
PP=1 \
HF_MODEL_CKPT=<pretrained_model_name_or_path> \
MLM_MODEL_LOAD=/tmp/Llama-3.2-1B-Instruct-FP8 \
EXPORT_DIR=/tmp/Llama-3.2-1B-Instruct-Export \
bash megatron-lm/examples/post_training/modelopt/export.sh meta-llama/Llama-3.2-1B-Instruct
```
You can find a resumable Megatron-LM checkpoint for quantization-aware training or simulated evaluation
(`/tmp/Llama-3.2-1B-Instruct-FP8`) and a Hugging Face-like exported checkpoint for
deployment (`/tmp/Llama-3.2-1B-Instruct-Export`).
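As a quick sanity check, the exported checkpoint can be served directly; below is a minimal sketch with vLLM, assuming vLLM is installed in your environment and the FP8 export is supported on your GPU (TensorRT-LLM and SGLang can consume the same directory):

```sh
# Serve the exported Hugging Face-like checkpoint with vLLM (sketch; verify
# FP8 support for your GPU and vLLM version before relying on this).
vllm serve /tmp/Llama-3.2-1B-Instruct-Export
```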
Online EAGLE3 training keeps both the target (frozen) model and the draft model in memory, and the `hidden_states`
required for training are generated on the fly.
```sh
\
TP=1 \
HF_MODEL_CKPT=<pretrained_model_name_or_path> \
MLM_MODEL_SAVE=/tmp/Llama-3.2-1B-Eagle3 \
bash megatron-lm/examples/post_training/modelopt/eagle3.sh meta-llama/Llama-3.2-1B-Instruct
```
```sh
\
PP=1 \
HF_MODEL_CKPT=<pretrained_model_name_or_path> \
MLM_MODEL_LOAD=/tmp/Llama-3.2-1B-Eagle3 \
EXPORT_DIR=/tmp/Llama-3.2-1B-Eagle3-Export \
bash megatron-lm/examples/post_training/modelopt/export.sh meta-llama/Llama-3.2-1B-Instruct
```

Periodically, the acceptance length (AL) is evaluated on MT-Bench prompts. You can find a resumable
Megatron-LM checkpoint (`/tmp/Llama-3.2-1B-Eagle3`) and a Hugging Face-like exported checkpoint
for deployment (`/tmp/Llama-3.2-1B-Eagle3-Export`).
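The exported draft model can then be used for speculative decoding in a compatible serving framework. A rough sketch with vLLM follows; the flag name, JSON schema, and EAGLE3 support depend on your vLLM version, so treat this as an assumption to verify against the vLLM documentation:

```sh
# Sketch only (verify flags against your vLLM version): serve the base model
# and use the exported EAGLE3 draft checkpoint for speculative decoding.
vllm serve meta-llama/Llama-3.2-1B-Instruct \
    --speculative-config '{"method": "eagle3", "model": "/tmp/Llama-3.2-1B-Eagle3-Export", "num_speculative_tokens": 3}'
```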
See ADVANCED.md for a multi-GPU, multi-node training example for moonshotai/Kimi-K2-Instruct.
Coming soon ...
Check out the pruning getting-started section and the guidelines for configuring pruning parameters in the pruning README.
Pruning is supported for GPT and Mamba models in Pipeline Parallel mode. Available pruning options are:
- `TARGET_FFN_HIDDEN_SIZE`
- `TARGET_HIDDEN_SIZE`
- `TARGET_NUM_ATTENTION_HEADS`
- `TARGET_NUM_QUERY_GROUPS`
- `TARGET_MAMBA_NUM_HEADS`
- `TARGET_MAMBA_HEAD_DIM`
- `TARGET_NUM_LAYERS`
- `LAYERS_TO_DROP` (comma-separated, 1-indexed list of layer numbers to directly drop)
Example for depth pruning Qwen3-8B from 36 to 24 layers:
```sh
PP=1 \
TARGET_NUM_LAYERS=24 \
HF_MODEL_CKPT=<pretrained_model_name_or_path> \
MLM_MODEL_SAVE=Qwen3-8B-Pruned \
bash megatron-lm/examples/post_training/modelopt/prune.sh qwen/Qwen3-8B
```

📙 TIP: If the number of layers in the model is not divisible by the pipeline parallel size (PP), you can configure uneven
PP by setting `MLM_EXTRA_ARGS="--decoder-first-pipeline-num-layers <X> --decoder-last-pipeline-num-layers <Y>"`.
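In addition to depth pruning, the width-related targets listed above can be combined in the same way. A minimal width-pruning sketch follows; the target sizes below are hypothetical and should be chosen for your own accuracy and latency budget:

```sh
# Hypothetical width-pruning targets; pick values appropriate for your model.
PP=1 \
TARGET_FFN_HIDDEN_SIZE=9216 \
TARGET_HIDDEN_SIZE=3072 \
HF_MODEL_CKPT=<pretrained_model_name_or_path> \
MLM_MODEL_SAVE=Qwen3-8B-Width-Pruned \
bash megatron-lm/examples/post_training/modelopt/prune.sh qwen/Qwen3-8B
```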
For simplicity, we use shell scripts and variables as arguments. Each script has at least one positional
argument, `[pretrained_model_card]`. Some scripts may require more; for example, `[qformat]` is needed for
quantization.
```sh
\
HF_MODEL_CKPT=<pretrained_model_name_or_path> \
bash megatron-lm/examples/post_training/modelopt/quantize.sh [pretrained_model_card] [qformat]
```

❗ IMPORTANT:
`pretrained_model_card` CANNOT be a path to a local pretrained checkpoint. It is used to get the corresponding Megatron-LM `${MODEL_ARGS}`. For example, `meta-llama/Llama-3.1-8B-Instruct` and `deepseek-ai/DeepSeek-R1` are both supported.
Provide the pretrained checkpoint through the variable `${HF_MODEL_CKPT}` on the command line or in `env_setup_template.sh`. More variables (e.g. `${TP}`, `${EP}`, ...) can be provided through the command line, but we recommend passing all variables in another shell script.
When `${HF_MODEL_CKPT}` is not set through the command line, `./env_setup_template.sh` can be used
to pass all variables instead. If you have your own script, use `${SANDBOX_ENV_SETUP}`.
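As an illustration, such a script simply sets the variables the example scripts read. A minimal sketch (the file name, checkpoint path, and values below are placeholders):

```sh
#!/bin/bash
# Hypothetical my_env_setup.sh: set the variables the example scripts read.
# Values below are placeholders; adjust for your model, paths, and hardware.
export HF_MODEL_CKPT=/models/Llama-3.2-1B-Instruct
export TP=1
export PP=1
export MLM_MODEL_SAVE=/workspace/scratch/Llama-3.2-1B-Instruct-FP8
```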
```sh
\
SANDBOX_ENV_SETUP=<path_to_your_script> \
bash megatron-lm/examples/post_training/modelopt/quantize.sh [pretrained_model_card] [qformat]
```

If you use our slurm script, then you MUST USE `${SANDBOX_ENV_SETUP}` (default: `./env_setup_template.sh`), because
other variables are not passed through `sbatch` and `srun` automatically.
See ADVANCED.md to learn all the configurable variables.