Large all reduce call before optimizer #7593
Unanswered
shubhagr-qc asked this question in Q&A
Hello everyone,
I am trying to train on 4 GPUs via the Megatron-DeepSpeed repo using the examples_deepspeed/finetune_hf_llama script with TP=4, PP=1.
While profiling communication calls, I notice that for huggyllama/llama-7b there is an all-reduce call of around 3.14 GB just before the optimizer step, which appears to be the total weight gradients computed in the preceding forward/backward passes. I am using ZeRO stage 0 and have disabled all other parallelism except TP=4.
Can someone help me understand why this all-reduce is needed? I expected that during the backward pass each GPU computes gradients only for its own shard of the weights and updates those weights locally, without needing to all-reduce the gradients across GPUs.
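For reference, here is the back-of-envelope arithmetic behind my guess that the 3.14 GB payload is roughly the per-rank gradient buffer. The parameter count and dtype below are assumptions on my part (approximate llama-7b size, fp16 gradients), not values read out of the repo:

```python
# Back-of-envelope check (assumed numbers, not taken from Megatron-DeepSpeed):
# llama-7b has roughly 6.7e9 parameters; with TP=4 each rank holds about a
# quarter of the tensor-parallel weights, and fp16 gradients take 2 bytes each.
params_total = 6.7e9      # approximate llama-7b parameter count (assumption)
tp = 4                    # tensor-parallel degree from my run
bytes_per_grad = 2        # fp16 gradient element size (assumption)

per_rank_gb = params_total / tp * bytes_per_grad / 1e9
print(f"per-rank gradient buffer: ~{per_rank_gb:.2f} GB")
# ~3.35 GB, in the same ballpark as the profiled 3.14 GB payload
```

So the size is at least consistent with "all of this rank's weight gradients in one message", which is what makes the call surprising to me under pure TP.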