Large all reduce call before optimizer #7593
Unanswered
shubhagr-qc asked this question in Q&A
Hello everyone,
I am trying to train on 4 GPUs via the Megatron-DeepSpeed repo using the examples_deepspeed/finetune_hf_llama script with TP=4, PP=1.
While profiling communication calls, I notice that for huggyllama/llama-7b there is an all-reduce call of around 3.14 GB just before the optimizer step, which appears to be the total weight gradients computed in the preceding forward/backward passes. I am using ZeRO stage 0 and have disabled all other parallelism except TP=4.
Can someone help me understand why this all-reduce is needed? I expected that during the backward pass each GPU computes gradients only for its own shard of the weights and updates those weights locally, without needing to all-reduce the gradients across GPUs.
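For reference, here is the back-of-envelope arithmetic behind my guess that the 3.14 GB payload is roughly the per-rank gradient buffer. The parameter count and dtype below are assumptions on my part (approximate llama-7b size, fp16 gradients), not values read out of the repo:

```python
# Back-of-envelope check (assumed numbers, not taken from Megatron-DeepSpeed):
# llama-7b has roughly 6.7e9 parameters; with TP=4 each rank holds about a
# quarter of the tensor-parallel weights, and fp16 gradients take 2 bytes each.
params_total = 6.7e9      # approximate llama-7b parameter count (assumption)
tp = 4                    # tensor-parallel degree from my run
bytes_per_grad = 2        # fp16 gradient element size (assumption)

per_rank_gb = params_total / tp * bytes_per_grad / 1e9
print(f"per-rank gradient buffer: ~{per_rank_gb:.2f} GB")
# ~3.35 GB, in the same ballpark as the profiled 3.14 GB payload
```

So the size is at least consistent with "all of this rank's weight gradients in one message", which is what makes the call surprising to me under pure TP.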