Hello, I’m porting code that worked without DeepSpeed to a DeepSpeed version and checking whether the results can be replicated. However, the DeepSpeed version does not seem to work properly, so I’d like to ask for guidance.
TLDR:
- Cannot access gradients from model_engine (only from the model that model_engine wraps)
- Gradients do not seem to be aggregated across GPUs after .backward (i.e. when we run with 4 GPUs at a per-GPU batch size of 32, the input batch size is 32 and the gradient batch size is also 32)
- The validation (eval mode) loss for the same data is the same on all ranks except rank 0
We would greatly appreciate it if anyone could give us even a slight hint for debugging this.
Below is a detailed explanation.
Background
The DeepSpeed and non-DeepSpeed versions of the code exist as separate files.
- DeepSpeed version: pretrain_main_deepspeed.py, pretrain_trainer_deepspeed.py
- Non-DeepSpeed version: pretrain_main.py, pretrain_trainer.py
Experimental conditions:
- IDENTICAL: global batch size, lr, wd, comparable optimizers (AdamW, FusedAdam)
- DIFFERENT: number of GPUs
=> We expect them to give the same training curves.
I found that the 1-node 4-GPU DeepSpeed version seems to work fine based on validation loss, following the loss curve of the 1-node 4-GPU run with torch’s Data Parallel. (Orange: DeepSpeed, Blue: no DeepSpeed)
However, in the 16-node condition, the loss does not converge to the expected level. I’ve tried the variants below (the batch-size / lr arithmetic I used is sketched right after this list).
- (a) same effective batch size: batch_size 2 (effective batch size = 2 * 16 nodes * 4 GPUs = 128)
- (b) batch_size (same global batch size: 128, or same per-GPU batch size: 32) x lr (optimal lr from the 1-node condition * N or * sqrt(N))
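For reference, here is a minimal sketch of the arithmetic behind (a) and (b); base_lr and the node/GPU counts are placeholders rather than my exact values.

```python
import math

# Placeholder values; not my exact configuration.
nodes, gpus_per_node = 16, 4
world_size = nodes * gpus_per_node                 # 64 GPUs

# (a) keep the effective (global) batch size at 128
per_gpu_batch_a = 2
effective_batch_a = per_gpu_batch_a * world_size   # 2 * 64 = 128

# (b) keep the per-GPU batch size and scale the lr instead
per_gpu_batch_b = 32
effective_batch_b = per_gpu_batch_b * world_size   # 32 * 64 = 2048

base_lr = 1e-3                                     # optimal lr found in the 1-node condition
N = effective_batch_b / 128                        # growth factor of the global batch size
lr_linear = base_lr * N                            # linear scaling rule
lr_sqrt = base_lr * math.sqrt(N)                   # sqrt scaling rule
```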
Implementation details
I’ve attached the script I used.
Note) train_micro_batch_size_per_gpu is not used in the actual code; the batch_size argument is the actual batch size fed into the DataLoader.
```python
data_loader = DataLoader(
    train_dataset,
    batch_size=params.batch_size,       # This is the per-GPU batch size
    sampler=train_sampler,
    num_workers=16,
    shuffle=(train_sampler is None),    # Shuffle is True if not using the DDP sampler
    persistent_workers=True,
    prefetch_factor=3,
    pin_memory=True,
    drop_last=True,
    collate_fn=collate_fn_for_data_info,
)
```
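For context, this is roughly the shape of the engine and sampler setup around that DataLoader; the config values, optimizer block, and variable names are simplified placeholders rather than my exact script.

```python
import deepspeed
from torch.utils.data.distributed import DistributedSampler

# Simplified placeholder config; in the actual run the DataLoader above
# (params.batch_size) is what determines the per-GPU batch fed to the model.
ds_config = {
    "train_micro_batch_size_per_gpu": params.batch_size,
    "gradient_accumulation_steps": 1,
    "optimizer": {"type": "AdamW", "params": {"lr": params.lr, "weight_decay": params.wd}},
}

model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)

# Each rank reads a disjoint shard, so the per-rank batch stays at params.batch_size
# and the effective batch is params.batch_size * world_size.
train_sampler = DistributedSampler(
    train_dataset,
    num_replicas=deepspeed.comm.get_world_size(),
    rank=deepspeed.comm.get_rank(),
    shuffle=True,
)
```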
Problem & Questions
So I’m guessing there is a fundamental problem with my DeepSpeed code.
I printed the output below to check whether gradients are accumulated across ranks.
There were three problems:
- Cannot inspect gradients from model_engine
- Gradient shape is not as expected
- Validation loss with fake data differs only on rank 0
Questions
- How can I inspect gradients across ranks?
I’ve tried the two versions below: model printed out gradients while model_engine didn’t. (2-node 4-GPU; a sketch of the cross-rank check I have in mind follows the log below.)

```python
def debug_gradients(self, batch_idx, log_every=10):
    """Quick gradient debugging"""
    if batch_idx % log_every != 0:
        return
    rank = deepspeed.comm.get_rank() if self.params.deepspeed else 8888
    print(f"\n[Rank {rank}] Gradient Check - Batch {batch_idx}")
    print("✅MODEL VERSION")
    for name, param in self.model.named_parameters():
        grad = deepspeed.utils.safe_get_full_grad(param)
        if grad is not None:
            print("Gradient Shape: ", name, grad.shape)
        else:
            print("Gradient is None for parameter:", name)
    print("✅MODEL ENGINE VERSION")
    for name, param in self.model_engine.named_parameters():
        if param.grad is not None:
            grad_norm = param.grad.norm().item()
            print(f"  {name}: shape={param.grad.shape}, norm={grad_norm:.4f}")
        else:
            print(f"  {name}: No gradient")
```
```
x.shape: torch.Size([32, 19, 30, 500])
✅MODEL VERSION
Gradient is None for parameter: mask_encoding
Gradient Shape: embedding.0.proj_in.0.weight torch.Size([32, 1, 1, 63])
Gradient Shape: embedding.0.proj_in.0.bias torch.Size([32])
Gradient Shape: embedding.0.proj_in.1.weight torch.Size([32])
Gradient Shape: embedding.0.proj_in.1.bias torch.Size([32])
Gradient Shape: embedding.0.proj_in.3.weight torch.Size([32, 32, 1, 3])
✅MODEL ENGINE VERSION
module.mask_encoding: No gradient
module.embedding.0.proj_in.0.weight: No gradient
module.embedding.0.proj_in.0.bias: No gradient
module.embedding.0.proj_in.1.weight: No gradient
module.embedding.0.proj_in.1.bias: No gradient
```
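To make the question concrete, this is the cross-rank check I have in mind, a minimal sketch assuming deepspeed.comm / safe_get_full_grad are the right tools to call from inside the training loop:

```python
import deepspeed

def check_gradient_sync(model_engine):
    """Compare each rank's local gradient norm against the cross-rank average."""
    rank = deepspeed.comm.get_rank()
    world_size = deepspeed.comm.get_world_size()
    for name, param in model_engine.module.named_parameters():
        grad = deepspeed.utils.safe_get_full_grad(param)
        if grad is None:
            continue
        local_norm = grad.norm().detach().clone()
        avg_norm = local_norm.clone()
        deepspeed.comm.all_reduce(avg_norm)   # sums over ranks in place
        avg_norm /= world_size
        # If gradients are already averaged across ranks, local and average norms match.
        print(f"[rank {rank}] {name}: local={local_norm.item():.6f} avg={avg_norm.item():.6f}")
        break  # one parameter is enough for a sanity check
```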
- Also, the input shape and gradient shape were not what I expected. (1-node 4-GPU)
Since batch_size is 32 per GPU, I think the gradient shape should be (128, -, -, -), since the gradient should be shared across model_engine. How should I approach this? (What I mean by checking the effective batch is sketched below.)
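For clarity, this is a minimal sketch of what I mean by checking the effective batch across ranks; the function name and call site are hypothetical.

```python
import torch
import deepspeed

def check_effective_batch(x):
    """Sum each rank's local batch dimension to get the effective global batch size."""
    local_bs = torch.tensor([x.shape[0]], device=x.device, dtype=torch.long)
    global_bs = local_bs.clone()
    deepspeed.comm.all_reduce(global_bs)      # sums per-rank batch sizes in place
    print(f"[rank {deepspeed.comm.get_rank()}] "
          f"local batch={local_bs.item()} global batch={global_bs.item()}")
```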
- I’ve checked the validation loss with fake data (same shape as the input x, using torch.ones and torch.zeros) and found that the validation loss differs only on rank 0. (2-node 4-GPU)
```python
def validate(self, epoch, normalize_factor=100.0):
    self.model_engine.eval()
    valid_losses_rank = []  # Losses on this rank
    # tqdm only on rank 0
    iterable_valid_loader = self.valid_data_loader
    if self.is_rank_0:
        iterable_valid_loader = tqdm(self.valid_data_loader, desc=f"Validation Epoch {epoch}", mininterval=10)
    with torch.no_grad():
        for batch_idx, (x, data_info_list) in enumerate(iterable_valid_loader):
            x = x.to(self.device, dtype=torch.float32) / 100.0
            print("validation x.shape: ", x.shape)
            ##!
            fake_data = torch.ones_like(x, device=self.device, dtype=torch.float32) / 2  # Create (same) fake data across ranks
            fake_loss = self.SSL.compute_loss(fake_data, data_info_list=data_info_list)
            print(f"Fake loss for validation for rank {deepspeed.comm.get_rank()}: {fake_loss.item()}")  # Print fake loss for debugging
            ##!
```
```
Fake loss for validation for rank 0: 0.8215165138244629
Fake loss for validation for rank 4: 0.8218275308609009
Fake loss for validation for rank 5: 0.8218275308609009
Fake loss for validation for rank 7: 0.8218275308609009
Fake loss for validation for rank 6: 0.8218275308609009
Fake loss for validation for rank 2: 0.8218275308609009
Fake loss for validation for rank 1: 0.8218275308609009
Fake loss for validation for rank 3: 0.8218275308609009
```
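As a follow-up, here is a minimal sketch of how I plan to check whether rank 0’s weights have actually drifted from the other ranks; it assumes deepspeed.comm.all_gather mirrors torch.distributed’s semantics.

```python
import torch
import deepspeed

def compare_weights_across_ranks(model_engine):
    """Print a cheap per-rank parameter checksum so a diverging rank stands out."""
    rank = deepspeed.comm.get_rank()
    world_size = deepspeed.comm.get_world_size()
    device = next(model_engine.module.parameters()).device
    checksum = torch.zeros(1, device=device)
    for param in model_engine.module.parameters():
        checksum += param.detach().float().sum()
    gathered = [torch.zeros_like(checksum) for _ in range(world_size)]
    deepspeed.comm.all_gather(gathered, checksum)
    if rank == 0:
        print("Per-rank parameter checksums:", [t.item() for t in gathered])
```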