Hello, I’m porting code that worked without DeepSpeed to a DeepSpeed version and checking whether the results can be replicated. However, the DeepSpeed version does not seem to work properly, so I’d like to ask for guidance.
TLDR:
- Cannot access gradients from model_engine (only from the model that model_engine wraps)
- Gradients do not seem to be aggregated across GPUs after .backward (i.e. when we run with 4 GPUs at a per-GPU batch size of 32, the input batch size is 32 and the gradient batch size is also 32)
- The validation (eval mode) loss for the same data is the same on all ranks except rank 0
We would greatly appreciate it if anyone could give us even a slight hint for debugging this.
Below is a detailed explanation.
Background
The DeepSpeed and non-DeepSpeed versions of the code exist as separate files.
- DeepSpeed version: pretrain_main_deepspeed.py, pretrain_trainer_deepspeed.py
- Non-DeepSpeed version: pretrain_main.py, pretrain_trainer.py
Experimental conditions:
- IDENTICAL: global batch size, lr, wd, comparable optimizers (AdamW, FusedAdam)
- DIFFERENT: number of GPUs
=> We expect them to give the same training curves.
I found that the 1-node 4-GPU DeepSpeed version seems to work fine based on validation loss, following the loss curve of the 1-node 4-GPU run with torch’s Data Parallel. (Orange: DeepSpeed, Blue: no DeepSpeed)
However, in the 16-node condition, the loss does not converge to the expected level. I’ve tried the variants below (the batch-size / lr arithmetic I used is sketched right after this list).
- (a) same effective batch size: batch_size 2 (effective batch size = 2 * 16 nodes * 4 GPUs = 128)
- (b) batch_size (same global batch size: 128, or same per-GPU batch size: 32) x lr (optimal lr from the 1-node condition * N or * sqrt(N))
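For reference, here is a minimal sketch of the arithmetic behind (a) and (b); base_lr and the node/GPU counts are placeholders rather than my exact values.

```python
import math

# Placeholder values; not my exact configuration.
nodes, gpus_per_node = 16, 4
world_size = nodes * gpus_per_node                 # 64 GPUs

# (a) keep the effective (global) batch size at 128
per_gpu_batch_a = 2
effective_batch_a = per_gpu_batch_a * world_size   # 2 * 64 = 128

# (b) keep the per-GPU batch size and scale the lr instead
per_gpu_batch_b = 32
effective_batch_b = per_gpu_batch_b * world_size   # 32 * 64 = 2048

base_lr = 1e-3                                     # optimal lr found in the 1-node condition
N = effective_batch_b / 128                        # growth factor of the global batch size
lr_linear = base_lr * N                            # linear scaling rule
lr_sqrt = base_lr * math.sqrt(N)                   # sqrt scaling rule
```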
Implementation details
I’ve attached the script I used.
Note) train_micro_batch_size_per_gpu is not used in the actual code; the batch_size argument is the actual batch size fed into the DataLoader.
```python
data_loader = DataLoader(
    train_dataset,
    batch_size=params.batch_size,       # This is the per-GPU batch size
    sampler=train_sampler,
    num_workers=16,
    shuffle=(train_sampler is None),    # Shuffle is True if not using the DDP sampler
    persistent_workers=True,
    prefetch_factor=3,
    pin_memory=True,
    drop_last=True,
    collate_fn=collate_fn_for_data_info,
)
```
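For context, this is roughly the shape of the engine and sampler setup around that DataLoader; the config values, optimizer block, and variable names are simplified placeholders rather than my exact script.

```python
import deepspeed
from torch.utils.data.distributed import DistributedSampler

# Simplified placeholder config; in the actual run the DataLoader above
# (params.batch_size) is what determines the per-GPU batch fed to the model.
ds_config = {
    "train_micro_batch_size_per_gpu": params.batch_size,
    "gradient_accumulation_steps": 1,
    "optimizer": {"type": "AdamW", "params": {"lr": params.lr, "weight_decay": params.wd}},
}

model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)

# Each rank reads a disjoint shard, so the per-rank batch stays at params.batch_size
# and the effective batch is params.batch_size * world_size.
train_sampler = DistributedSampler(
    train_dataset,
    num_replicas=deepspeed.comm.get_world_size(),
    rank=deepspeed.comm.get_rank(),
    shuffle=True,
)
```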
Problem & Questions
So I’m guessing there is a fundamental problem with my DeepSpeed code.
I printed the output below to check whether gradients are accumulated across ranks.
There were three problems:
- Cannot inspect gradients from model_engine
- Gradient shape is not as expected
- Validation loss with fake data differs only on rank 0
Questions
- How can I inspect gradients across ranks?
I’ve tried the two versions below: model printed out gradients while model_engine didn’t. (2-node 4-GPU; a sketch of the cross-rank check I have in mind follows the log below.)

```python
def debug_gradients(self, batch_idx, log_every=10):
    """Quick gradient debugging"""
    if batch_idx % log_every != 0:
        return
    rank = deepspeed.comm.get_rank() if self.params.deepspeed else 8888
    print(f"\n[Rank {rank}] Gradient Check - Batch {batch_idx}")
    print("✅MODEL VERSION")
    for name, param in self.model.named_parameters():
        grad = deepspeed.utils.safe_get_full_grad(param)
        if grad is not None:
            print("Gradient Shape: ", name, grad.shape)
        else:
            print("Gradient is None for parameter:", name)
    print("✅MODEL ENGINE VERSION")
    for name, param in self.model_engine.named_parameters():
        if param.grad is not None:
            grad_norm = param.grad.norm().item()
            print(f"  {name}: shape={param.grad.shape}, norm={grad_norm:.4f}")
        else:
            print(f"  {name}: No gradient")
```
```
x.shape: torch.Size([32, 19, 30, 500])
✅MODEL VERSION
Gradient is None for parameter: mask_encoding
Gradient Shape: embedding.0.proj_in.0.weight torch.Size([32, 1, 1, 63])
Gradient Shape: embedding.0.proj_in.0.bias torch.Size([32])
Gradient Shape: embedding.0.proj_in.1.weight torch.Size([32])
Gradient Shape: embedding.0.proj_in.1.bias torch.Size([32])
Gradient Shape: embedding.0.proj_in.3.weight torch.Size([32, 32, 1, 3])
✅MODEL ENGINE VERSION
module.mask_encoding: No gradient
module.embedding.0.proj_in.0.weight: No gradient
module.embedding.0.proj_in.0.bias: No gradient
module.embedding.0.proj_in.1.weight: No gradient
module.embedding.0.proj_in.1.bias: No gradient
```
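To make the question concrete, this is the cross-rank check I have in mind, a minimal sketch assuming deepspeed.comm / safe_get_full_grad are the right tools to call from inside the training loop:

```python
import deepspeed

def check_gradient_sync(model_engine):
    """Compare each rank's local gradient norm against the cross-rank average."""
    rank = deepspeed.comm.get_rank()
    world_size = deepspeed.comm.get_world_size()
    for name, param in model_engine.module.named_parameters():
        grad = deepspeed.utils.safe_get_full_grad(param)
        if grad is None:
            continue
        local_norm = grad.norm().detach().clone()
        avg_norm = local_norm.clone()
        deepspeed.comm.all_reduce(avg_norm)   # sums over ranks in place
        avg_norm /= world_size
        # If gradients are already averaged across ranks, local and average norms match.
        print(f"[rank {rank}] {name}: local={local_norm.item():.6f} avg={avg_norm.item():.6f}")
        break  # one parameter is enough for a sanity check
```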
- Also, the input shape and gradient shape were not what I expected. (1-node 4-GPU)
Since batch_size is 32 per GPU, I think the gradient shape should be (128, -, -, -), since the gradient should be shared across model_engine. How should I approach this? (What I mean by checking the effective batch is sketched below.)
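For clarity, this is a minimal sketch of what I mean by checking the effective batch across ranks; the function name and call site are hypothetical.

```python
import torch
import deepspeed

def check_effective_batch(x):
    """Sum each rank's local batch dimension to get the effective global batch size."""
    local_bs = torch.tensor([x.shape[0]], device=x.device, dtype=torch.long)
    global_bs = local_bs.clone()
    deepspeed.comm.all_reduce(global_bs)      # sums per-rank batch sizes in place
    print(f"[rank {deepspeed.comm.get_rank()}] "
          f"local batch={local_bs.item()} global batch={global_bs.item()}")
```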
- I’ve checked the validation loss with fake data (same shape as the input x, using torch.ones and torch.zeros) and found that the validation loss differs only on rank 0. (2-node 4-GPU)
```python
def validate(self, epoch, normalize_factor=100.0):
    self.model_engine.eval()
    valid_losses_rank = []  # Losses on this rank
    # tqdm only on rank 0
    iterable_valid_loader = self.valid_data_loader
    if self.is_rank_0:
        iterable_valid_loader = tqdm(self.valid_data_loader, desc=f"Validation Epoch {epoch}", mininterval=10)
    with torch.no_grad():
        for batch_idx, (x, data_info_list) in enumerate(iterable_valid_loader):
            x = x.to(self.device, dtype=torch.float32) / 100.0
            print("validation x.shape: ", x.shape)
            ##!
            fake_data = torch.ones_like(x, device=self.device, dtype=torch.float32) / 2  # Create (same) fake data across ranks
            fake_loss = self.SSL.compute_loss(fake_data, data_info_list=data_info_list)
            print(f"Fake loss for validation for rank {deepspeed.comm.get_rank()}: {fake_loss.item()}")  # Print fake loss for debugging
            ##!
```
```
Fake loss for validation for rank 0: 0.8215165138244629
Fake loss for validation for rank 4: 0.8218275308609009
Fake loss for validation for rank 5: 0.8218275308609009
Fake loss for validation for rank 7: 0.8218275308609009
Fake loss for validation for rank 6: 0.8218275308609009
Fake loss for validation for rank 2: 0.8218275308609009
Fake loss for validation for rank 1: 0.8218275308609009
Fake loss for validation for rank 3: 0.8218275308609009
```
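As a follow-up, here is a minimal sketch of how I plan to check whether rank 0’s weights have actually drifted from the other ranks; it assumes deepspeed.comm.all_gather mirrors torch.distributed’s semantics.

```python
import torch
import deepspeed

def compare_weights_across_ranks(model_engine):
    """Print a cheap per-rank parameter checksum so a diverging rank stands out."""
    rank = deepspeed.comm.get_rank()
    world_size = deepspeed.comm.get_world_size()
    device = next(model_engine.module.parameters()).device
    checksum = torch.zeros(1, device=device)
    for param in model_engine.module.parameters():
        checksum += param.detach().float().sum()
    gathered = [torch.zeros_like(checksum) for _ in range(world_size)]
    deepspeed.comm.all_gather(gathered, checksum)
    if rank == 0:
        print("Per-rank parameter checksums:", [t.item() for t in gathered])
```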