As far as I understand, in a DDP context, when backward is called, the gradients from every device are gathered and the averaged gradient (averaged over the number of devices) is applied to the model parameters on each device.
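For reference, this is the behaviour I mean (a rough sketch of the effective math only, assuming it boils down to an all-reduce sum divided by the world size; not DDP's actual internals):

import torch.distributed as dist

def average_gradients_like_ddp(model):
    # Effective result of the default reduction: every rank ends up with
    # sum(grad_i) / world_size, i.e. a plain mean over devices.
    world_size = dist.get_world_size()
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad /= world_size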
Due to the nature of my data, each batch produces a different number of training samples. In other words, when I call backward(), the gradients coming from each device are computed from different numbers of samples.
When the gradients are gathered during backward, I want them averaged according to the number of samples (a weighted average), not according to the number of devices.
How can I do this?
Could multiplying the gradients coming from each device by different weights (the per-device sample count, in my case) be a solution?
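Something like the sketch below is what I have in mind, assuming DDP keeps its default mean-over-devices reduction: rescale each rank's loss before backward so that the built-in (1/world_size) * sum becomes sum(n_i * grad_i) / sum(n_i). The name local_n (this rank's sample count) is just an assumed variable, not something from my code.

import torch
import torch.distributed as dist

def rescale_loss_for_weighted_average(loss, local_n):
    # Rough sketch: backpropagating loss * world_size * n_i / sum(n_j) makes DDP's
    # mean over devices equal to a sample-count-weighted average of the gradients.
    world_size = dist.get_world_size()
    total_n = torch.tensor([float(local_n)], device=loss.device)
    dist.all_reduce(total_n, op=dist.ReduceOp.SUM)  # total sample count across all ranks
    return loss * (world_size * local_n / total_n.item())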
Below is my code snippet.
def train_epoch(...):
    for i, batch in enumerate(train_dataloader):
        with torch.cuda.amp.autocast(enabled=args.amp):
            pred = model(...)
            loss = loss_fn(...)
        grad_scaler.scale(loss).backward()
        # gradient accumulation
        if (i + 1) % args.accumulate_grad_batches == 0 or (i + 1) == len(train_dataloader):
            if args.gradient_clip:
                grad_scaler.unscale_(optimizer)
                torch.nn.utils.clip_grad_norm_(model.parameters(), args.gradient_clip)
            grad_scaler.step(optimizer)
            grad_scaler.update()
            optimizer.zero_grad()
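If the rescaling idea above is valid, I guess it would go right before grad_scaler.scale(loss).backward() in this loop. One thing I am unsure about is the interaction with gradient accumulation: rescaling per step would give a weighted average within each micro-batch, but the sample counts from all accumulated micro-batches would probably need to be summed if I want a single weighted average over the whole accumulation window.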