As far as I understand, in a DDP context, when backward is called, the gradients from every device are gathered and the averaged gradient (averaged over the number of devices) is applied to the model parameters on each device.
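For reference, this is the behaviour I mean (a rough sketch of the effective math only, assuming it boils down to an all-reduce sum divided by the world size; not DDP's actual internals):

import torch.distributed as dist

def average_gradients_like_ddp(model):
    # Effective result of the default reduction: every rank ends up with
    # sum(grad_i) / world_size, i.e. a plain mean over devices.
    world_size = dist.get_world_size()
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad /= world_size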
Due to the nature of my data, each batch produces a different number of training samples. In other words, when I call backward(), the gradients coming from each device are computed from different numbers of samples.
When the gradients are gathered during backward, I want them averaged according to the number of samples (a weighted average), not according to the number of devices.
How can I do this?
Could multiplying the gradients coming from each device by different weights (the per-device sample count, in my case) be a solution?
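Something like the sketch below is what I have in mind, assuming DDP keeps its default mean-over-devices reduction: rescale each rank's loss before backward so that the built-in (1/world_size) * sum becomes sum(n_i * grad_i) / sum(n_i). The name local_n (this rank's sample count) is just an assumed variable, not something from my code.

import torch
import torch.distributed as dist

def rescale_loss_for_weighted_average(loss, local_n):
    # Rough sketch: backpropagating loss * world_size * n_i / sum(n_j) makes DDP's
    # mean over devices equal to a sample-count-weighted average of the gradients.
    world_size = dist.get_world_size()
    total_n = torch.tensor([float(local_n)], device=loss.device)
    dist.all_reduce(total_n, op=dist.ReduceOp.SUM)  # total sample count across all ranks
    return loss * (world_size * local_n / total_n.item())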
Below is my code snippet.
def train_epoch(...):
    for i, batch in enumerate(train_dataloader):
        with torch.cuda.amp.autocast(enabled=args.amp):
            pred = model(...)
            loss = loss_fn(...)
        grad_scaler.scale(loss).backward()
        # gradient accumulation
        if (i + 1) % args.accumulate_grad_batches == 0 or (i + 1) == len(train_dataloader):
            if args.gradient_clip:
                grad_scaler.unscale_(optimizer)
                torch.nn.utils.clip_grad_norm_(model.parameters(), args.gradient_clip)
            grad_scaler.step(optimizer)
            grad_scaler.update()
            optimizer.zero_grad()
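If the rescaling idea above is valid, I guess it would go right before grad_scaler.scale(loss).backward() in this loop. One thing I am unsure about is the interaction with gradient accumulation: rescaling per step would give a weighted average within each micro-batch, but the sample counts from all accumulated micro-batches would probably need to be summed if I want a single weighted average over the whole accumulation window.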