As far as I understand, in a DDP context, when backward is called, the gradients from every device are gathered and the averaged gradient (averaged over the number of devices) is applied to the model parameters on each device.
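For reference, this is the behaviour I mean (a rough sketch of the effective math only, assuming it boils down to an all-reduce sum divided by the world size; not DDP's actual internals):

import torch.distributed as dist

def average_gradients_like_ddp(model):
    # Effective result of the default reduction: every rank ends up with
    # sum(grad_i) / world_size, i.e. a plain mean over devices.
    world_size = dist.get_world_size()
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad /= world_size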
Due to the nature of my data, each batch produces a different number of training samples. In other words, when I call backward(), the gradients coming from each device are computed from different numbers of samples.
When the gradients are gathered during backward, I want them averaged according to the number of samples (a weighted average), not according to the number of devices.
How can I do this?
Could multiplying the gradients coming from each device by different weights (the per-device sample count, in my case) be a solution?
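Something like the sketch below is what I have in mind, assuming DDP keeps its default mean-over-devices reduction: rescale each rank's loss before backward so that the built-in (1/world_size) * sum becomes sum(n_i * grad_i) / sum(n_i). The name local_n (this rank's sample count) is just an assumed variable, not something from my code.

import torch
import torch.distributed as dist

def rescale_loss_for_weighted_average(loss, local_n):
    # Rough sketch: backpropagating loss * world_size * n_i / sum(n_j) makes DDP's
    # mean over devices equal to a sample-count-weighted average of the gradients.
    world_size = dist.get_world_size()
    total_n = torch.tensor([float(local_n)], device=loss.device)
    dist.all_reduce(total_n, op=dist.ReduceOp.SUM)  # total sample count across all ranks
    return loss * (world_size * local_n / total_n.item())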
Below is my code snippet.
def train_epoch(...):
    for i, batch in enumerate(train_dataloader):
        with torch.cuda.amp.autocast(enabled=args.amp):
            pred = model(...)
            loss = loss_fn(...)
        grad_scaler.scale(loss).backward()
        # gradient accumulation
        if (i + 1) % args.accumulate_grad_batches == 0 or (i + 1) == len(train_dataloader):
            if args.gradient_clip:
                grad_scaler.unscale_(optimizer)
                torch.nn.utils.clip_grad_norm_(model.parameters(), args.gradient_clip)
            grad_scaler.step(optimizer)
            grad_scaler.update()
            optimizer.zero_grad()
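If the rescaling idea above is valid, I guess it would go right before grad_scaler.scale(loss).backward() in this loop. One thing I am unsure about is the interaction with gradient accumulation: rescaling per step would give a weighted average within each micro-batch, but the sample counts from all accumulated micro-batches would probably need to be summed if I want a single weighted average over the whole accumulation window.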