Longer training time, smaller batch size, and OOM error with distributed data parallel

Hello! I’m trying to use DDP instead of DP for my model training. However, DDP forces me to either use a smaller batch size or accept a longer training time. I describe a few experiments below and would appreciate some insights:

Longer training time:

  • Training on one node with 8 GPUs: when I use DP with batch size 40, my understanding is that PyTorch splits it and sends batches of size 5 to each of the 8 GPUs. When I use DDP, I specify batch size 5, which is used on each GPU and again sums up to 40. However, the training time with DDP is longer than with DP.
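
For what it’s worth, here is a minimal sketch of the two batch-size conventions described above; the linear model, dummy dataset, and local_rank variable are placeholders rather than my actual training setup:

import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

dataset = TensorDataset(torch.randn(400, 10))  # dummy data, only for illustration
model = torch.nn.Linear(10, 2)                 # stand-in for the real network

# DataParallel: the DataLoader batch size is the global batch size; DP splits it
# along dim 0, so batch_size=40 becomes 5 samples per GPU on 8 GPUs.
dp_loader = DataLoader(dataset, batch_size=40, shuffle=True)
dp_model = torch.nn.DataParallel(model.cuda())

# DistributedDataParallel: each process builds its own DataLoader, so the batch
# size is per process/GPU; batch_size=5 across 8 processes is 40 samples per step.
# (Assumes init_process_group has been called and local_rank is this process's GPU.)
ddp_loader = DataLoader(dataset, batch_size=5, sampler=DistributedSampler(dataset))
ddp_model = torch.nn.parallel.DistributedDataParallel(model.cuda(local_rank),
                                                      device_ids=[local_rank])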

Smaller Batch Size:

  • Training on two/three nodes with 8 GPUs each, i.e. 16/24 GPUs in total: in this setting I get a CUDA out-of-memory error even with batch size 1. The model is UNET-VGG16 and each GPU has 16 GB of memory. My assumption is that this is caused by the AllReduce step, and that is why going from 8 GPUs in the previous experiment to a multi-node setting causes the OOM error: each GPU would need to hold 16/24 copies of the model parameters for synchronization. I even tried gradient_as_bucket_view=True, but it didn’t fix the issue. My question is: if my reasoning is true, then DDP is not helpful, since multi-node training is not possible even with the small UNET model. But if my reasoning is wrong, I would appreciate some help fixing my issue with DDP.
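
To put rough numbers on that assumption, I estimated the parameter and gradient memory of a single model replica with a quick script (torchvision’s vgg16 is only a stand-in here for my UNET-VGG16):

import torch
import torchvision

# Rough estimate of one replica's parameter + gradient footprint.
model = torchvision.models.vgg16()
param_bytes = sum(p.numel() * p.element_size() for p in model.parameters())
print(f"params: {param_bytes / 1024**2:.1f} MiB, "
      f"params + grads: {2 * param_bytes / 1024**2:.1f} MiB")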

Yes, DataParallel will split the batch in dim0 and send the chunks to each device. It will introduce additional overheads via the model copies and should be slower than DDP. Could you share a minimal, executable code snippet showing that DDP is slower than DP, please?

I guess you are accidentally creating CUDA contexts on the default device and it’s running out of memory. Try to use torch.cuda.set_device or mask the devices via CUDA_VISIBLE_DEVICES.
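
Something like this at the very top of each spawned worker is usually enough (the worker signature below is just a placeholder for your own entry point):

import torch

def worker(local_rank, cfg):
    # Pin this process to its own GPU before any CUDA call, so the CUDA context,
    # the NCCL communicator, and all tensors land on cuda:<local_rank> instead of
    # the default cuda:0.
    torch.cuda.set_device(local_rank)
    ...
    # Sanity check: memory should only grow on this process's device.
    for d in range(torch.cuda.device_count()):
        print(f"rank {local_rank}: cuda:{d} -> "
              f"{torch.cuda.memory_allocated(d) / 1024**2:.1f} MiB allocated")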

Thanks for your reply.

Yes, DataParallel will split the batch in dim0 and send the chunks to each device. It will introduce additional overheads via the model copies and should be slower than DDP. Could you share a minimal, executable code snippet showing that DDP is slower than DP, please?

I have the following code, which runs slowly on my data. I haven’t tested it on a small sample dataset, so I’m sorry that my code is not executable.

def main(gpu, cfg):
    """
    The gpu from the main function's argument goes from 0 to world_size-1. Therefore, for the local
    rank,  for one node I used cfg.gpu = gpu, but for multi node I use cfg.gpu = gpu // len(cfg.gpus)
    to create gpu local number from 0 to 7 per node as local rank.
    """
    cfg.gpu = gpu # local rank in the case of one node training
    cfg.gpu = gpu // len(cfg.gpus) # local rank in the case of multi node training
    cfg.rank = gpu # global rank

    torch.distributed.init_process_group(backend='nccl',
                                         init_method=cfg.dist_url,
                                         world_size=cfg.world_size,
                                         rank=cfg.rank)

    torch.cuda.set_device(cfg.gpu)
    torch.distributed.barrier()

    train_dataset = getDataset('train')
    valid_dataset = getDataset('valid')

    train_sampler = torch.utils.data.distributed.DistributedSampler(train_dataset,
                                                                    num_replicas=cfg.world_size,
                                                                    rank=cfg.rank,
                                                                    shuffle=False,
                                                                    drop_last=True)

    train_dataloader = torch.utils.data.DataLoader(train_dataset,
                                                   collate_fn=my_collate,
                                                   worker_init_fn=worker_init_fn,
                                                   sampler=train_sampler)

    valid_dataloader = torch.utils.data.DataLoader(valid_dataset,
                                                   collate_fn=my_collate,
                                                   sampler=None)

    model = Net()
    model.cuda(cfg.gpu)

    model = torch.nn.parallel.DistributedDataParallel(model,
                                                      device_ids=[cfg.gpu],
                                                      output_device=cfg.gpu,
                                                      gradient_as_bucket_view=True)

    # placeholders for the actual optimizer / loss / scheduler setup
    optimizer = optimizer()
    criterion = criterion().cuda(cfg.gpu)
    scheduler = scheduler()

    device = torch.device("cuda:{}".format(cfg.gpu))
    runner = Runner(model,
                    device,
                    optimizer,
                    criterion,
                    scheduler,
                    train_dataloader,
                    valid_dataloader)

    runner.train()

    torch.distributed.destroy_process_group()
 
if __name__ == "__main__":
    cfg = getConfig()
    os.environ["MASTER_ADDR"] = hostname
    os.environ["MASTER_PORT"] = str(portNumber)
    os.environ["CUDA_VISIBLE_DEVICES"] = ",".join(map(str, cfg.gpus))  # cfg.gpus=[0,1,2,3,4,5,6,7] per node
    cfg.dist_url = f'env://{hostname}:{portNumber}'
    torch.multiprocessing.spawn(main, nprocs=cfg.world_size, args=(cfg,))

I guess you are accidentally creating CUDA contexts on the default device and it’s running out of memory. Try to use torch.cuda.set_device or mask the devices via CUDA_VISIBLE_DEVICES.

I am using both of these (shown in the code above). I don’t know what else I should do. I’m stuck.

@ptrblck Any suggestion would be great. Thank you!

I can’t really debug the code since it’s not executable, but also note that:

cfg.gpu = gpu // len(cfg.gpus)

will use cuda:0 regardless of which value gpu has (assuming it’s in the range of all available GPUs),
so torch.cuda.set_device(cfg.gpu) would select the default device, which fits my suspicion from my previous post.
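
If the spawn index really runs over 0 to world_size-1, a sketch of one possible fix (assuming len(cfg.gpus) GPUs per node) would be to derive the local rank with a modulo instead of an integer division:

ngpus_per_node = len(cfg.gpus)     # e.g. 8 GPUs per node
cfg.rank = gpu                     # global rank: 0 .. world_size - 1
cfg.gpu = gpu % ngpus_per_node     # local rank: 0 .. ngpus_per_node - 1 on each node
torch.cuda.set_device(cfg.gpu)     # every process now gets its own device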