Hi,
I’m using torch.nn.DataParallel
to do single-node data parallelism, and I'm wondering: how should the DataLoader batch size be scaled?
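For context, here's roughly the shape of my setup (the model and dataset below are just placeholders, not my actual code):

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Dummy model/dataset just to show the structure of the setup
model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10))
model = nn.DataParallel(model).cuda()        # replicates over all visible GPUs

dataset = TensorDataset(torch.randn(1024, 128), torch.randint(0, 10, (1024,)))
loader = DataLoader(dataset, batch_size=64)  # <- is this the per-GPU batch or the global (SGD-level) batch?
```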
I'm asking because I have code that runs fine with batch size 16 on a single T4 GPU, but hits a CUDA OOM with batch size 4 × 16 = 64 (and even with 48!) when using torch.nn.DataParallel
over 4x T4 GPUs. Is torch.nn.DataParallel doing anything weird with memory, so that less memory is available than N × (one GPU's memory)? Or is torch.nn.DataParallel
already applying a scaling rule, so that the DataLoader batch is the per-GPU batch rather than the SGD-level batch? (I don’t think that’s the case, since the doc says it “splits the input across the specified devices by chunking in the batch dimension”.)
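In case it helps, this is the kind of sanity check I was planning to run to see the chunking behavior for myself (assuming 4 visible GPUs; the module is a dummy):

```python
import torch
from torch import nn

class ShapeProbe(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(128, 10)

    def forward(self, x):
        # Each replica prints the chunk of the batch it actually receives
        print(f"device {x.device}: input shape {tuple(x.shape)}")
        return self.linear(x)

model = nn.DataParallel(ShapeProbe()).cuda()
out = model(torch.randn(64, 128))  # with 4 GPUs, I'd expect each replica to print batch 16
```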
Note: I know PyTorch recommends DDP even for single-node data parallelism, but honestly I'm not smart enough to figure out how to use all those torchrun/torch.distributed/launch.py tools, MPI, local_rank things, and I couldn't make DDP work after a week and 7 opened issues.