num_workers > 0 causes a memory error on SLURM?

I’m trying to run my code on a SLURM cluster with the following configuration: 1 node, 2 NVIDIA 1080 Ti GPUs, 8 CPUs, and 8 GB of RAM per CPU.

I’m implementing ResNeXt on a dataset of about 1 million 32x32 images. When I run this code with torchvision.datasets.ImageFolder and num_workers = 4-8, it throws an “exceeded virtual memory” error after requesting 341 GB of memory! That seems a little absurd. The error is thrown on the first iteration of the trainLoader loop, while the first batch is being prepared.
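For context, the setup above can be sketched roughly as follows. This is a minimal stand-in, not the actual training script: the tiny synthetic `FakeImages` dataset, batch size, and worker count are illustrative assumptions (the real code uses `torchvision.datasets.ImageFolder` on a directory of 32x32 images).

```python
import torch
from torch.utils.data import Dataset, DataLoader

# Tiny synthetic stand-in for the real ImageFolder dataset
# (the actual data path and transforms are not shown in the post).
class FakeImages(Dataset):
    def __init__(self, n=64):
        self.n = n

    def __len__(self):
        return self.n

    def __getitem__(self, idx):
        # A 3x32x32 "image" and a dummy label, matching the 32x32 data.
        return torch.zeros(3, 32, 32), 0

# num_workers > 0 spawns that many separate worker processes,
# each loading and collating batches in parallel; this is the
# setting that triggers the virtual-memory error on the cluster.
loader = DataLoader(FakeImages(), batch_size=16, num_workers=2)

for images, labels in loader:
    print(images.shape)  # torch.Size([16, 3, 32, 32])
    break
```

Since each worker is a full OS process, per-job memory use grows with `num_workers`, which is why the same code can behave differently under a SLURM memory limit than on Colab.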

Initially, I assumed this was a bug in my program, but the same program works just fine with num_workers = 8 on Google Colab. On the cluster it only works when I set num_workers = 0; with num_workers = 2 it runs for 2 epochs before throwing the same error. Any solution would really be appreciated.

Are you using DistributedDataParallel or DataParallel here? Your question seems more related to torchvision or the PyTorch dataset/DataLoader than to distributed training, so the ‘distributed’ tag isn’t a good fit. Maybe tag this with ‘vision’?
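To clarify the distinction being asked about, here is a minimal sketch of the two wrappers. The `nn.Linear` model stands in for the actual ResNeXt, and the single-process `gloo` process group on localhost is just an illustrative assumption so DistributedDataParallel can be constructed at all; real multi-GPU jobs would launch one process per GPU.

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn

model = nn.Linear(4, 2)  # stand-in for the actual ResNeXt model

# DataParallel: single process, replicates the model across GPUs on
# each forward pass (falls back to plain execution on CPU).
dp_model = nn.DataParallel(model)

# DistributedDataParallel: one process per GPU; requires an
# initialized process group even for this single-process demo.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)
ddp_model = nn.parallel.DistributedDataParallel(model)

x = torch.randn(8, 4)
print(dp_model(x).shape)   # torch.Size([8, 2])
print(ddp_model(x).shape)  # torch.Size([8, 2])

dist.destroy_process_group()
```

With DistributedDataParallel, each process also creates its own DataLoader workers, so total worker count (and memory) multiplies across processes.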