num_workers > 0 causes a memory error on SLURM?

I’m trying to run my code on a SLURM cluster with the following configuration: 1 node, 2 NVIDIA 1080 Ti GPUs, 8 CPUs, and 8 GB of RAM per CPU.

I’m implementing ResNeXt on a dataset of about 1 million 32x32 images. When I run this code with torchvision.datasets.ImageFolder and num_workers = 4-8, it throws an “exceeded virtual memory” error after requesting 341 GB of memory! That seems a little absurd. The error is thrown on the first iteration of the trainLoader loop, while the first batch is being prepared.
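For context, the setup above can be sketched roughly as follows. This is a minimal stand-in, not the actual training script: the tiny synthetic `FakeImages` dataset, batch size, and worker count are illustrative assumptions (the real code uses `torchvision.datasets.ImageFolder` on a directory of 32x32 images).

```python
import torch
from torch.utils.data import Dataset, DataLoader

# Tiny synthetic stand-in for the real ImageFolder dataset
# (the actual data path and transforms are not shown in the post).
class FakeImages(Dataset):
    def __init__(self, n=64):
        self.n = n

    def __len__(self):
        return self.n

    def __getitem__(self, idx):
        # A 3x32x32 "image" and a dummy label, matching the 32x32 data.
        return torch.zeros(3, 32, 32), 0

# num_workers > 0 spawns that many separate worker processes,
# each loading and collating batches in parallel; this is the
# setting that triggers the virtual-memory error on the cluster.
loader = DataLoader(FakeImages(), batch_size=16, num_workers=2)

for images, labels in loader:
    print(images.shape)  # torch.Size([16, 3, 32, 32])
    break
```

Since each worker is a full OS process, per-job memory use grows with `num_workers`, which is why the same code can behave differently under a SLURM memory limit than on Colab.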

Initially, I assumed this was a bug in my program, but the same program works just fine with num_workers = 8 on Google Colab. On the cluster it only works when I set num_workers = 0; with num_workers = 2 it runs for 2 epochs before throwing the same error. Any solution would really be appreciated.

Are you using DistributedDataParallel or DataParallel here? Your question seems more related to torchvision or the PyTorch dataset/DataLoader than to distributed training, so the ‘distributed’ tag isn’t a good fit. Maybe tag this with ‘vision’?
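To clarify the distinction being asked about, here is a minimal sketch of the two wrappers. The `nn.Linear` model stands in for the actual ResNeXt, and the single-process `gloo` process group on localhost is just an illustrative assumption so DistributedDataParallel can be constructed at all; real multi-GPU jobs would launch one process per GPU.

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn

model = nn.Linear(4, 2)  # stand-in for the actual ResNeXt model

# DataParallel: single process, replicates the model across GPUs on
# each forward pass (falls back to plain execution on CPU).
dp_model = nn.DataParallel(model)

# DistributedDataParallel: one process per GPU; requires an
# initialized process group even for this single-process demo.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)
ddp_model = nn.parallel.DistributedDataParallel(model)

x = torch.randn(8, 4)
print(dp_model(x).shape)   # torch.Size([8, 2])
print(ddp_model(x).shape)  # torch.Size([8, 2])

dist.destroy_process_group()
```

With DistributedDataParallel, each process also creates its own DataLoader workers, so total worker count (and memory) multiplies across processes.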