Hi,
I’m trying to run my code on a SLURM cluster with the following configuration: 1 node, 2 NVIDIA GTX 1080 Ti GPUs, 8 CPUs, and 8 GB of RAM per CPU.
I’m implementing ResNeXt on a dataset of about 1 million 32x32 images. When I run this code with torchvision.datasets.ImageFolder and num_workers = 4-8, it throws an “exceeded virtual memory” error after requesting 341 GB of memory, which seems absurd. The error is raised on the first iteration of the trainLoader loop, while the first batch is being prepared.
Initially I assumed the error was in my program, but the same code runs fine with num_workers = 8 on Google Colab. On the cluster it only works when I set num_workers = 0; with num_workers = 2 it runs for two epochs before throwing the same error. Any solution would really be appreciated.