PyTorch: 1.1.0 (both built from GitHub source and installed from pip)
Python: 3.6.8 and 3.7.1
System: 92 GB RAM, 1 to 4 Tesla V100s.
Issue: Using the default fork start method for torch.multiprocessing, num_workers > 1 for DataLoader, and /proc/sys/vm/overcommit_memory=2 results in a "Cannot allocate memory" error on fork.
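For anyone trying to reproduce the configuration, the relevant kernel settings can be checked with a read-only sketch like this (standard Linux procfs paths, nothing PyTorch-specific):

```python
# Read-only diagnostic: print the kernel's overcommit policy and ratio.
# These are the standard Linux procfs locations.
def read_proc(path):
    with open(path) as f:
        return f.read().strip()

policy = read_proc("/proc/sys/vm/overcommit_memory")  # 0=heuristic, 1=always allow, 2=strict
ratio = read_proc("/proc/sys/vm/overcommit_ratio")    # only consulted when policy is 2
print(f"overcommit_memory = {policy}")
print(f"overcommit_ratio  = {ratio}")
```

On the affected node, the first value comes back as 2 (strict accounting).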
The strange thing is that my dataset could fit into memory 100 times over, and in fact less than 2 GB of memory is in use when my code runs. The issue appeared when I ran the code on a node of a distributed cluster with the hardware above, after it had run perfectly fine on my 32GB/Titan X system. The system administrators confirmed that the overcommit_memory setting was the cause by temporarily disabling strict overcommit for me yesterday.
There are several ways for us to work around it (raising the overcommit ratio, increasing swap, or changing the overcommit_memory value), but I'm still confused about why Python / PyTorch appears to request so much memory that it never actually uses, and whether there is something I can change in my code that would resolve the issue without server-side changes.
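My current understanding (an assumption on my part, not something I've confirmed in the kernel source) is that strict overcommit charges a fork()ed child for the parent's private writable mappings up front, so virtual size (VmSize) matters at fork time even when the resident set (VmRSS) is tiny. The sketch below shows the gap between the two by creating a large private anonymous mapping that is never touched:

```python
# Sketch of the fork/overcommit interaction: a large private anonymous
# mapping that is never written to inflates VmSize while VmRSS stays small.
# Under overcommit_memory=2 the kernel must find commit charge for such
# mappings again when the process forks, which (as I understand it) is why
# os.fork() can fail even though actual usage is tiny.
import mmap

def proc_status_kb(field):
    """Return a numeric field (in kB) from /proc/self/status."""
    with open("/proc/self/status") as f:
        for line in f:
            if line.startswith(field + ":"):
                return int(line.split()[1])
    raise KeyError(field)

# 1 GiB private anonymous mapping, deliberately never touched.
big = mmap.mmap(-1, 1024**3,
                flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)

vmsize_kb = proc_status_kb("VmSize")
vmrss_kb = proc_status_kb("VmRSS")
print(f"VmSize = {vmsize_kb} kB, VmRSS = {vmrss_kb} kB")
```

If that's right, the CUDA runtime and allocator reservations in each DataLoader worker's parent would explain a huge VmSize that nothing ever touches.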
Calling torch.multiprocessing.set_start_method("spawn") (or "forkserver") prevents the overcommit accounting failure / os.fork() crash, but it also slows my code to a crawl (even and especially when running multi-GPU with DataParallel), so I'm assuming the preferred solution will be one that doesn't involve these start methods.
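For completeness, torch.multiprocessing mirrors the standard library's start-method API, so the pattern I'm using can be shown with the stdlib module alone (a minimal sketch, no PyTorch required):

```python
import multiprocessing as mp

# The start methods available on this platform (Linux: fork, spawn, forkserver).
methods = mp.get_all_start_methods()
print(methods)

# A context bound to one method can be used without changing the global
# default set by set_start_method(); torch.multiprocessing accepts the
# same method names.
ctx = mp.get_context("spawn")
print(ctx.get_start_method())
```

With a context in hand, `ctx.Process(target=fn)` behaves like `multiprocessing.Process` but uses the chosen start method. I believe newer PyTorch releases also let DataLoader take a `multiprocessing_context` argument for the same purpose, though I don't think that exists in 1.1.0.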