Training crashes due to insufficient shared memory (shm) when using nn.DataParallel

I ran into the same issue with PyTorch 1.5.1, and setting a new value in /etc/sysctl.conf did not work (the default value of kernel.shmmax, 18446744073692774399, is already large enough).
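If you want to double-check that shmmax is not the limiting factor on your own machine, you can read the current value with:

sysctl kernel.shmmax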

This time I ran the df -h command and found a filesystem mounted at /dev/shm (shm stands for shared memory), whose size defaults to 50% of the machine's memory. I then remounted it with:

mount -o size=yourMemorySize -o nr_inodes=1000000 -o noatime,nodiratime -o remount /dev/shm
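To confirm the remount took effect, the size reported for /dev/shm should now match whatever you passed to size=:

df -h /dev/shm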

This fixed the problem.
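Note that a remount like this does not survive a reboot. If you want the larger size to be permanent, an /etc/fstab entry along these lines should work (64G is just a placeholder for whatever size you choose):

tmpfs /dev/shm tmpfs defaults,size=64G,nr_inodes=1000000,noatime,nodiratime 0 0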

By the way, the OS I used is CentOS 7.
