Shared memory leak on PyTorch 1.0

I'd like to report a weird behaviour in 1.0 that I could only resolve by going back to 0.4. I am training a network on quite a big data set (35 GB) and use 4 GPUs via
torch.nn.DataParallel(model).cuda()
I also use a big batch size (>1000), which makes
torch.multiprocessing.set_sharing_strategy('file_system')
necessary, and I have num_workers=16 in the dataloader. Now the trouble begins: every epoch my /dev/shm grows by roughly 3 GB. At some point it is full and my process crashes. I tried 1.0.0 and 1.0.1, and both show this behaviour. PyTorch 0.4 does not have this problem; /dev/shm never goes above 1 GB.
Is this a bug?
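
For reference, a minimal sketch of the setup described above (the model, the dataset, and the sizes are placeholders, not from the actual training script):

import torch
from torch.utils.data import DataLoader, TensorDataset

# needed because of the large batch size, as described above
torch.multiprocessing.set_sharing_strategy('file_system')

# placeholder model and data, just to make the sketch self-contained
model = torch.nn.DataParallel(torch.nn.Linear(128, 10)).cuda()
dataset = TensorDataset(torch.randn(100_000, 128), torch.randint(0, 10, (100_000,)))
loader = DataLoader(dataset, batch_size=1024, num_workers=16, shuffle=True)

for epoch in range(10):
    for x, y in loader:
        out = model(x.cuda(non_blocking=True))
    # with the file_system strategy, /dev/shm usage keeps growing across epochs here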

Looking at the available sharing strategies, the file_system one is clearly prone to leaks: if the data loader ends up allocating new shared tensors every epoch and they are never cleaned up, that would explain your leak. Did you try the file_descriptor sharing strategy?
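
For example, you can check which strategies are available on your platform and which one is active, and switch back to file_descriptor, with a small snippet like this:

import torch.multiprocessing as mp

print(mp.get_all_sharing_strategies())  # on Linux: {'file_descriptor', 'file_system'}
print(mp.get_sharing_strategy())        # the strategy currently in use
mp.set_sharing_strategy('file_descriptor')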

file_descriptor is the default setting. I did try it, of course, but could not use it for other reasons.

If you can (i.e. you have sudo privileges), increase the file descriptor limit of your system and use the file_descriptor sharing strategy.
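
A rough sketch of what that could look like from Python before training starts (raising the soft limit up to the current hard limit does not need root; raising the hard limit itself, e.g. via /etc/security/limits.conf, does):

import resource
import torch.multiprocessing as mp

# equivalent to running `ulimit -n <hard_limit>` in the shell before launching training
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))

mp.set_sharing_strategy('file_descriptor')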