Shared memory leak on PyTorch 1.0

I'd like to report a weird behaviour in 1.0 that I could only resolve by going back to 0.4. I am training a network on quite a big data set (35 GB) and use 4 GPUs via
torch.nn.DataParallel(model).cuda()
I also use a big batch size (>1000), which makes
torch.multiprocessing.set_sharing_strategy('file_system')
necessary, and I have num_workers=16 in the dataloader. Now the trouble begins: every epoch my /dev/shm grows by roughly 3 GB. At some point it is full and my process crashes. I tried 1.0.0 and 1.0.1, and both show this behaviour. PyTorch 0.4 does not have this problem; /dev/shm never goes above 1 GB.
Is this a bug?
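
For reference, a minimal sketch of the setup described above (the model, the dataset, and the sizes are placeholders, not from the actual training script):

import torch
from torch.utils.data import DataLoader, TensorDataset

# needed because of the large batch size, as described above
torch.multiprocessing.set_sharing_strategy('file_system')

# placeholder model and data, just to make the sketch self-contained
model = torch.nn.DataParallel(torch.nn.Linear(128, 10)).cuda()
dataset = TensorDataset(torch.randn(100_000, 128), torch.randint(0, 10, (100_000,)))
loader = DataLoader(dataset, batch_size=1024, num_workers=16, shuffle=True)

for epoch in range(10):
    for x, y in loader:
        out = model(x.cuda(non_blocking=True))
    # with the file_system strategy, /dev/shm usage keeps growing across epochs here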

Looking at the available sharing strategies, the file_system one is clearly prone to leaks: if the data loader ends up allocating new shared tensors every epoch and they are never cleaned up, that would explain your leak. Did you try the file_descriptor sharing strategy?
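
For example, you can check which strategies are available on your platform and which one is active, and switch back to file_descriptor, with a small snippet like this:

import torch.multiprocessing as mp

print(mp.get_all_sharing_strategies())  # on Linux: {'file_descriptor', 'file_system'}
print(mp.get_sharing_strategy())        # the strategy currently in use
mp.set_sharing_strategy('file_descriptor')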

file_descriptor is the default setting. I did try it, of course, but could not use it for other reasons.

If you can (i.e. you have sudo privileges), increase the file descriptor limit of your system and use the file_descriptor sharing strategy.
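
A rough sketch of what that could look like from Python before training starts (raising the soft limit up to the current hard limit does not need root; raising the hard limit itself, e.g. via /etc/security/limits.conf, does):

import resource
import torch.multiprocessing as mp

# equivalent to running `ulimit -n <hard_limit>` in the shell before launching training
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))

mp.set_sharing_strategy('file_descriptor')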