We have recently been tracking down a memory leak on one of our machines. I wrote a very simple script that does only the following:
import sys
import torch

# 'cuda' flag taken from the command line (python test.py cuda)
cuda = len(sys.argv) > 1 and sys.argv[1] == 'cuda'

a = []
for i in range(100):
    t = torch.ones(10, 1000, 1000)
    if cuda:
        t = t.cuda()
    a.append(t)
The cpu/cuda command-line flag controls whether .cuda() is called, and the CUDA path appears to cause some kind of memory leak. In the data below, used memory (the -/+ buffers/cache figure) starts at 4.8G, and that memory is already dead: it does not show up as used by any process in top. Five runs of the CPU version change nothing, but after five runs of the CUDA version roughly 400M more memory is dead. Does anyone have an idea what is happening? I am using RedHat 4.4.2, Python 2.7.13 with the newest PyTorch, and CUDA 7.5.17.
test$ free -h
             total       used       free     shared    buffers     cached
Mem:           23G       6.0G        17G       9.1M       360M       919M
-/+ buffers/cache:       4.8G        18G
Swap:          82G         0B        82G
test$ python test.py cpu
test$ python test.py cpu
test$ python test.py cpu
test$ python test.py cpu
test$ python test.py cpu
test$ free -h
             total       used       free     shared    buffers     cached
Mem:           23G       6.0G        17G       9.1M       360M       919M
-/+ buffers/cache:       4.8G        18G
Swap:          82G         0B        82G
test$ python test.py cuda
test$ python test.py cuda
test$ python test.py cuda
test$ python test.py cuda
test$ python test.py cuda
test$ free -h
             total       used       free     shared    buffers     cached
Mem:           23G       6.4G        17G       9.1M       360M       919M
-/+ buffers/cache:       5.2G        18G
Swap:          82G         0B        82G
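
In case it helps with reproducing the per-process check, here is a minimal sketch of how the top observation could be automated. It assumes the psutil package (not part of my original test) and samples the test process's resident set size around the allocation loop:

import os
import psutil
import torch

proc = psutil.Process(os.getpid())
print('RSS at start: %d MB' % (proc.memory_info().rss // 2**20))

# same allocation pattern as test.py, CUDA path only
a = []
for i in range(100):
    a.append(torch.ones(10, 1000, 1000).cuda())

print('RSS at end:   %d MB' % (proc.memory_info().rss // 2**20))

The puzzle is exactly that no per-process number accounts for the ~400M of used memory that remains after the script exits.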