Potential memory leak

We have recently been tracking a memory leak on our machine. I created a very simple script that just does this:

import sys
import torch

# the flag comes from the command line: `python test.py cpu` or `python test.py cuda`
cuda = (sys.argv[1] == 'cuda')
a = []
for i in range(100):
    t = torch.ones(10, 1000, 1000)
    if cuda:
        t = t.cuda()
    a.append(t)

and I use a flag to control whether .cuda() is called or not. It seems that the CUDA path is causing some kind of memory leak. In the data below, used memory starts at 4.8G, but that memory is dead in the sense that it does not show up as belonging to any process in top. After running the CUDA version of the loop a few times, roughly 400M more memory is dead. Does anyone have any idea what is happening? I am using Red Hat 4.4.2, Python 2.7.13 with the newest PyTorch, and CUDA 7.5.17.

test$ free -h
             total       used       free     shared    buffers     cached
Mem:           23G       6.0G        17G       9.1M       360M       919M
-/+ buffers/cache:       4.8G        18G
Swap:          82G         0B        82G
test$ python test.py cpu
test$ python test.py cpu
test$ python test.py cpu
test$ python test.py cpu
test$ python test.py cpu
test$ free -h
             total       used       free     shared    buffers     cached
Mem:           23G       6.0G        17G       9.1M       360M       919M
-/+ buffers/cache:       4.8G        18G
Swap:          82G         0B        82G
test$ python test.py cuda
test$ python test.py cuda
test$ python test.py cuda
test$ python test.py cuda
test$ python test.py cuda
test$ free -h
             total       used       free     shared    buffers     cached
Mem:           23G       6.4G        17G       9.1M       360M       919M
-/+ buffers/cache:       5.2G        18G
Swap:          82G         0B        82G

Please understand that the kernel does not always free memory unless it thinks it needs to; it might be caching some pages for various reasons. This is not a memory leak unless you literally cannot reclaim the memory (i.e., if you try to allocate 18G + 0.4G and the allocation fails).
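
If you want to actually check that, here is a minimal sketch of that reclaim test (assumptions: the 18.4G figure is just read off the free output above, so adjust it for your machine, and depending on the kernel's overcommit settings the process may get OOM-killed instead of seeing a MemoryError):

gib = 1024 ** 3
try:
    # bytearray(n) zero-fills the buffer, so every page is touched and the
    # kernel has to hand over physical memory, not just address space.
    buf = bytearray(int(18.4 * gib))
    print('allocation succeeded: the extra "used" memory was reclaimable (cache, not a leak)')
except MemoryError:
    print('allocation failed: that memory really is unavailable')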

I do understand that. However, this used memory is never reclaimed by the kernel, even when the machine runs out of memory and starts swapping while a memory-intensive program is running.

I see. If you have actually tested memory allocations up to the system limits and they fail, maybe it is a CUDA bug. At this point the best thing to try is probably upgrading your CUDA version; I can't think of anything else.
My previous answer came from seeing this behavior on some of my systems and then realizing that the kernel was just caching some pages.
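
After the upgrade, a quick way to confirm which CUDA build PyTorch is actually picking up is something like the snippet below (note: torch.version.cuda is only exposed in newer PyTorch releases, hence the getattr; nvcc --version and nvidia-smi will show the toolkit and driver versions on the shell side):

import torch

print(torch.__version__)                              # PyTorch version
print(getattr(torch.version, 'cuda', 'not exposed'))  # CUDA version PyTorch was built against, if available
print(torch.cuda.is_available())                      # whether PyTorch can see the GPU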

OK. Thank you for your help. I will upgrade CUDA and test this again.

@Lifeng_Jin - did the CUDA upgrade work in your case?

It did. We ran a very trivial test creating a lot of tensors either on the GPU or on the CPU and found that it was definitely the GPU path. The upgrade solved it.