ATen CUDA Memory Leak When Loading Model

I haven’t been able to find any other posts with the same error as mine, so I thought I’d post this. If it turns out to be a genuine bug, I have no problem filing it on GitHub as well.

I have a C++ program that loads a saved model (trained in Python) and then runs as a daemon. It works fine unless it is compiled with the address sanitizer. The offending issue is a memory leak that the sanitizer reports around “/pytorch/aten/src/THC/THCGeneral.cpp:50”, which is the GPU device check:

THCudaCheck(cudaGetDeviceCount(&numDevices));

Here is the error that crashes the program when it is compiled with ASAN on:

THCudaCheck FAIL file=/pytorch/aten/src/THC/THCGeneral.cpp line=50 error=2 : out of memory

Given the above message, you’d think the GPU memory cache just needs to be cleared, but I’ve tried every trick I know to do so (up to and including rebooting the machine) and none of them has worked.
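To give a sense of what I mean by clearing the cache: one of the usual tricks is to ask libtorch’s CUDA caching allocator to release its unused blocks, roughly like this (a sketch, assuming the CUDA build of libtorch; this only returns memory that libtorch itself has cached, not memory held by other processes):

#include <c10/cuda/CUDACachingAllocator.h>

int main() {
  // Release cached, unused GPU blocks held by libtorch's caching allocator
  // back to the CUDA driver. It cannot reclaim memory that belongs to
  // other processes.
  c10::cuda::CUDACachingAllocator::emptyCache();
  return 0;
}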

Here is the ASAN message as well:

=================================================================
==12594==ERROR: LeakSanitizer: detected memory leaks

Direct leak of 280 byte(s) in 1 object(s) allocated from:
    #0 0x7f61e2649b50 in __interceptor_malloc (/usr/lib/x86_64-linux-gnu/libasan.so.4+0xdeb50)
    #1 0x7f61914ffe8f  (<unknown module>)

Direct leak of 128 byte(s) in 1 object(s) allocated from:
    #0 0x7f61e2649d38 in __interceptor_calloc (/usr/lib/x86_64-linux-gnu/libasan.so.4+0xded38)
    #1 0x7f619151e02c  (<unknown module>)

Direct leak of 64 byte(s) in 1 object(s) allocated from:
    #0 0x7f61e2649d38 in __interceptor_calloc (/usr/lib/x86_64-linux-gnu/libasan.so.4+0xded38)
    #1 0x7f619151e091  (<unknown module>)

Direct leak of 40 byte(s) in 1 object(s) allocated from:
    #0 0x7f61e2649d38 in __interceptor_calloc (/usr/lib/x86_64-linux-gnu/libasan.so.4+0xded38)
    #1 0x7f619b9e2258 in at::cuda::detail::CUDAHooks::initCUDA() const (/home/rtkaratekid/DeepLearning/pytorch/deployment/libtorch/lib/libtorch.so+0x421b258)
    #2 0x7fff52cfc1ff  (<unknown module>)

SUMMARY: AddressSanitizer: 512 byte(s) leaked in 4 allocation(s).

The code that creates the error is really simple:

#include <torch/script.h> // torch::jit::load, torch::jit::script::Module
#include <torch/torch.h>  // torch::NoGradGuard
#include <iostream>

torch::NoGradGuard no_grad_guard;
torch::jit::script::Module module;

try {
    module = torch::jit::load("traced.pt");
} catch (const std::exception& e) {
    std::cerr << e.what() << std::endl;
    return -1;
}

If it makes a difference, the model is trained in Python with DataParallel, and I save it during a validation step of the training process like so:

m = model.module  # unwrap the underlying model from the DataParallel wrapper
traced = torch.jit.trace(m, example, check_trace=False)
torch.jit.save(traced, "saved_models/test_model.pt")

I could see an argument for moving the model to the CPU when saving it and then back to the GPU for the next training epoch, so that the issue is avoided altogether when loading it in C++. But I haven’t been able to get that working for some reason.

Also, if this is a legitimate bug in libtorch, I thought it would be good to bring it up 🙂

Hi,

  • Is your C++ program working when it’s not compiled with ASAN?
  • Are you sure someone else is not using the GPU and already allocating almost all of its memory? The CUDA runtime for PyTorch needs a few hundred MB to initialize on the first CUDA call; you can double-check the free memory with the sketch below.
  • I’m not sure PyTorch can work with ASAN. cc @ezyang?
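
A quick way to verify the second point is to query free versus total device memory before loading the model; a minimal sketch using the CUDA runtime API (assuming the program already links against the CUDA runtime):

#include <cuda_runtime.h>
#include <iostream>

int main() {
  // Query free and total memory on the current GPU. If "free" is already
  // close to zero, another process holds the memory and libtorch's first
  // CUDA call will fail with an out-of-memory error.
  size_t free_bytes = 0, total_bytes = 0;
  cudaError_t err = cudaMemGetInfo(&free_bytes, &total_bytes);
  if (err != cudaSuccess) {
    std::cerr << "cudaMemGetInfo failed: " << cudaGetErrorString(err) << std::endl;
    return -1;
  }
  std::cout << "free:  " << free_bytes / (1024 * 1024) << " MiB\n"
            << "total: " << total_bytes / (1024 * 1024) << " MiB" << std::endl;
  return 0;
}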

@albanD Thanks for the quick reply! The program is working when not compiled with ASAN, yes. As for the GPUs, I am the only person with access to my box, and I always keep the NVIDIA monitor open to make sure I’m not overheating them or overusing their memory.

I’d be really interested to know if pytorch can’t run with ASAN on. That would be a bummer because I really like to compile with different flags and sanitizers just to get a feel for the health of my program. Of course that’s not how I compile for production, just for testing.

I changed how I save the model in Python so that it is saved in a CPU state; that way it isn’t GPU-dependent when loaded in C++.

m = model.module
m.to(cpu_device)  # move to cpu
traced = torch.jit.trace(m, example, check_trace=False)
torch.jit.save(traced, "saved_models/test_model.pt") 
m.to(device) # move back to gpu

This doesn’t seem to affect the training of my model, and it does serve as a workaround for the ASAN issue. However, I still think the memory leak is worth looking into if it wasn’t previously known.
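
For completeness, here is roughly what the C++ side looks like now that the model is saved in a CPU state. This is a sketch rather than my exact code: the file name is a placeholder, and moving the module to the GPU afterwards is optional and only something I would do in the non-ASAN build:

#include <torch/script.h>
#include <torch/torch.h>
#include <iostream>

int main() {
  torch::NoGradGuard no_grad_guard;
  torch::jit::script::Module module;

  try {
    // The traced model was saved in a CPU state, so loading it no longer
    // requires CUDA to be initialized.
    module = torch::jit::load("traced.pt");
  } catch (const std::exception& e) {
    std::cerr << e.what() << std::endl;
    return -1;
  }

  // Only in the non-ASAN (production) build: optionally move the model to
  // the GPU. This is the first place CUDA gets touched.
  if (torch::cuda::is_available()) {
    module.to(torch::kCUDA);
  }
  return 0;
}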

Interesting, I’m sure ezyang will know whether it was already known or not.
Thanks for the report!


ASAN doesn’t work with CUDA; it’s a pretty well-known problem. As you note, if you can just use CPU-only functionality, you’ll be fine.


@ezyang good to know, thanks for the reply
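
In case it helps anyone else, one way to act on that advice is to gate the GPU path behind a compile-time check, roughly like this (a sketch; BUILT_WITH_ASAN and use_gpu are just names I made up. GCC defines __SANITIZE_ADDRESS__ under -fsanitize=address, and Clang exposes __has_feature(address_sanitizer)):

// Detect an ASAN build at compile time so the daemon stays CPU-only there.
#if defined(__SANITIZE_ADDRESS__)
#  define BUILT_WITH_ASAN 1
#elif defined(__has_feature)
#  if __has_feature(address_sanitizer)
#    define BUILT_WITH_ASAN 1
#  endif
#endif

#ifndef BUILT_WITH_ASAN
#  define BUILT_WITH_ASAN 0
#endif

#include <torch/torch.h>

// Touch CUDA only in non-ASAN builds; the ASAN build sticks to the CPU.
bool use_gpu() {
#if BUILT_WITH_ASAN
  return false;
#else
  return torch::cuda::is_available();
#endif
}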