I haven’t been able to find any other posts with the same error as mine, so I thought I’d post this here. If it turns out to be a genuine bug, I have no problem opening an issue on GitHub as well.
I have a C++ program that loads a saved model (trained in Python) and then runs as a daemon. It works fine unless it is compiled with AddressSanitizer. The offending issue is a memory leak that the sanitizer attributes to “/pytorch/aten/src/THC/THCGeneral.cpp:50”, which is the GPU device-count check:
THCudaCheck(cudaGetDeviceCount(&numDevices));
Here is the error that crashes the program when it is built with ASAN enabled:
THCudaCheck FAIL file=/pytorch/aten/src/THC/THCGeneral.cpp line=50 error=2 : out of memory
With the above message you’d think that the GPU memory cache just needs to be cleared, but I’ve tried every trick I know to do so (all the way up to rebooting the machine) and none of them have worked.
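For reference, by clearing the cache I mean things along these lines on the Python side (that plus killing any leftover GPU processes, and eventually the reboot):
import torch
torch.cuda.empty_cache()   # release cached blocks held by PyTorch's CUDA caching allocator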
Here is the ASAN message as well:
=================================================================
==12594==ERROR: LeakSanitizer: detected memory leaks
Direct leak of 280 byte(s) in 1 object(s) allocated from:
#0 0x7f61e2649b50 in __interceptor_malloc (/usr/lib/x86_64-linux-gnu/libasan.so.4+0xdeb50)
#1 0x7f61914ffe8f (<unknown module>)
Direct leak of 128 byte(s) in 1 object(s) allocated from:
#0 0x7f61e2649d38 in __interceptor_calloc (/usr/lib/x86_64-linux-gnu/libasan.so.4+0xded38)
#1 0x7f619151e02c (<unknown module>)
Direct leak of 64 byte(s) in 1 object(s) allocated from:
#0 0x7f61e2649d38 in __interceptor_calloc (/usr/lib/x86_64-linux-gnu/libasan.so.4+0xded38)
#1 0x7f619151e091 (<unknown module>)
Direct leak of 40 byte(s) in 1 object(s) allocated from:
#0 0x7f61e2649d38 in __interceptor_calloc (/usr/lib/x86_64-linux-gnu/libasan.so.4+0xded38)
#1 0x7f619b9e2258 in at::cuda::detail::CUDAHooks::initCUDA() const (/home/rtkaratekid/DeepLearning/pytorch/deployment/libtorch/lib/libtorch.so+0x421b258)
#2 0x7fff52cfc1ff (<unknown module>)
SUMMARY: AddressSanitizer: 512 byte(s) leaked in 4 allocation(s).
The code that triggers the error is really simple:
// Inside main(), with <torch/script.h> and <iostream> included.
torch::NoGradGuard no_grad_guard;        // inference only, no autograd
torch::jit::script::Module module;
try {
    // torch::jit::load() is where the out-of-memory error above is raised.
    module = torch::jit::load("traced.pt");
} catch (const std::exception& e) {
    std::cerr << e.what() << std::endl;
    return -1;
}
If it makes a difference, the model is trained in Python with DataParallel, and I save it during a validation step of the training process like so:
m = model.module   # unwrap the DataParallel wrapper
traced = torch.jit.trace(m, example, check_trace=False)
torch.jit.save(traced, "saved_models/test_model.pt")
I could see an argument for moving the model to the CPU before saving it, and then back to the GPU for the next training epoch, so as to avoid this issue altogether when loading it into the C++ program, something along the lines of the sketch below. But I haven’t been able to get that working for some reason.
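(An untested sketch of that idea, using the same model and example objects as in the snippet above; I may well be missing something about how DataParallel or the optimizer state interacts with this.)
m = model.module                  # unwrap the DataParallel wrapper
m.cpu()                           # move parameters/buffers to host memory in place
example_cpu = example.cpu()       # the example input has to move too
traced = torch.jit.trace(m, example_cpu, check_trace=False)
torch.jit.save(traced, "saved_models/test_model.pt")
m.cuda()                          # move back for the next training epoch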
Also, if this is a legitimate bug in libtorch, I thought it would be good to bring it up here.