LibTorch crash on Windows 10

I want to use a trained model in C++ with LibTorch.

However, it fails when I try to load a GPU model:

using torch::jit::script::Module;
Module module = torch::jit::load("resnetGPU.pt", torch::kCUDA);

The GPU model was trained on Linux with PyTorch 1.9.0+cu111, and the LibTorch version is libtorch-win-shared-with-deps-debug-1.9.0+cu111.
The C++ project is built on Windows 10 with VS2022.

The code to generate the model is:

import torch
import torchvision.models as models
model = models.resnet50(pretrained=True).cuda()
model = model.eval()
resnet = torch.jit.trace(model, torch.rand(1, 3, 224, 224).cuda())
resnet.save('resnetGPU.pt')

Any suggestions are appreciated!

It’s a little hard to tell what’s going on without the error logs. Could you copy/paste the error message here?

My best guess is that the cuda/torch DLL files aren’t getting found when the executable runs. Make sure they’re on your system path.

@JosephC Thank you very much for your reply.

It only shows a crash message like the one in the attached screenshot, with no further details.

How can I check whether the CUDA/Torch DLLs have been found?

I build the project with CMake, and I locate CUDA in the CMakeLists with:

find_package(CUDA REQUIRED)

And CMake does find CUDA and sets CUDA_TOOLKIT_ROOT_DIR.

Do you have any other suggestions?

Some deeper digging seems to suggest that this might be related to this issue: Libtorch: Segmentation fault when running torch::jit::load · Issue #49460 · pytorch/pytorch · GitHub

If we don’t want to assume that’s the source of the problem, here are a few guesses:

CMake and find_package only locate the shared libraries at build time; at run time, PyTorch + CUDA also need to be able to find the DLL files. I can’t think of a simple way I would recommend. You could copy the DLLs into the directory with the executable, but they’re huge.
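If disk space is acceptable, the copying can be automated. A sketch, assuming the project uses `find_package(Torch REQUIRED)` (which defines `TORCH_INSTALL_PREFIX`) and `example-app` is a placeholder target name:

```cmake
# Copy the LibTorch DLLs next to the executable after every build,
# so the Windows loader can find them at run time.
if (MSVC)
  file(GLOB TORCH_DLLS "${TORCH_INSTALL_PREFIX}/lib/*.dll")
  add_custom_command(TARGET example-app
                     POST_BUILD
                     COMMAND ${CMAKE_COMMAND} -E copy_if_different
                             ${TORCH_DLLS}
                             $<TARGET_FILE_DIR:example-app>)
endif (MSVC)
```

Note that this only covers the LibTorch DLLs; the CUDA runtime DLLs still need to be reachable via PATH or copied separately.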

Another thing to check would be if you’re building in DEBUG or RELEASE mode. If you’re building in release mode but have debug DLLs from PyTorch installed, that can cause issues. I ran into that once.
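When in doubt, a small compile-time check can confirm which configuration a translation unit was actually built in. A sketch relying on the MSVC convention that `_DEBUG` is defined for debug-CRT builds:

```cpp
#include <string>

// MSVC defines _DEBUG when compiling against the debug CRT (/MDd, /MTd)
// and leaves it undefined otherwise. GCC/Clang follow this convention
// only if -D_DEBUG is passed explicitly.
std::string build_config() {
#ifdef _DEBUG
    return "Debug";
#else
    return "Release";
#endif
}
```

If this reports Debug but the LibTorch package in use is the release build (or vice versa), mixing the two C runtimes can crash as soon as the DLLs are loaded.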

I have copied all the DLLs in C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v11.1 to the directory with the executable, but it still crashes.

I am sure I built the project in DEBUG, and I set Torch_DIR to the debug version of LibTorch.

The model was generated on Linux, and my C++ project is built on Windows 10. Could this cause the crash?

Not clear. As an aside, are you building with MinGW or MSVC? The libraries they produce are not compatible with each other. I don’t know if that’s a problem with DLLs, but if there are statically linked libraries getting used, that might be part of it.
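One cheap thing to rule out first: a TorchScript .pt file is a ZIP archive, so if the transfer from Linux mangled it (e.g. an FTP client in text mode), torch::jit::load can fail in odd ways. A minimal sketch to check the magic bytes (`looks_like_zip` is just an illustrative helper, not a LibTorch API):

```cpp
#include <fstream>
#include <string>

// A TorchScript archive is a ZIP file, so it must start with the
// two magic bytes "PK". Anything else suggests the file was
// corrupted in transit.
bool looks_like_zip(const std::string& filename) {
    std::ifstream f(filename, std::ios::binary);
    char magic[2] = {0, 0};
    f.read(magic, 2);
    return f.good() && magic[0] == 'P' && magic[1] == 'K';
}
```

Comparing the file size and a checksum (e.g. md5) on both machines is an even simpler version of the same test.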

What C++ debugging tools do you have available? It would be nice to step through the code or at least capture more than just an abort().

I am using VS2022 to build and debug the project. Since I have only the compiled libraries of LibTorch, I cannot step into them.

Please have a look at the figure above. Would it help to locate the error?

In addition, I found another crash:

torch::Tensor tensor = torch::rand({5, 3}, torch::kCUDA);

It also crashes, but torch::rand({5, 3}) works.

I don’t know what kind of naming the .dlls use on Windows, but wouldn’t torch_cpu.dll indicate you are using a CPU-only build?

The torch_cpu.dll also confuses me, but I am sure I have provided the correct Torch_DIR:

Is this Torch_DIR correct?

Is there any other method to locate the bug?