Debugging runtime error module->forward(inputs) libtorch 1.4

I have a question related to this project https://github.com/NathanUA/U-2-Net/blob/7e5ff7d4c3becfefbb6e3d55916f48c7f7f5858d/u2net_test.py#L104

I can trace the net like this:

traced_script_module = torch.jit.trace(net, inputs_test)
traced_script_module.save("traced_model.pt")
print(inputs_test.size()) # shows (1, 3, 320, 320)
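
Before moving to C++, it can help to confirm in Python that the saved file reloads and reproduces the eager output. Below is a minimal sketch, not the project's code: TinyNet is a hypothetical stand-in for U2NET, and everything runs on CPU with a small input so it works without a GPU. The helper name trace_and_check is also made up for illustration.

```python
import os
import tempfile

import torch
import torch.nn as nn


class TinyNet(nn.Module):
    """Hypothetical stand-in for U2NET so the sketch runs quickly on CPU."""

    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 1, kernel_size=3, padding=1)

    def forward(self, x):
        return torch.sigmoid(self.conv(x))


def trace_and_check(net, example, path):
    """Trace net, save it, reload it, and compare with the eager output."""
    net.eval()
    with torch.no_grad():
        expected = net(example)
        traced = torch.jit.trace(net, example)
    traced.save(path)
    # map_location keeps the reloaded module on CPU for this check.
    reloaded = torch.jit.load(path, map_location="cpu")
    with torch.no_grad():
        actual = reloaded(example)
    return torch.allclose(expected, actual, atol=1e-6)


torch.manual_seed(0)
example = torch.ones(1, 3, 32, 32)
with tempfile.TemporaryDirectory() as tmp:
    ok = trace_and_check(TinyNet(), example, os.path.join(tmp, "traced_model.pt"))
print("reloaded module matches eager output:", ok)
```

If this check passes for the real model, the saved file itself is probably fine and the problem is more likely on the loading side.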

Now I’m trying to run the model in a C++ application. I was able to do this in a prior project, https://github.com/DBraun/PyTorchTOP-cpumem, where I used CMake and built in debug mode by running SET DEBUG=1 before the CMake instructions.

In the C++ project for U-2-Net, I can load the model into a module with no errors. When I call

torchinputs.clear();
torchinputs.push_back(torch::ones({1, 3, 320, 320 }, torch::kCUDA).to(at::kFloat));
module.forward(torchinputs); // error

I get

Unhandled exception at 0x00007FFFD8FFA799 in TouchDesigner.exe: Microsoft C++ exception: std::runtime_error at memory location 0x000000EA677F1B30. occurred

The error is at https://github.com/pytorch/pytorch/blob/4c0bf93a0e61c32fd0432d8e9b6deb302ca90f1e/torch/csrc/jit/api/module.h#L112 and it says inputs has size 0. However, I’m pretty sure I’ve passed non-empty data of shape (1, 3, 320, 320) to module->forward(): https://github.com/DBraun/PyTorchTOP-cpumem/blob/f7cd16cb84021a7fc3681cad3a66c2bd7551a572/src/PyTorchTOP.cpp#L294

This is the stack trace at module->forward(torchinputs): [screenshot]

I thought it might be a DLL issue, but I’ve copied all the DLLs from libtorch/lib.

I can confirm GPU stuff is available and that when I traced the module I was using CUDA.

LoadLibraryA("c10_cuda.dll");
LoadLibraryA("torch_cuda.dll");

try {
	std::cout << "CUDA:   " << torch::cuda::is_available() << std::endl;
	std::cout << "CUDNN:  " << torch::cuda::cudnn_is_available() << std::endl;
	std::cout << "GPU(s): " << torch::cuda::device_count() << std::endl;
}
catch (std::exception& ex) {
	std::cout << ex.what() << std::endl;
}

While trying to fix the runtime exception on module->forward, I thought maybe @torch.jit.script needed to be on some of the functions in the U-2-Net project, like here: https://github.com/NathanUA/U-2-Net/blob/7e5ff7d4c3becfefbb6e3d55916f48c7f7f5858d/model/u2net.py#L24 I was worried about calling shape[2:] in a function without the @torch.jit.script decorator. Should I not be worried?

Any advice is appreciated!

I’ve also followed all the instructions in the thread "An unhandled exception: Microsoft C++ exception: c10::Error at memory location".

Have you moved your model to CUDA? The model will be on CPU by default if you call torch::jit::load.

Thanks for your suggestion. I tried

module = torch::jit::load("traced_model.pt", torch::kCUDA);
module.to(torch::kCUDA);

but got the same results. I have the debug DLLs and libraries set up, so I’m ready to do more debugging. Is there anything more I can do to help?

I’m stepping through line-by-line. I noticed that the module.forward() call takes about 18 seconds before the exception and this happens even when I know I’m giving it a wrongly sized Tensor:

torchinputs.push_back(torch::ones({1, 1, 1, 1}, torch::kCUDA).to(torch::kFloat)); // intentionally wrong size
module.forward(torchinputs);

If I change everything in my code to CPU, it doesn’t throw a runtime error. So I must not be succeeding in getting everything onto CUDA. I also tried following everything here: https://github.com/pytorch/pytorch/issues/19302

Why isn’t this sufficient for having everything in CUDA?

auto module = torch::jit::load("traced_model.pt", torch::kCUDA);
for (auto p : module.parameters()) {
	std::cout << p.device() << std::endl; // cuda:0
}

auto finalinput = torch::ones({ 1, 3, 320, 320 }, torch::TensorOptions().dtype(torch::kFloat).device(torch::kCUDA));
std::cout << "finalinput device: " << finalinput.device() << std::endl; // cuda:0
torchinputs.push_back(finalinput);
auto forward_result = module.forward(torchinputs); // std::runtime_error

Changing just those two kCUDA references to kCPU makes the error go away.


I read everything here https://pytorch.org/tutorials/advanced/cpp_export.html and tried at::kCUDA instead of torch::kCUDA. I tried the nightly debug 1.5 libtorch but encountered other problems that I couldn’t solve, so I need to stick with 1.4 for now.

Same issue here. I use a U-Net for inference (libtorch 1.5.0, CUDA 9.2).

Hi, I was facing the same issue (trying to run in Unreal Engine 4.25).

Did you manage to solve this?

Not yet. Maybe the info in this GitHub issue will help you, although it didn’t work for me: https://github.com/NathanUA/U-2-Net/issues/29

Putting LoadLibraryA("torch_cuda.dll"); early in my code allowed me to start using the nightly debug build of libtorch 1.5, but I’m still stuck on module.forward.

I also put -INCLUDE:?warp_size@cuda@at@@YAHXZ in Linker > All Options > Additional Options, and I call LoadLibraryA("c10_cuda.dll"); early too. Is there something else I can try?

Update again: I can trace a style transfer model and that works in my code here, but the traced U2Net model doesn’t work.

Below is how I wrapped the model. Is it ok to use x[:,0,:,:] or does that break the jit trace? I’m also concerned about this line https://github.com/NathanUA/U-2-Net/blob/b77cd6da3204efcb03e18e15dd3b9eb24d47f969/model/u2net.py#L24

def normPRED(d):
    ma = torch.max(d)
    mi = torch.min(d)

    dn = (d-mi)/(ma-mi)

    return dn


class ModelWrapper(nn.Module):

    def __init__(self, u2netmodel):
        super(ModelWrapper, self).__init__()

        self.u2netmodel = u2netmodel

    def forward(self, x):

        # my code doesn't use ToTensorLab in the data loader
        # https://github.com/NathanUA/U-2-Net/blob/b77cd6da3204efcb03e18e15dd3b9eb24d47f969/data_loader.py#L208
        # so do the normalization in this wrapper

        x = x / torch.max(x)

        r = (x[:,0,:,:]-0.485)/0.229
        g = (x[:,1,:,:]-0.456)/0.224
        b = (x[:,2,:,:]-0.406)/0.225

        img = torch.stack((r,g,b), 1)

        d1,d2,d3,d4,d5,d6,d7 = self.u2netmodel(img)

        return normPRED(d1)

# paraphrasing u2net_test.py
net = U2NET(3,1)
net.cuda()
wrapper = ModelWrapper(net)
wrapper.cuda()
wrapper.to('cuda') # just in case?
print('is cuda: ' + str(next(wrapper.parameters()).is_cuda)) # True

inputs_test = data_test['image']
inputs_test = inputs_test.type(torch.FloatTensor)
inputs_test = inputs_test.cuda()
print("inputs size: " + str(inputs_test.size())) # [1, 3, 320, 320]

d1 = wrapper(inputs_test)

# save d1 as an image and the image is great!
# save_output(img_name_list[i_test],pred,prediction_dir)

traced = torch.jit.trace(wrapper, inputs_test)
traced.save("traced_model.pt")
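
On the question of whether x[:,0,:,:] breaks the trace: as far as I know, slicing with constant indices is recorded as ordinary tensor ops, so it should survive torch.jit.trace. Below is a small sketch that checks this on CPU. SliceNorm copies only the normalization part of the wrapper above, and BroadcastNorm is a hypothetical alternative (not from the project) that does the same arithmetic with broadcasting instead of torch.stack.

```python
import torch
import torch.nn as nn


class SliceNorm(nn.Module):
    """Per-channel normalization written with slices and torch.stack."""

    def forward(self, x):
        x = x / torch.max(x)
        r = (x[:, 0, :, :] - 0.485) / 0.229
        g = (x[:, 1, :, :] - 0.456) / 0.224
        b = (x[:, 2, :, :] - 0.406) / 0.225
        return torch.stack((r, g, b), 1)


class BroadcastNorm(nn.Module):
    """Same arithmetic using broadcasting, with no stack call."""

    def __init__(self):
        super().__init__()
        mean = torch.tensor([0.485, 0.456, 0.406]).view(1, 3, 1, 1)
        std = torch.tensor([0.229, 0.224, 0.225]).view(1, 3, 1, 1)
        # Buffers move with the module on .cuda() / .to(device).
        self.register_buffer("mean", mean)
        self.register_buffer("std", std)

    def forward(self, x):
        x = x / torch.max(x)
        return (x - self.mean) / self.std


torch.manual_seed(0)
trace_input = torch.rand(1, 3, 8, 8)
other_input = torch.rand(1, 3, 8, 8) * 255.0

traced = torch.jit.trace(SliceNorm(), trace_input)

# Check the traced graph on an input it was NOT traced with.
slices_survive_trace = torch.allclose(traced(other_input), SliceNorm()(other_input))
same_as_broadcast = torch.allclose(
    traced(other_input), BroadcastNorm()(other_input), atol=1e-5
)
print(slices_survive_trace, same_as_broadcast)
```

This only speaks to whether the slicing is safe to trace in Python. It says nothing about failures that show up only when the module is run from C++.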

If I use torch.jit.script instead of torch.jit.trace:

sm = torch.jit.script(wrapper)
torch.jit.save(sm, "traced_model.pt")

I realized I could put print statements in the model and that they would show up in the C++ console. I put some around the torch.stack call in my ModelWrapper, and it turns out this is the moment it throws the exception. The same thing happens when I rewrite it with torch.cat followed by torch.unsqueeze (the cat fails).

Weirdly, I am able to do this inside ModelWrapper’s forward:
something = torch.stack([torch.randn([2, 3, 4]), torch.randn([2, 3, 4])])

When I removed the torch.stack call from ModelWrapper, using print statements I was able to pinpoint a failure within the u2netmodel forward on a call to cat. https://github.com/NathanUA/U-2-Net/blob/b77cd6da3204efcb03e18e15dd3b9eb24d47f969/model/u2net.py#L87

So my question now is why can’t I do torch.stack or torch.cat in this module? Is it because these things allocate memory whereas calls to Conv2D don’t allocate? I’ve used print statements to make sure everything’s on cuda etc. Something about cloning? Python execution of the same model is working fine.

I realized I should just try the following code:

auto thing1 = torch::ones({ 1, 3, 5, 5 }, torch::kCUDA).to(torch::kFloat32);
auto thing2 = torch::ones({ 1, 3, 5, 5 }, torch::kCUDA).to(torch::kFloat32);
auto thing3 = torch::cat({ thing1, thing2 }, 1);

I get this error on torch::cat

Unhandled exception at 0x00007FFA2C4FA799 in TouchDesigner.exe: Microsoft C++ exception: c10::Error at memory location 0x00000046B7DF54D0.
at this line: [screenshot]

Note that I’m running this code inside a DLL compiled for TouchDesigner and TouchDesigner is using CUDA 10.1 to match my libtorch build. If I run the same code inside a simple exe, there’s no error. Any clue what’s going on or clue to proceed? Thank you!

It was an issue with DLLs…

TouchDesigner has its own DLLs in C:/Program Files/Derivative/TouchDesigner/bin. These DLLs get loaded when TouchDesigner opens.

My custom plugin is in Documents/Derivative/Plugins, and all of the libtorch DLLs are also there. My thought was that having everything in this Plugins folder would be sufficient: if my custom DLL looked for a dependent DLL, it would find it there as a sibling. However, I needed to paste the libtorch DLLs into TouchDesigner’s own bin folder. I didn’t track down which specific DLL was the dealbreaker. Whichever one is most relevant to torch::cat, I guess…
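
One way to narrow down which DLL was the dealbreaker is to list the DLL names that exist in both folders. This assumes the cause was a same-named DLL already loaded from the host's bin folder, which Windows would reuse instead of the copy next to the plugin; the thread does not confirm that. The sketch below uses only the standard library. The function name shared_dll_names is made up, and the two paths are taken from the post and may differ on another machine.

```python
from pathlib import Path


def shared_dll_names(app_bin, plugin_dir):
    """Map each DLL name found in both folders to a same-size flag.

    Names are compared case-insensitively, as Windows does. A size
    mismatch hints that the two copies are different builds.
    """
    def dlls(folder):
        return {
            p.name.lower(): p
            for p in Path(folder).iterdir()
            if p.is_file() and p.suffix.lower() == ".dll"
        }

    app, plugin = dlls(app_bin), dlls(plugin_dir)
    report = {}
    for name in sorted(app.keys() & plugin.keys()):
        report[name] = app[name].stat().st_size == plugin[name].stat().st_size
    return report


if __name__ == "__main__":
    # Paths from the post; adjust them to your own install.
    bin_dir = Path("C:/Program Files/Derivative/TouchDesigner/bin")
    plugins = Path.home() / "Documents/Derivative/Plugins"
    if bin_dir.is_dir() and plugins.is_dir():
        for name, same in shared_dll_names(bin_dir, plugins).items():
            print(name, "same size" if same else "DIFFERENT size")
```

Any name reported with a different size would be a candidate to inspect first, for example a CUDA or cuDNN runtime DLL that the host application ships in its own version.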