Torchscript frcnn model with nms in python and load in C++ libtorch

I’ve been trying to get an FRCNN model trained in Python “torchscripted” and saved off so that it can be loaded into C++ using libtorch. To reduce the situation to a fairly simple example, I’m now just using torchvision’s pretrained model, taking my custom model out of the picture. I’m still running into problems: loading the scripted model in libtorch fails, complaining that nms is an “Unknown builtin op”.

Here’s the reduced python code:

import torch
import torchvision

print(f'Torch version: {torch.__version__}')
print(f'Torchvision version: {torchvision.__version__}')

tvfrcnnModel = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained = True)
scriptTvfrcnnModel = torch.jit.script(tvfrcnnModel)
torch.jit.save(scriptTvfrcnnModel, 'torchvision_frcnn_scripted.pt')

print(f'Torchvision pretrained frcnn scripted and saved')

that outputs the following:

Torch version: 1.10.0a0+gitf69cf3c
Torchvision version: 0.11.0a0+9275cc6
Torchvision pretrained frcnn scripted and saved

Here’s the C++ code, then, that attempts to load the scripted model:

//According to: https://github.com/pytorch/vision/#c-api
//  In order to get the torchvision operators registered with
//  torch (eg. for the JIT), all you need to do is to ensure
//  that you #include <torchvision/vision.h> in your project.
#include <torchvision/vision.h>
#include <torch/torch.h>
#include <torch/script.h>
#include <iostream>
#include <fstream>
#include <string>

int main()
{
  std::string scriptedModelFname = "torchvision_frcnn_scripted.pt";
  std::ifstream scriptedModelF(scriptedModelFname, std::ios::binary);
  torch::jit::script::Module tvfrcnnModel = torch::jit::load(scriptedModelF);
  scriptedModelF.close();

  return 0;
}

When I run this program, though, it reports:

terminate called after throwing an instance of 'torch::jit::ErrorReport'
  what():  
Unknown builtin op: torchvision::nms.
Could not find any similar ops to torchvision::nms. This op may not exist or may not be currently supported in TorchScript.
:
  File "/usr/local/lib64/python3.6/site-packages/torchvision-0.11.0a0+9275cc6-py3.6-linux-x86_64.egg/torchvision/ops/boxes.py", line 35
    """
    _assert_has_ops()
    return torch.ops.torchvision.nms(boxes, scores, iou_threshold)
           ~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
...

I’ve seen a small handful of similar questions out there (e.g. Unknown builtin op: torchvision::nms when loading scripted FasterRCNN · Issue #48932 · pytorch/pytorch · GitHub) but I haven’t found any that come to a resolution. Is torchscripting an FRCNN model with nms something that is supposed to be supported currently, and if so, are there any suggestions as to what might be going wrong here?

Thanks!

But isn’t the solution already given there, in the link to GitHub - pytorch/vision: Datasets, Transforms and Models specific to Computer Vision?

Well, I don’t think so. I mean, I believe I’ve followed the steps in that link. To be sure, I went through it again and made a new project directory with minimal code. I still get the same error:

terminate called after throwing an instance of 'torch::jit::ErrorReport'
  what():  
Unknown builtin op: torchvision::nms.
Could not find any similar ops to torchvision::nms. This op may not exist or may not be currently supported in TorchScript.
:
  File "/usr/local/lib64/python3.6/site-packages/torchvision-0.11.0a0+a2b4c65-py3.6-linux-x86_64.egg/torchvision/ops/boxes.py", line 35
    """
    _assert_has_ops()
    return torch.ops.torchvision.nms(boxes, scores, iou_threshold)
           ~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE

My CMakeLists.txt looks like this:

cmake_minimum_required(VERSION 3.10)
project(testTorch)

# The first thing do is to tell cmake to find the TorchVision library.
# The package pulls in all the necessary torch libraries,
# so there is no need to also add `find_package(Torch)` here.
find_package(TorchVision REQUIRED)

add_executable(testTorch main.cpp)

target_include_directories(testTorch PRIVATE ${OpenCV_INCLUDE_DIRS})

# We now need to link the TorchVision library to our executable.
# We can do that by using the TorchVision::TorchVision target,
# which also adds all the necessary torch dependencies.
target_compile_features(testTorch PUBLIC cxx_range_for)
target_link_libraries(testTorch TorchVision::TorchVision)
set_property(TARGET testTorch PROPERTY CXX_STANDARD 14)

I hope I’m not missing something obvious here, I tried many things before posting the question. Any help would be appreciated…

I’ll have to check, but you probably tried using PUBLIC for linking torchvision, right?

I have tried it now. Didn’t seem to make a difference - same error. Thanks for the suggestion!

When building pytorch, I didn’t specify “-DSELECTED_OP_LIST”, because I saw a post indicating that the default was to include all ops and that you only needed SELECTED_OP_LIST if you wanted to restrict it to a subset of the ops.

I have a similar problem! So in the end did you manage to sort it out? How did you specify the operators with -DSELECTED_OP_LIST @daroo? Could you please post some commands / cmake config files?

No, I haven’t got past this yet. As indicated in the previous post, I didn’t end up specifying -DSELECTED_OP_LIST since, given the posts I read, it didn’t seem necessary.

@daroo it’s so weird. When I try to load the scripted model I get the same error. When I try to register the ops manually with something like:

static auto registry = torch::RegisterOperators().op("torchvision::nms", &vision::ops::nms);

I get:

libc++abi: terminating with uncaught exception of type c10::Error: Tried to register an operator (torchvision::nms(Tensor _0, Tensor _1, float _2) -> (Tensor _0)) with the same name and overload name multiple times. Each overload's schema should only be registered with a single call to def(). Duplicate registration: registered by RegisterOperators. Original registration: registered at /tmp/torchvision-20210615-98076-1ldvglh/vision-0.10.0/torchvision/csrc/ops/nms.cpp:18

Hmm, that is odd. I haven’t played much with registering the ops myself, because it sounded like they would be registered just by doing “#include <torchvision/vision.h>”, which I have in the .cpp file. The way I had read to specify the ops was to dump out a yaml file from python with your network’s ops listed in it, and then, when building pytorch, pass that pre-made yaml file via “-DSELECTED_OP_LIST”, something like “-DSELECTED_OP_LIST:STRING=myNetOps.yaml”. But again, I didn’t mess with that much, since I read it wasn’t necessary unless you were trying to restrict things for space reasons or something…
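For what it’s worth, a minimal sketch of the op-list dump described above, using torch.jit.export_opnames on a toy module (the toy module and the file name myNetOps.yaml are only for illustration):

```python
import torch

# Toy stand-in for the real network; the same call works on any scripted module.
class TinyNet(torch.nn.Module):
    def forward(self, x):
        return torch.relu(x)

scripted = torch.jit.script(TinyNet())

# List of operator names the scripted module uses, e.g. 'aten::relu'.
ops = torch.jit.export_opnames(scripted)

# SELECTED_OP_LIST expects a yaml list, which for a flat list of
# strings is just one "- name" line per op.
with open('myNetOps.yaml', 'w') as f:
    f.writelines(f'- {op}\n' for op in ops)
```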

I’m holding out hope that @tom will have some great insight on this. :slight_smile:

Hey @daroo @tom did you manage to solve this? I tried like a million different ways of registering those operators without any luck. The docs state that all you have to do is use “#include <torchvision/vision.h>” - just like @daroo mentioned. But this simply doesn’t work… The only thing I was able to do to fix it short term was to clone the previous version (0.8.0), build it myself, and link against it…

@pav Just this morning, I think I made some progress. I noticed that, despite the docs indicating that #include’ing the vision.h file is all that is needed, my resulting executable was not linked against torchvision, even though it was specified on the linker command line. As this link: About RegisterOperators in C++ · Issue #2134 · pytorch/vision · GitHub discusses, the linker may decide that a library isn’t needed and not link against it at all. That’s what user bmanga’s suggested update at the bottom of that post is supposed to overcome.

That clever way of forcing torchvision to be linked in even when the linker doesn’t detect any use of the library either wasn’t implemented or just isn’t working in my case, I think. So I used the big-hammer approach of forcing the linker to use the “whole archive” for torchvision: on the link line, right before the libtorchvision library I added “-Wl,--whole-archive”, and right after it I added “-Wl,--no-whole-archive”.
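In CMake terms, roughly this (a sketch, not my exact file; the LINKER: handling in target_link_options needs CMake ≥ 3.13, and both flags here are GNU ld only):

```cmake
# GNU ld drops libraries it decides are unused, which also drops the static
# initializers that register the torchvision ops. Keep torchvision even if
# the linker thinks it is unneeded:
target_link_options(testTorch PRIVATE "LINKER:--no-as-needed")

# Big-hammer alternative for a static libtorchvision: pull in the whole archive.
# target_link_libraries(testTorch
#   -Wl,--whole-archive TorchVision::TorchVision -Wl,--no-whole-archive)
```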

Now, I can load the torchvision FRCNN without error!

When I try to load my own custom net I get some other errors, but I think they’re unrelated to this. So, for now, forcing the linker to load the entire torchvision library even if it doesn’t think it is needed seems to get past the issue at hand.

Hi,

as discussed in the bug, if your executable isn’t linked to libtorchvision, you are in trouble.
I’m surprised that --no-as-needed doesn’t work, but hey.
An alternative could be to just dlopen libtorchvision manually.

Best regards

Thomas

@daroo @tom Ahh, I see! Yes, you are both right - I’m on Mac, so I don’t have access to “-Wl,--whole-archive”. However, what I ended up using is “-Wl,-needed-ltorchvision” - now all the ops are being registered. Thanks guys!