Is it possible to execute the script module in C++ on multiple GPUs in parallel?

Hi, everyone!

I have a PyTorch model generated by torch.jit.trace().save(). From what I found online, a model generated and saved this way doesn’t seem to support torch.nn.parallel.
So I loaded the resulting TorchScript module and executed it in C++.
The whole program is used to detect objects in a video.

But I am not satisfied with its throughput (measured in FPS). Since I have two GPUs, I am wondering whether it’s possible to execute on multiple GPUs in parallel (for example, load 32 frames from the video and let each GPU handle 16). However, torch::jit::load(modelPath.c_str()) only loads the script module onto a single device, so I am curious whether torch::nn::parallel could help here.

Could anyone please give me a clue on how to use torch::nn::parallel in my case? If that’s not possible, are there other ways to execute the script module in C++ on multiple GPUs in parallel? Or is raw multi-GPU CUDA programming my only option?

Thanks a lot for any possible help!

Hi,

Doesn’t nn::parallel::DataParallel work just like the python version (doc here) ?
You can wrap your module into it after loading.
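For a regular nn::Module the call would look roughly like the sketch below (illustrative shapes and module; data_parallel splits the input batch along dim 0 across the visible CUDA devices, replicates the module, and gathers the outputs):

```cpp
#include <torch/torch.h>

int main() {
  // Any nn::Module holder works here; Linear is just a stand-in.
  torch::nn::Linear model(8, 4);

  // A batch of 32 samples; data_parallel shards this along dim 0.
  torch::Tensor input = torch::rand({32, 8});

  // Replicates `model` on each CUDA device, runs forward on each
  // shard in parallel, and gathers the results on one device.
  torch::Tensor out = torch::nn::parallel::data_parallel(model, input);
  return 0;
}
```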

Hi, thanks a lot for your reply! But according to this post, How to trace jit in evaluation mode using multi-gpu learned model?, if my model was generated in Python using torch.jit.trace().save(), will it still work to load the model and parallelize it in C++? (Maybe I’m mixing up those concepts…)

Furthermore, I tried that but keep encountering type errors. May I ask if the following is what you meant?

torch::jit::Module model = torch::jit::load(modelPath.c_str());
input_tensor = input_tensor.to(at::kCUDA).permute({0, 3, 1, 2}).contiguous();
torch::Tensor outputs = torch::nn::parallel::data_parallel(model, input_tensor);

But I got the following compile errors:

libtorch/include/torch/csrc/api/include/torch/nn/parallel/data_parallel.h:267:5: error: base operand of ‘->’ has non-pointer type ‘torch::jit::Module’
     module->to(devices->front());
     ^~~~~~
libtorch/include/torch/csrc/api/include/torch/nn/parallel/data_parallel.h:269:12: error: base operand of ‘->’ has non-pointer type ‘torch::jit::Module’
     return module->forward(std::move(input)).to(*output_device);
            ^~~~~~
libtorch/include/torch/csrc/api/include/torch/nn/parallel/data_parallel.h:279:28: error: no matching function for call to ‘replicate(torch::jit::Module&, std::vector<c10::Device>&)’
   auto replicas = replicate(module, *devices);
                   ~~~~~~~~~^~~~~~~~~~~~~~~~~~
libtorch/include/torch/csrc/api/include/torch/nn/parallel/data_parallel.h:142:42: note: candidate: template<class ModuleType> std::vector<std::shared_ptr<_Tp> > torch::nn::parallel::replicate(const std::shared_ptr<_Tp>&, const std::vector<c10::Device>&)
 std::vector<std::shared_ptr<ModuleType>> replicate(
                                          ^~~~~~~~~
libtorch/include/torch/csrc/api/include/torch/nn/parallel/data_parallel.h:142:42: note:   template argument deduction/substitution failed:
libtorch/include/torch/csrc/api/include/torch/nn/parallel/data_parallel.h:279:28: note:   ‘torch::jit::Module’ is not derived from ‘const std::shared_ptr<_Tp>’
   auto replicas = replicate(module, *devices);
                   ~~~~~~~~~^~~~~~~~~~~~~~~~~~
libtorch/include/torch/csrc/api/include/torch/nn/parallel/data_parallel.h:161:39: note: candidate: template<class ModuleType> std::vector<torch::nn::ModuleHolder<ModuleType> > torch::nn::parallel::replicate(const torch::nn::ModuleHolder<ModuleType>&, const std::vector<c10::Device>&)
 std::vector<ModuleHolder<ModuleType>> replicate(
                                       ^~~~~~~~~
libtorch/include/torch/csrc/api/include/torch/nn/parallel/data_parallel.h:161:39: note:   template argument deduction/substitution failed:
libtorch/include/torch/csrc/api/include/torch/nn/parallel/data_parallel.h:279:28: note:   ‘torch::jit::Module’ is not derived from ‘const torch::nn::ModuleHolder<ModuleType>’
   auto replicas = replicate(module, *devices);
                   ~~~~~~~~~^~~~~~~~~~~~~~~~~~

I also tried _net = std::make_shared<torch::jit::Module>(model); and torch::Tensor outputs = torch::nn::parallel::data_parallel(&model, input_tensor);, but neither of them works.

Do I misunderstand what you said?

Hi,

if my model is generated in Python using torch.jit.trace().save(), will it still work to load the model and parallelize it in C++

The difference from that other post is that you don’t save the DataParallel model; you want to apply it after loading.

Do I misunderstand what you said?

No, that was my suggestion. But it looks like a jit::Module can’t be used in place of an nn::Module here :confused:
Could you open an issue on github about that please?
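In the meantime, a manual workaround might be to load one replica of the script module per device and run each half-batch on its own thread. A rough sketch, assuming a model at modelPath and a batch of 32 frames (the path, shapes, and the fact that forward returns a single tensor are all assumptions about your setup):

```cpp
#include <torch/script.h>
#include <thread>

int main() {
  const std::string modelPath = "model.pt";  // assumed path to your traced model

  // torch::jit::load accepts an optional target device, so load one
  // replica of the script module directly onto each GPU.
  torch::jit::Module m0 = torch::jit::load(modelPath, torch::Device(torch::kCUDA, 0));
  torch::jit::Module m1 = torch::jit::load(modelPath, torch::Device(torch::kCUDA, 1));

  // Stand-in for 32 preprocessed video frames (NCHW).
  torch::Tensor input = torch::rand({32, 3, 224, 224});
  auto chunks = input.chunk(2, /*dim=*/0);  // 16 frames per GPU

  torch::Tensor out0, out1;
  std::thread t0([&] {
    out0 = m0.forward({chunks[0].to(torch::Device(torch::kCUDA, 0))}).toTensor();
  });
  std::thread t1([&] {
    out1 = m1.forward({chunks[1].to(torch::Device(torch::kCUDA, 1))}).toTensor();
  });
  t0.join();
  t1.join();

  // Gather the per-GPU results on the CPU for post-processing.
  torch::Tensor outputs = torch::cat({out0.cpu(), out1.cpu()}, 0);
  return 0;
}
```

This is essentially what DataParallel does internally (replicate, scatter, parallel forward, gather), just written out by hand for a jit::Module.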

Sure, thanks a lot for your quick response!