How to use multi-GPU inference in libtorch?

I want to use libtorch for multi-GPU inference. Is there any example or tutorial?
Should I create multiple jit::script::Module instances and move each one to a different GPU?

If I’m not mistaken, torch::nn::parallel::data_parallel would be the equivalent of nn.DataParallel in the Python frontend. In case you would like to use DistributedDataParallel, feel free to add your use case in this poll.
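
For reference, here is a minimal sketch of how it can be called from the C++ frontend. This assumes two visible GPUs and a plain torch::nn module (a bare Linear layer here just to keep the sketch short):

    #include <torch/torch.h>
    #include <torch/nn/parallel/data_parallel.h>
    #include <iostream>

    int main() {
      // Any C++-frontend module works; Linear keeps the example small.
      auto model = std::make_shared<torch::nn::LinearImpl>(64, 8);
      model->to(torch::kCUDA);

      auto input = torch::randn({32, 64}, torch::kCUDA);

      // Scatters the batch along dim 0 across the listed devices, replicates
      // the module onto each of them, runs the shards, and gathers the outputs.
      auto output = torch::nn::parallel::data_parallel(
          model, input,
          std::vector<torch::Device>{torch::Device(torch::kCUDA, 0),
                                     torch::Device(torch::kCUDA, 1)});

      std::cout << output.sizes() << std::endl;
    }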

I want to use multiple GPUs manually, because the input data sizes are different.

for (int i = 0; i < param_.gpu_size(); ++i) {
    torch::Device device(torch::kCUDA, i);
    models.push_back(torch::jit::load(MODEL_PATH, device));
    models.back()->eval();
}

Then I will create gpu_size threads to run the inference.
Is this the correct way?
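
Roughly, each worker thread would run a simplified version of my predict_torch, something like this sketch (the real function takes Sample batches and more parameters, but the device handling is the relevant part):

    #include <torch/script.h>
    #include <c10/cuda/CUDAGuard.h>

    // Simplified per-GPU worker: runs one loaded module on its own device.
    void predict_torch(torch::jit::Module* model,
                       const std::vector<torch::Tensor>& inputs,
                       int gpu_id) {
        torch::Device device(torch::kCUDA, gpu_id);
        c10::cuda::CUDAGuard device_guard(device);  // make gpu_id the current CUDA device
        torch::NoGradGuard no_grad;                 // inference only

        for (const auto& input : inputs) {
            auto output = model->forward({input.to(device)}).toTensor();
            // ... post-process output ...
        }
    }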

I used 2 threads to run the model on 2 GPUs, like this:

      std::vector<std::thread*> threads(0);
      int parallel = std::min(param_.gpu_size(),
                              static_cast<int>(batched_samples.size()));
      batches.resize(parallel);
      for (int i = 0; i < parallel; ++i) {
          batches[i].resize(0);
          for (int j = i; j < batched_samples.size(); j += parallel) {
              batches[i].push_back(std::move(batched_samples[j]));
          }
          LOG(ERROR) << "==== predict level-" << l << " batch-"
                     << batches[i].size();

          threads.push_back(new std::thread(
                  &predict_torch, models_[i],
                  std::ref(batches[i]), param_.gpu(i)));
      }
      for (int i = 0; i < threads.size(); ++i) {
          threads[i]->join();
          delete threads[i];
      }

But I got this error:

terminate called after throwing an instance of 'c10::Error'
  what():  r INTERNAL ASSERT FAILED at "../aten/src/ATen/core/jit_type_base.h":172, please report a bug to PyTorch. 
Exception raised from expect at ../aten/src/ATen/core/jit_type_base.h:172 (most recent call first):
frame #0: <unknown function> + 0x101f9b (0x7fa6e14e6f9b in /root/anaconda3/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: std::function<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > ()>::operator()() const + 0x4c (0x7fa6e14e7e20 in /root/anaconda3/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #2: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x40 (0x7fa6e14e6030 in /root/anaconda3/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #3: std::shared_ptr<c10::ClassType> c10::Type::expect<c10::ClassType>() + 0xbb (0x7fa6c2ae3b1b in /root/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #4: c10::ivalue::Object::type() const + 0x41 (0x7fa6c2ad10f1 in /root/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #5: <unknown function> + 0x5ec04cf (0x7fa6c69c14cf in /root/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #6: torch::jit::Object::find_method(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) const + 0x37 (0x7fa6c69d48c1 in /root/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #7: torch::jit::Object::get_method(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) const + 0x50 (0x7fa6d7179414 in /root/ecopia-weaver-multi-scale-framework/src/build/libecopia_weaver.so)
frame #8: torch::jit::Module::forward(std::vector<c10::IValue, std::allocator<c10::IValue> >) + 0xba (0x7fa6d717979a in /root/ecopia-weaver-multi-scale-framework/src/build/libecopia_weaver.so)
frame #9: ecopia::ml::CaffeForwardMultiScale::predict_torch(torch::jit::Module*, std::vector<std::vector<ecopia::ml::Sample*, std::allocator<ecopia::ml::Sample*> >, std::allocator<std::vector<ecopia::ml::Sample*, std::allocator<ecopia::ml::Sample*> > > > const&, int, int, int, int, ecopia::ml::LevelInfo const&, std::unordered_map<int, MultiChannelRasterData<float>*, std::hash<int>, std::equal_to<int>, std::allocator<std::pair<int const, MultiChannelRasterData<float>*> > >&, std::unordered_map<int, MultiChannelLabelMap<unsigned char>*, std::hash<int>, std::equal_to<int>, std::allocator<std::pair<int const, MultiChannelLabelMap<unsigned char>*> > >&, std::unordered_map<int, SingleChannelLabelMap<unsigned char>*, std::hash<int>, std::equal_to<int>, std::allocator<std::pair<int const, SingleChannelLabelMap<unsigned char>*> > >&) + 0x139c (0x7fa6d7172224 in /root/ecopia-weaver-multi-scale-framework/src/build/libecopia_weaver.so)

Does anyone have any ideas?

If I run the model on only one of the two GPUs, either GPU works fine.