Model running on CPU instead of GPU when calling module->to(torch::kCUDA) with the C++ API

Hi, I am trying to run a PyTorch-trained model following the instructions here: https://pytorch.org/tutorials/advanced/cpp_export.html

I want to run the model on the GPU. If I run:

    torch::DeviceType device_type;
    if (torch::cuda::is_available()) {
        std::cout << "Cuda available, running on GPU" << std::endl;
        device_type = torch::kCUDA;
    } else {
        std::cout << "Cuda NOT available, running on CPU" << std::endl;
        device_type = torch::kCPU;
    }
    torch::Device device(device_type);
    module->to(torch::Device(device));

I get the message "Cuda available, running on GPU". Nevertheless, when I execute:
at::Tensor output = module->forward(inputs).toTensor();
the call takes far too long: the system monitor shows all CPU cores in use, and nvidia-smi shows no new GPU process being opened.
I guess I am doing something wrong but I can't find what. Any ideas?

I am using CUDA 9.0 and built the project against libcudnn7_7.4.2.24-1.
My includes are:

#include <torch/script.h> 
#include <torch/cuda.h>

Thanks a lot and sorry for the inconvenience.

This seems weird. Can you try and see if module->to(torch::Device(torch::kCUDA, 0)) works?

Sorry, I discovered that I didn't move the model to CUDA before tracing it in Python. If I move it, trace it, and reload it in C++, it correctly runs on the GPU.

Still, I can't copy a cv::Mat into a torch::Tensor and move it to the GPU:

at::Tensor output = torch::from_blob(img.data, {1, 3, img.rows, img.cols}, torch::kFloat32).clone();
output.to(torch::Device(torch::kCUDA, 0));
output.to(torch::kCUDA); //trying both ways just in case
assert(output.device().type() == torch::kCUDA); // Assertion fails

Afterwards, I tried to copy the data manually like so:

auto options = torch::TensorOptions().dtype(torch::kFloat32).device(torch::kCUDA, 0);
at::Tensor output = torch::rand({1, 3, img.rows, img.cols}, options); // maybe cols rows
auto foo_a = output.accessor<float, 4>(); // define accessor    
for(int i = 0; i < img.rows; i++){
    for(int j = 0; j < img.cols; j++){
        foo_a[0][0][i][j] = img.at<cv::Vec3f>(i, j)[0]; //Segmentation Fault here when accessing foo_a
        foo_a[0][1][i][j] = img.at<cv::Vec3f>(i, j)[1];
        foo_a[0][2][i][j] = img.at<cv::Vec3f>(i, j)[2];
    }
}
assert(output.device().type() == torch::kCUDA);

but I get a segmentation fault (line with the comment).

Any ideas on what could be causing these errors?

First I’d recommend using torch::Tensor instead of at::Tensor, as at::Tensor is now an implementation detail and torch::Tensor is the user-facing tensor class.

For the first example, we can try output = output.to(torch::Device(torch::kCUDA, 0)); to see if it works. Note that .to() is not an in-place operation; it returns a new tensor, so the result has to be assigned back.
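
Roughly, a minimal sketch of that fix against your first snippet (reusing its variables, so img is still assumed to be your CV_32FC3 cv::Mat, and keeping your original shape argument unchanged):

torch::Tensor output = torch::from_blob(img.data, {1, 3, img.rows, img.cols}, torch::kFloat32).clone();
output = output.to(torch::Device(torch::kCUDA, 0)); // assign the result; .to() does not modify output in place
assert(output.device().type() == torch::kCUDA);     // passes now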

For the second example, the segmentation fault happens because accessor() gives direct host-side access to the tensor's memory, while the data of a CUDA tensor lives on the GPU. Indexing the tensor element by element instead goes through proper device copies, so this will work:

auto options = torch::TensorOptions().dtype(torch::kFloat32).device(torch::kCUDA, 0);
auto output = torch::rand({1, 3, img.rows, img.cols}, options); // maybe cols rows
for(int i = 0; i < img.rows; i++){
    for(int j = 0; j < img.cols; j++){
        output[0][0][i][j] = img.at<cv::Vec3f>(i, j)[0];
        output[0][1][i][j] = img.at<cv::Vec3f>(i, j)[1];
        output[0][2][i][j] = img.at<cv::Vec3f>(i, j)[2];
    }
}
assert(output.device().type() == torch::kCUDA);
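
Keep in mind that assigning a CUDA tensor element by element issues a host-to-device copy per element, so this will be slow for a full image. A sketch of a likely faster alternative, assuming img is a contiguous CV_32FC3 cv::Mat (OpenCV stores it interleaved as H x W x 3): wrap the data on the CPU with from_blob, permute to NCHW, and move it to the GPU in a single call:

auto cpu_tensor = torch::from_blob(img.data, {1, img.rows, img.cols, 3}, torch::kFloat32); // NHWC view over the cv::Mat data
auto output = cpu_tensor.permute({0, 3, 1, 2})               // reorder to NCHW
                        .contiguous()                        // materialize the permuted layout
                        .to(torch::Device(torch::kCUDA, 0)); // single host-to-device copy
assert(output.device().type() == torch::kCUDA);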

Thanks a lot, Will Feng! Both solutions work!

Is it normal that running the forward method on the GPU takes more time than on the CPU? I wrote a simplified version of my code and it still happens:

#include <chrono>
#include <ctime>
#include <iostream>

#include <torch/cuda.h>
#include <torch/script.h>

bool use_cuda = true;

class Timer{
  private:
    std::chrono::system_clock::time_point t_start;

  public:
    void start(){
        t_start = std::chrono::system_clock::now();
    }

    void stop(std::string msg){
        std::chrono::system_clock::time_point end = std::chrono::system_clock::now();
        std::chrono::duration<double> elapsed_seconds = end - t_start;
        std::cout << msg << ": " << elapsed_seconds.count() << "s\n";
    }
};

int main(int argc, const char* argv[]) {
    if (argc != 2) {
        std::cerr << "usage: example-app <path-to-exported-script-module>\n";
        return -1;
    }

    // Deserialize model
    std::shared_ptr<torch::jit::script::Module> module = torch::jit::load(argv[1]);
    assert(module != nullptr);

    // Set CPU/GPU
    torch::DeviceType device_type = torch::kCPU;
    if(use_cuda) device_type = torch::kCUDA;
    module->to(device_type);

    // Create input
    auto options = torch::TensorOptions().dtype(torch::kFloat32).device(device_type);
    auto inp = torch::zeros({2, 3, 720, 1280}, options);
    std::vector<torch::jit::IValue> inputs;
    inputs.push_back(inp);

    // Run model
    Timer t;
    t.start();
    torch::Tensor output = module->forward(inputs).toTensor(); 
    t.stop("forward");
}

Time on CPU: ~3s
Time on GPU: ~7s!

I'm surely missing something. I am using an Nvidia GTX 1080, and the model can be evaluated in under 0.1s from Python.

Again, thank you a lot for your help!!

No problem. Regarding the GPU slowness: was the JIT model originally trained on the GPU?

Yes.
I use a pre-trained ResNet101 with custom layers at the end that are trained on the GPU.

Do you mind sharing your JIT model so I can reproduce?

Hi Will, sorry for the late reply. It turns out that the first time I run forward it takes ~7s, but if I run it in a loop all subsequent calls take ~0.09s (still not as fast as running it from Python, which is ~0.02s, but far more reasonable).

Any idea what may be causing this? I guess I could just do a warm-up at the beginning and run it once, but I am curious.

Warm-up is necessary.

I tried moving a tensor from CPU to GPU: the first time it costs a lot of time, but the following times it runs as fast as I expected.

Hello, could you please tell me how to warm up? Thank you.

Hello, I don't know how to warm up. Could you please tell me? Thank you.

Just run inference once right after you load the model.
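
For example, building on the timing program earlier in this thread, a minimal sketch of the warm-up pattern (reusing its module, inputs, and Timer):

// Warm-up: the first forward pass pays one-time costs (CUDA context creation,
// cuDNN algorithm selection, memory allocation), so run it once untimed.
module->forward(inputs);

// Subsequent, timed runs reflect steady-state performance.
Timer t;
t.start();
torch::Tensor output = module->forward(inputs).toTensor();
// Copying the result back to the CPU forces synchronization with the GPU,
// so the measured time includes the actual computation.
output = output.to(torch::kCPU);
t.stop("forward (after warm-up)");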

Hi, I have a question: how do I move the output data back to the CPU?

module.forward(inputs).to(torch::kCPU)

I tried this but it failed. Do you have any ideas?
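
One possible cause, if the failure is a compile error: forward() returns a torch::jit::IValue rather than a tensor, so there is nothing to call .to() on directly. A sketch of what likely works, assuming the newer API where module is a torch::jit::Module value (as the module.forward call above suggests) and the model returns a single tensor:

torch::Tensor output = module.forward(inputs).toTensor(); // unwrap the IValue into a tensor first
output = output.to(torch::kCPU);                          // then move the tensor to the CPU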