About a runtime error with CUDA in C++

Hi all,

I am getting a runtime error with CUDA in C++ libtorch, as follows:

    terminate called after throwing an instance of 'c10::Error'
    what(): Tensor for argument #3 'mat2' is on CPU, but expected it to be on GPU (while checking arguments for addmm)
    Exception raised from checkSameGPU at /pytorch/aten/src/ATen/TensorUtils.cpp:122 (most recent call first):

It seems to me that there is a conflict between the GPU and the CPU, but I am not sure what exactly the conflict is or how to fix it.
The related code is:

    torch::Tensor forward(int64_t batch_size, bool cuda = false) {
        torch::Tensor x = torch::autograd::Variable(torch::rand({batch_size, z_dim}));
        if (cuda)
            x = x.cuda();
        x = torch::nn::functional::softplus(bn1(fc1(x)) + bn1_b);

Before that, I also have a line related to the CPU:

    return loss.data().cpu();

Should I modify this line in some other way? Any comments, please?

Thanks

@Chen0729
This might not be a conflict. It is caused by mixing a CPU tensor (whose storage is in main memory) and a GPU tensor (whose storage is on the GPU) in the same function.
Check your code; I would guess there is somewhere you pass both CPU and GPU tensors to the same function.

Thanks. The line that triggers the error is:

    x = torch::nn::functional::softplus(bn1(fc1(x)) + bn1_b);

x is defined as:

    torch::Tensor x = torch::autograd::Variable(torch::rand({batch_size, z_dim}));

The code does not allow me to use

    assert(x.device().type() == torch::kCUDA);

So in the debugger, how shall I check whether x is a CPU tensor or a GPU tensor? Are there any other tensors I need to check? Thanks.

bn1, fc1 and bn1_b are defined as:

    GeneratorImpl(int64_t z_dim, int64_t output_dim) {
        fc1 = register_module("fc1", torch::nn::Linear(torch::nn::LinearOptions(z_dim, 500).bias(false)));
        bn1 = register_module("bn1", torch::nn::BatchNorm1d(torch::nn::BatchNorm1dOptions(500).eps(1e-6).momentum(0.5).affine(false)));
        fc2 = register_module("fc2", torch::nn::Linear(torch::nn::LinearOptions(500, 500).bias(false)));
        bn2 = register_module("bn2", torch::nn::BatchNorm1d(torch::nn::BatchNorm1dOptions(500).eps(1e-6).momentum(0.5).affine(false)));
        fc3 = register_module("fc3", LinearWeightNorm(500, output_dim, 1));
        bn1_b = register_parameter("bn1_b", torch::zeros(500));
        bn2_b = register_parameter("bn2_b", torch::zeros(500));
        torch::nn::init::xavier_uniform_(fc1->weight);
        torch::nn::init::xavier_uniform_(fc2->weight);
    }

@Chen0729
I guess the problem is here:

    bn1_b = register_parameter("bn1_b", torch::zeros(500));
    bn2_b = register_parameter("bn2_b", torch::zeros(500));

Both zero tensors are CPU tensors.

x was converted to the GPU, but bn1_b was not:

    x = torch::nn::functional::softplus(bn1(fc1(x)) + bn1_b);

Also, I am wondering what the error message is when you use assert? assert should be doable.

Thanks for your comment.

I really appreciate it.

Best,