Code runs fine on CPU, and on GPU it trains but gives a seg fault at the end of the run

Hi,

I am trying to write a simple NN example in libtorch, and I was able to run the code successfully on a CPU. However, after changing torch::Device to kCUDA, the code runs on the GPU but throws a segmentation fault after the run finishes. Could someone please help me?
I am attaching my code below:

#include <torch/torch.h>
#include <iostream>
#include <cmath>
#include <cstdio>

//The batch size for training
const int64_t N = 64;
//The input dimension
const int64_t D_in = 1000;
//The hidden dimension
const int64_t H = 100;
//The output dimension
const int64_t D_out = 10;
//Total number of steps
const int64_t tstep = 1000;

using namespace torch;
using namespace std;

struct TwoLayerNetImpl : nn::Module {
  TwoLayerNetImpl(int D_in, int H, int D_out)
      : linear1(nn::LinearOptions(D_in, H).bias(false)),
        linear2(nn::LinearOptions(H, D_out).bias(false))
  {
    register_module("linear1", linear1);
    register_module("linear2", linear2);
  }

  torch::Tensor forward(torch::Tensor x)
  {
    x = torch::clamp_min(linear1(x), 0);
    x = linear2(x);
    return x;
  }

  nn::Linear linear1, linear2;
};
TORCH_MODULE(TwoLayerNet);

int main(int argc, char* argv[])
{
  torch::manual_seed(1);
  torch::Device device(torch::kCUDA);      // change to kCPU to run on the host
  if (torch::cuda::is_available())
  {
    std::cout << "CUDA is available! Training on GPU" << std::endl;
  }

  torch::Tensor X = torch::randn({N, D_in}).to(device);
  torch::Tensor Y = torch::randn({N, D_out}).to(device);

  TwoLayerNet net(D_in, H, D_out);
  net->to(device);

  torch::optim::Adam optimizer(net->parameters(), torch::optim::AdamOptions(1e-4));
  torch::nn::MSELoss criterion((torch::nn::MSELossOptions(torch::kSum)));

  for (int64_t ts = 0; ts <= tstep; ++ts)
  {
    torch::Tensor Y_pred = net->forward(X);
    torch::Tensor loss = criterion(Y_pred, Y);
    if (ts % 100 == 0)
    {
      printf("\r[%4ld/%4ld] | D_loss: %e \n",
             static_cast<long>(ts), static_cast<long>(tstep), loss.item<float>());
    }
    optimizer.zero_grad();
    loss.backward();
    optimizer.step();
  }
  std::cout << "Training complete!" << std::endl;
  return 0;
}

Thank you!

Can you tell us which libtorch version (including which CUDA version it was built for) you are using?
What is your local CUDA version?
Can you paste the error message?
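
For anyone else who needs to gather this information, here is a minimal sketch (not from the original post, using only the standard libtorch C++ API) that prints what the installed binary reports about CUDA at runtime. The libtorch version itself can usually be found in the build-version file at the top level of the downloaded archive.

// Minimal sketch: query what the installed libtorch binary reports about CUDA.
#include <torch/torch.h>
#include <iostream>

int main()
{
  std::cout << std::boolalpha;
  std::cout << "CUDA available:  " << torch::cuda::is_available() << std::endl;
  std::cout << "CUDA devices:    " << torch::cuda::device_count() << std::endl;
  std::cout << "cuDNN available: " << torch::cuda::cudnn_is_available() << std::endl;
  return 0;
}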

Hi,

I am using pytorch/v1.5.0-gpu; I downloaded the stable release of libtorch.
I was using CUDA 10.1.243 yesterday and it was giving me a segmentation fault.
However, after switching to CUDA 10.2.89 and recompiling my code, I no longer get the error.
I apologize for the false alarm.

Thank you

@neilmehta87

Glad to hear that. One possible reason is that the libtorch you downloaded was built for CUDA 10.2.
We provide different libtorch binaries for different CUDA versions.
You can find them here: https://pytorch.org/
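
As a defensive pattern (a sketch, not part of the original code), selecting the device at runtime avoids hard-coding kCUDA; if the downloaded libtorch binary does not match the local CUDA installation, torch::cuda::is_available() typically returns false and the program falls back to the CPU instead of crashing.

// Sketch: pick the device at runtime and create tensors directly on it.
#include <torch/torch.h>
#include <iostream>

int main()
{
  torch::Device device = torch::cuda::is_available() ? torch::Device(torch::kCUDA)
                                                     : torch::Device(torch::kCPU);
  std::cout << "Running on: " << device << std::endl;

  // Creating the tensor on the target device avoids an extra host-side copy.
  torch::Tensor x = torch::randn({64, 1000}, device);
  std::cout << x.device() << std::endl;
  return 0;
}

This does not replace downloading the libtorch binary that matches your CUDA version, but it makes mismatches easier to diagnose.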

@glaringlee Thank you