C++; problems with inference on GPU [model developed in Python; traced/saved; run in C++]

I have a similar issue to Problems with inference on CPU on C++ (Expected object of backend CUDA but got backend CPU for argument #2 'weight'), but in reverse: my model runs on the CPU but not on the GPU. I’ve run other models fine on both the CPU and GPU, so I think the problem is specific to this traced model, an SSD-based object detector. I think I’ve narrowed down the problem, but I still have some questions.

The model input is torch::randn({ 1, 3, 300, 300 }) and the output is a tuple of boxes and scores.
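
For reference, the Python-side trace/save step is roughly this (a sketch; my_detector and create_ssd are placeholder names standing in for my actual model code):

import torch
from my_detector import create_ssd   # placeholder import; stands in for however the SSD is built

model = create_ssd(is_test=True)      # hypothetical factory for the SSD-based detector
example = torch.randn(1, 3, 300, 300)
traced = torch.jit.trace(model, example)
traced.save("ssd300.pt")

The std::runtime_error output is: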

expected type CUDAFloatType but got CPUFloatType (compute_types at ..\aten\src\ATen\native\TensorIterator.cpp:130)
(no backtrace available):
operation failed in interpreter:
  location10 = torch.contiguous(_31)
  _32 = ops.prim.NumToTensor(torch.size(location10, 0))
  location11 = torch.view(location10, [int(_32), -1, 4])
  input97 = torch.cat([_2, _7, _13, _19, _25, confidence11], 1)
  locations = torch.cat([_4, _10, _16, _22, _28, location11], 1)
  _33 = torch.softmax(input97, 2)
  priors = torch.unsqueeze(CONSTANTS.c0, 0)
  _34 = torch.mul(torch.slice(locations, 2, 0, 2, 1), CONSTANTS.c1)
  _35 = torch.slice(priors, 2, 2, 9223372036854775807, 1)
  _36 = torch.add(torch.mul(_34, _35), torch.slice(priors, 2, 0, 2, 1), alpha=1)
                  ~~~~~~~~~ <--- HERE
  _37 = torch.slice(locations, 2, 2, 9223372036854775807, 1)
  _38 = torch.exp(torch.mul(_37, CONSTANTS.c2))
  _39 = torch.slice(priors, 2, 2, 9223372036854775807, 1)
  locations0 = torch.cat([_36, torch.mul(_38, _39)], 2)
  _40 = torch.slice(locations0, 2, 2, 9223372036854775807, 1)
  _41 = torch.sub(torch.slice(locations0, 2, 0, 2, 1), torch.div(_40, CONSTANTS.c3), alpha=1)
  _42 = torch.slice(locations0, 2, 2, 9223372036854775807, 1)
  _43 = torch.add(torch.slice(locations0, 2, 0, 2, 1), torch.div(_42, CONSTANTS.c3), alpha=1)
  return (_33, torch.cat([_41, _43], 2))

The line marked HERE corresponds to this section of the model’s forward() function:

    if priors.dim() + 1 == locations.dim():
        priors = priors.unsqueeze(0)
    return torch.cat([
        locations[..., :2] * center_variance * priors[..., 2:] + priors[..., :2],
        torch.exp(locations[..., 2:] * size_variance) * priors[..., 2:]
    ], dim=locations.dim() - 1)

So:

  • _34 = locations[..., :2] * center_variance
  • _35 = priors[..., 2:]
  • I can’t quite tell whether the error message means _34 or _35 is the one sitting on the CPU (a minimal repro is sketched after this list)
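
To pin that down, I can reproduce the same class of error outside the model (a standalone sketch, not the traced code; the shapes are made up and it assumes a CUDA device is available):

import torch

locations = torch.randn(1, 8732, 4, device="cuda")  # stands in for the tensor that did move to the GPU
priors = torch.randn(1, 8732, 4)                     # stands in for a tensor left behind on the CPU
try:
    out = locations[..., :2] * 0.1 * priors[..., 2:] + priors[..., :2]
except RuntimeError as e:
    print(e)  # CPU/CUDA mismatch analogous to the error in the trace

Whichever operand the message is actually pointing at, the trace above shows priors coming from CONSTANTS.c0, which suggests the priors tensor is the one stuck on the CPU.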

In __init__, the module does:

if device:
    self.device = device
else:
    self.device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
if is_test:  # here, is_test is true and device is CPU
    self.config = config   
    self.priors = config.priors  # have tried with/without .to(self.device)    
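
My understanding (possibly wrong, which is partly what I’m asking) is that module.to(device), whether called from Python or via the C++ Module::to, only moves registered parameters and buffers, and that tracing bakes a plain tensor attribute like self.priors into the graph as a constant (CONSTANTS.c0 above) on whatever device it was on at trace time. A minimal sketch of the register_buffer variant I’m considering instead (SSDPredictor and config are placeholders for my actual classes):

import torch
import torch.nn as nn

class SSDPredictor(nn.Module):  # placeholder name for my actual SSD module
    def __init__(self, config, is_test=False, device=None):
        super().__init__()
        if device:
            self.device = device
        else:
            self.device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
        if is_test:
            self.config = config
            # Registering priors as a buffer makes it part of the module's state,
            # so module.to(device) (Python or C++) should move it along with the weights.
            self.register_buffer("priors", config.priors)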

In C++, it’s the usual:

if (cuda) {
    if (module == nullptr) {
        printf("Null pointer converting to cuda.\n");
        exit(1);
    }
    timgf = timgf.to(at::Device(torch::kCUDA, gpuid));
    module->to(at::Device(torch::kCUDA, gpuid));
}
inputs.clear();
inputs.push_back(timgf);
output_tuple = module->forward(inputs);

My questions are:

  • Shouldn’t calling .to(at::kCUDA) on the module in C++ move all of the model’s stored tensors to the GPU? Or do the device placements made in the module’s __init__ take precedence?
  • Do I need to specify - or avoid specifying - tensor devices in __init__()?
  • More generally - what’s the best pattern for handling device placement of model tensors when JIT tracing and C++ inference are involved? (The Python-side check I’d like this to pass is sketched below.)
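
This is the Python-side check I’d like to be able to rely on before handing the traced model to C++ (a sketch; the file name is just an example):

import torch

module = torch.jit.load("ssd300.pt")
module.to("cuda:0")                               # same move the C++ code makes
x = torch.randn(1, 3, 300, 300, device="cuda:0")
out = module(x)                                   # the boxes/scores tuple
print([t.device for t in out])                    # expect every element on cuda:0

If the traced priors really are frozen on the CPU, I’d expect this to fail in the same way the C++ call does.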