I have a similar issue to Problems with inference on CPU on C++ (Expected object of backend CUDA but got backend CPU for argument #2 'weight'), but in reverse: my model runs on the CPU but not on the GPU. I’ve run other models fine on both CPU and GPU, so I think the problem is specific to this traced model, an SSD-based object detector. I think I’ve narrowed down the problem, but I still have a question.
The model input is torch::randn({ 1, 3, 300, 300 }) and the output is a tuple (of boxes and scores). The std::runtime_error message is:
expected type CUDAFloatType but got CPUFloatType (compute_types at ..\aten\src\ATen\native\TensorIterator.cpp:130)
(no backtrace available):
operation failed in interpreter:
location10 = torch.contiguous(_31)
_32 = ops.prim.NumToTensor(torch.size(location10, 0))
location11 = torch.view(location10, [int(_32), -1, 4])
input97 = torch.cat([_2, _7, _13, _19, _25, confidence11], 1)
locations = torch.cat([_4, _10, _16, _22, _28, location11], 1)
_33 = torch.softmax(input97, 2)
priors = torch.unsqueeze(CONSTANTS.c0, 0)
_34 = torch.mul(torch.slice(locations, 2, 0, 2, 1), CONSTANTS.c1)
_35 = torch.slice(priors, 2, 2, 9223372036854775807, 1)
_36 = torch.add(torch.mul(_34, _35), torch.slice(priors, 2, 0, 2, 1), alpha=1)
~~~~~~~~~ <--- HERE
_37 = torch.slice(locations, 2, 2, 9223372036854775807, 1)
_38 = torch.exp(torch.mul(_37, CONSTANTS.c2))
_39 = torch.slice(priors, 2, 2, 9223372036854775807, 1)
locations0 = torch.cat([_36, torch.mul(_38, _39)], 2)
_40 = torch.slice(locations0, 2, 2, 9223372036854775807, 1)
_41 = torch.sub(torch.slice(locations0, 2, 0, 2, 1), torch.div(_40, CONSTANTS.c3), alpha=1)
_42 = torch.slice(locations0, 2, 2, 9223372036854775807, 1)
_43 = torch.add(torch.slice(locations0, 2, 0, 2, 1), torch.div(_42, CONSTANTS.c3), alpha=1)
return (_33, torch.cat([_41, _43], 2))
The HERE marker points to this section of the model’s forward() function:
if priors.dim() + 1 == locations.dim():
    priors = priors.unsqueeze(0)
return torch.cat([
    locations[..., :2] * center_variance * priors[..., 2:] + priors[..., :2],
    torch.exp(locations[..., 2:] * size_variance) * priors[..., 2:]
], dim=locations.dim() - 1)
So:
- _34 = locations[…, :2] * center_variance
- _35 = priors[…, 2:]
- I can’t quite tell whether the error message means _34 or _35 is the tensor on the CPU (a minimal repro of the mismatch is sketched below)
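If I’m reading the graph right, priors comes from CONSTANTS.c0, i.e. a tensor that was baked into the trace on the CPU, while locations follows the input onto the GPU. The mismatch is easy to reproduce outside the model; a minimal sketch (assumes a CUDA device is available):

import torch

a = torch.randn(1, 4, device="cuda:0")  # like locations after module->to(kCUDA)
b = torch.randn(1, 4)                   # CPU tensor, like a constant baked in at trace time
c = a * b                               # raises the same CUDAFloatType/CPUFloatType RuntimeError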
In init, the module is doing:
if device:
    self.device = device
else:
    self.device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
if is_test:  # here, is_test is true and device is CPU
    self.config = config
    self.priors = config.priors  # have tried with/without .to(self.device)
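As far as I can tell, priors here is a plain tensor attribute rather than a registered buffer. One pattern I’ve seen suggested, though I haven’t confirmed it for traced models, is register_buffer, so that .to() moves the tensor together with the weights. A sketch, with the class name invented for illustration:

import torch

class SSDPostprocess(torch.nn.Module):  # hypothetical name
    def __init__(self, config, device=None):
        super().__init__()
        self.device = device or torch.device(
            "cuda:0" if torch.cuda.is_available() else "cpu")
        self.config = config
        # a registered buffer is part of the module's state, so
        # module.to(device) in Python, or module->to(...) in C++,
        # should move it along with the parameters
        self.register_buffer("priors", config.priors)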
In C++, it’s the usual:
if (cuda) {
    if (module == nullptr) {  // check before dereferencing
        printf("Null pointer converting to cuda.\n"); exit(1);
    }
    timgf = timgf.to(at::Device(torch::kCUDA, gpuid));
    module->to(at::Device(torch::kCUDA, gpuid));
}
inputs.clear();
inputs.push_back(timgf);
output_tuple = module->forward(inputs);
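One way to rule the C++ side in or out would be to load the traced file in Python and run it on the GPU there. A sketch (the filename is illustrative):

import torch

m = torch.jit.load("ssd_traced.pt")  # illustrative path
m.to(torch.device("cuda:0"))
x = torch.randn(1, 3, 300, 300, device="cuda:0")
scores, boxes = m(x)  # the same error here would mean the trace itself is at fault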
My questions are:
- In C++, shouldn’t module->to(at::kCUDA) move all of the model’s stored tensors to the GPU? Or do the device placements made in the module’s init override that?
- Do I need to specify, or avoid specifying, device placements in init()?
- More generally, what’s the best pattern for handling device placement of model tensors when JIT tracing / C++ inference is involved?
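For completeness, one workaround I’ve been considering is moving priors onto locations’ device inside forward() before the arithmetic. A sketch of the decode step as a standalone function (I’m not sure tracing records the .to() in a device-agnostic way, so this may just bake in a different device):

import torch

def decode_locations(locations, priors, center_variance, size_variance):
    # force priors onto whatever device locations lives on
    priors = priors.to(locations.device)
    if priors.dim() + 1 == locations.dim():
        priors = priors.unsqueeze(0)
    return torch.cat([
        locations[..., :2] * center_variance * priors[..., 2:] + priors[..., :2],
        torch.exp(locations[..., 2:] * size_variance) * priors[..., 2:]
    ], dim=locations.dim() - 1)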