I think it happens because you are trying to access a value of a vector stored on the device from the host. Run the same operation inside a kernel, or copy top_idxs to the CPU first, and tell me what happens, please.
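For reference, a minimal sketch of the CPU-copy idea, written with the Python API purely for illustration (your code may well be using the C++ frontend, where the analogue is .to(torch::kCPU)); the names and shapes here are assumptions, not your actual code:

import torch

# hypothetical example: scores stands in for whatever produced top_idxs
scores = torch.randn(1, 1000, device="cuda")          # result living on the GPU
top_vals, top_idxs = torch.topk(scores, k=5, dim=1)   # indices are also on the GPU

top_idxs_cpu = top_idxs.cpu()                          # copy to host before reading values
print(top_idxs_cpu[0].tolist())                        # safe host-side access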
Are you sure that you are loading a pretrained model? And that you are properly loading its weights?
Another side question: why are you using the C++ API? If you want to do inference with deep CNN models, the Python overhead may not be significant, since the bottleneck will probably be the convolutions, which are handled by cuDNN in both APIs (assuming that you use a GPU for inference).
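One way to check this for your own model is the built-in autograd profiler; a minimal sketch (resnet50 and the input shape are just placeholders for whatever you are running):

import torch
import torchvision.models as models

model = models.resnet50(pretrained=True).cuda().eval()
x = torch.randn(8, 3, 224, 224).cuda()

# profile one forward pass; most of the CUDA time should land in convolution kernels
with torch.autograd.profiler.profile(use_cuda=True) as prof:
    model(x)
print(prof.key_averages().table(sort_by="cuda_time_total"))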
I am using pretrained models from torchvision.models, so I believe it should be doing the right thing.
I have been using the PyTorch Python API for the past few months. Two reasons why I am exploring the C++ API:
(a) I was under the impression it would be faster. But, based on your comments, it shouldn't be that different from the Python API.
(b) For deployment in production.
In fact, from my experiments I see that the C++ API is slower than the Python API for some models I tried:
import time
import torch

# model, batch_Tensor, network, batch_size and dev are defined earlier in the script
dry_run = 5       # use 5 iterations to warm up
num_batches = 10

for i in range(dry_run + num_batches):
    if i == dry_run:
        tic = time.time()
    batch_TensorTemp = torch.autograd.Variable(batch_Tensor)
    if dev == "gpu":
        # move tensor to GPU
        batch_TensorTemp = batch_TensorTemp.cuda()
    output_batch = model(batch_TensorTemp)
    if dev == "gpu":
        # move output to CPU
        output_batch = output_batch.data.cpu()
end = time.time()

print("Network: {}, Batch-size: {}, Images/Sec: {}\n".format(network, batch_size, (num_batches * batch_size / (end - tic))))