What is the most efficient way to do multi-batch prediction in PyTorch?
I have a bunch of images (the Dogs vs Cats test set, to be precise) that I want to run prediction on. I call the following code in a loop over a DataLoader iterator with a batch size of 64 and store the results in a torch tensor. How should I efficiently collect all the results on the GPU and transfer them to the host?
# called in a loop over the DataLoader
def step(self, inputs):
    data, label = inputs  # ignore label
    outputs = self.model(data)
    _, preds = torch.max(outputs.data, 1)
    # preds, outputs are cuda tensors. Right?
    return preds, outputs

def predict(self, dataloader):
    for i, batch in enumerate(dataloader):
        pred, output = self.step(batch)
        # How can I collect these results efficiently without incurring a performance penalty?
Are GPU-to-host copies also affected by pinned memory? I was wondering if we could collect all the results on the GPU and transfer them to the CPU in one shot.
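Something like the sketch below is what I have in mind: pre-allocate a pinned (page-locked) host buffer and copy each batch's predictions into it asynchronously. The sizes and the dummy predictions are placeholders; only the pin_memory / non_blocking pattern is the point.

import torch

# Sketch only: sizes and the fake "predictions" are made up.
num_elements, batch_size = 256, 64
device = "cuda" if torch.cuda.is_available() else "cpu"

# Pinned host memory speeds up device-to-host copies and lets them
# run asynchronously with non_blocking=True.
host_preds = torch.empty(num_elements, dtype=torch.long,
                         pin_memory=torch.cuda.is_available())

for i in range(num_elements // batch_size):
    start, end = i * batch_size, (i + 1) * batch_size
    preds = torch.randint(0, 2, (batch_size,), device=device)  # stand-in for step()
    # asynchronous device-to-host copy into the pinned buffer
    host_preds[start:end].copy_(preds, non_blocking=True)

if torch.cuda.is_available():
    torch.cuda.synchronize()  # make sure all pending copies have finished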
As you said, these copies didn’t affect my runtime significantly. My network takes 250 ms. But I was wondering if there is a better way to do this.
Right now I have a torch tensor pre-allocated with the total number of elements, and in each iteration I index from n to n + batch_size into that tensor and store the values. I hope I am not doing anything wrong with that.
def predict(self, dataloader):
    num_elements = len(dataloader.dataset)
    num_batches = len(dataloader)
    batch_size = dataloader.batch_size
    predictions = torch.zeros(num_elements)
    for i, batch in enumerate(dataloader):
        start = i * batch_size
        end = start + batch_size
        if i == num_batches - 1:
            # the last batch may be smaller than batch_size
            end = num_elements
        pred, output = self.step(batch)
        predictions[start:end] = pred
    return predictions
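For comparison, the variant I was asking about, collecting everything on the GPU and transferring once at the end, would look roughly like the sketch below (reusing self.step from above; the method name and device argument are mine):

def predict_gpu(self, dataloader, device="cuda"):
    # Sketch: keep the prediction buffer on the GPU and move it to the
    # host with a single copy after the loop. Assumes self.step returns
    # CUDA tensors, as in the code above.
    num_elements = len(dataloader.dataset)
    batch_size = dataloader.batch_size
    predictions = torch.zeros(num_elements, dtype=torch.long, device=device)
    with torch.no_grad():
        for i, batch in enumerate(dataloader):
            start = i * batch_size
            end = min(start + batch_size, num_elements)
            pred, output = self.step(batch)
            predictions[start:end] = pred  # stays on the GPU, no copy here
    return predictions.cpu()               # single device-to-host transfer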
I am also looking for a more efficient way to make predictions on the entire validation or test dataset. The way I did it before was like this:
import torch
import torch.nn.functional as F

def pytorch_predict(model, test_loader, device):
    '''
    Make predictions from a PyTorch model
    '''
    # set the model to evaluation mode
    model.eval()

    y_true = torch.tensor([], dtype=torch.long, device=device)
    all_outputs = torch.tensor([], device=device)

    # deactivate the autograd engine to reduce memory usage and speed up computations
    with torch.no_grad():
        for data in test_loader:
            inputs = [i.to(device) for i in data[:-1]]
            labels = data[-1].to(device)

            outputs = model(*inputs)
            y_true = torch.cat((y_true, labels), 0)
            all_outputs = torch.cat((all_outputs, outputs), 0)

    y_true = y_true.cpu().numpy()
    _, y_pred = torch.max(all_outputs, 1)
    y_pred = y_pred.cpu().numpy()
    y_pred_prob = F.softmax(all_outputs, dim=1).cpu().numpy()

    return y_true, y_pred, y_pred_prob
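One thing I have been meaning to try: since torch.cat inside the loop re-allocates and copies the growing tensor on every iteration, collecting the per-batch outputs in Python lists and concatenating once after the loop should avoid that. A sketch of that variant, under the same assumptions as the function above:

import torch
import torch.nn.functional as F

def pytorch_predict_v2(model, test_loader, device):
    '''
    Same idea as pytorch_predict, but concatenate once after the loop
    instead of calling torch.cat on a growing tensor every iteration.
    '''
    model.eval()
    label_batches, output_batches = [], []
    with torch.no_grad():
        for data in test_loader:
            inputs = [i.to(device) for i in data[:-1]]
            labels = data[-1].to(device)
            output_batches.append(model(*inputs))
            label_batches.append(labels)

    all_outputs = torch.cat(output_batches, 0)
    y_true = torch.cat(label_batches, 0).cpu().numpy()
    y_pred = all_outputs.argmax(dim=1).cpu().numpy()
    y_pred_prob = F.softmax(all_outputs, dim=1).cpu().numpy()
    return y_true, y_pred, y_pred_prob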
I understand that transferring data between the GPU and CPU is costly, so I decided to do it all at once. However, I am not sure whether transferring step by step is faster than transferring in one shot.
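A rough way to check would be to time both patterns on synthetic data, something like the sketch below (the numbers will obviously depend on the model, batch size, and hardware):

import time
import torch

if torch.cuda.is_available():
    device = torch.device("cuda")
    # synthetic per-batch outputs standing in for the model's predictions
    batches = [torch.randn(64, 2, device=device) for _ in range(200)]

    # pattern 1: copy every batch to the CPU as it is produced
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    per_batch = [b.argmax(dim=1).cpu() for b in batches]
    torch.cuda.synchronize()
    t1 = time.perf_counter()

    # pattern 2: gather everything on the GPU, copy once at the end
    one_shot = torch.cat([b.argmax(dim=1) for b in batches], 0).cpu()
    torch.cuda.synchronize()
    t2 = time.perf_counter()

    print(f"per-batch copies: {t1 - t0:.4f} s, one-shot copy: {t2 - t1:.4f} s")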
From your code, you allocate memory for the prediction array first; however, I think you still need to transfer your data from GPU to CPU if your model is on the GPU and the outputs are also produced on the GPU.
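For example, if predictions lives on the CPU while pred comes back from the GPU, each assignment in the loop is itself a small device-to-host transfer; you can make that explicit with .cpu() (toy tensors below, just to illustrate the point):

import torch

if torch.cuda.is_available():
    # toy example: the CPU buffer receives GPU results batch by batch,
    # so every assignment is a small device-to-host copy
    predictions = torch.zeros(64, dtype=torch.long)    # on the CPU
    pred = torch.randint(0, 2, (64,), device="cuda")   # on the GPU
    predictions[:] = pred.cpu()                        # explicit copy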