Efficient method to gather all predictions

What is the most efficient way to do a multi batch prediction in PyTorch?

I have a bunch of images (the Dogs vs. Cats test set, to be precise) that I want to run prediction on. I call the following code in a loop over a DataLoader iterator with a batch size of 64 and store the result in a torch tensor. How should I efficiently collect all the results on the GPU and transfer them to the host?

# Called in a loop over the DataLoader
def step(self, inputs):
    data, label = inputs  # ignore the label
    outputs = self.model(data)
    _, preds = torch.max(outputs.data, 1)
    # preds and outputs are CUDA tensors. Right?
    return preds, outputs

def predict(self, dataloader):
    for i, batch in enumerate(dataloader):
        pred, output = self.step(batch)
        # How to collect these results efficiently without incurring a performance penalty?

I think it should be reasonably efficient to call .cpu() on pred and put it in a list.

prediction_list = []
def predict(self, dataloader):
    for i, batch in enumerate(dataloader):
        pred, output = self.step(batch)
        prediction_list.append(pred.cpu())
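
If you need a single tensor at the end, you can concatenate the per-batch results once the loop finishes (all_predictions here is just an illustrative name):

all_predictions = torch.cat(prediction_list, 0)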

A more extreme option is to use CUDA pinned memory on the CPU, see http://pytorch.org/docs/master/notes/cuda.html?highlight=pinned#best-practices
However, in your use case I'm not sure you'll gain much with that.
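
For completeness, a minimal sketch of what that could look like inside the loop above (batch_size is an assumption here, and since non_blocking copies are asynchronous you need to synchronize before the host reads the buffer):

pinned = torch.empty(batch_size, dtype=torch.long).pin_memory()  # page-locked host buffer
pinned[:pred.size(0)].copy_(pred, non_blocking=True)  # async device-to-host copy
torch.cuda.synchronize()  # make sure the copy has finished before reading it
prediction_list.append(pinned[:pred.size(0)].clone())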

Are GPU-to-host copies also affected by pinned memory? I was wondering if we could collect all the results on the GPU and transfer them to the CPU in one shot.
As you said, these copies didn't affect my run time significantly; my network takes 250 ms. But I was wondering if there was a better way to do this.

Right now I have a torch tensor pre-allocated with the total number of elements, and in each iteration I index from n to n + batch_size on that tensor and store the values. I hope I am not doing anything wrong with that.

def predict(self, dataloader):
    num_elements = len(dataloader.dataset)
    num_batches = len(dataloader)
    batch_size = dataloader.batch_size
    # pre-allocate on the CPU; each slice assignment below copies one batch from the GPU
    predictions = torch.zeros(num_elements)
    for i, batch in enumerate(dataloader):
        start = i * batch_size
        end = start + batch_size
        if i == num_batches - 1:
            end = num_elements  # the last batch may be smaller than batch_size
        pred, output = self.step(batch)
        predictions[start:end] = pred
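
Alternatively, to collect everything on the GPU first (as I wondered above) and do a single transfer at the end, something like this sketch could work (the device argument is just for illustration, not benchmarked):

def predict(self, dataloader, device):
    num_elements = len(dataloader.dataset)
    batch_size = dataloader.batch_size
    # pre-allocate on the GPU so the loop never touches host memory
    predictions = torch.zeros(num_elements, device=device)
    for i, batch in enumerate(dataloader):
        start = i * batch_size
        end = min(start + batch_size, num_elements)
        pred, output = self.step(batch)
        predictions[start:end] = pred
    # single device-to-host transfer after the loop
    return predictions.cpu()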

I am indexing from n to n + batch_size on that tensor and storing the values.

This seems fine.

I am also looking for a more efficient way to make predictions on an entire validation or test dataset. The way I did it before was like this:

import torch
import torch.nn.functional as F

def pytorch_predict(model, test_loader, device):
    '''
    Make predictions from a PyTorch model
    '''
    # set model to evaluation mode
    model.eval()

    y_true = torch.tensor([], dtype=torch.long, device=device)
    all_outputs = torch.tensor([], device=device)

    # deactivate the autograd engine to reduce memory usage and speed up computations
    with torch.no_grad():
        for data in test_loader:
            inputs = [i.to(device) for i in data[:-1]]
            labels = data[-1].to(device)

            outputs = model(*inputs)
            y_true = torch.cat((y_true, labels), 0)
            all_outputs = torch.cat((all_outputs, outputs), 0)

    y_true = y_true.cpu().numpy()
    _, y_pred = torch.max(all_outputs, 1)
    y_pred = y_pred.cpu().numpy()
    y_pred_prob = F.softmax(all_outputs, dim=1).cpu().numpy()

    return y_true, y_pred, y_pred_prob

I understand that transferring data between the GPU and CPU is costly, so I decided to do it all at once. However, I am not sure whether transferring batch by batch is faster than transferring everything in one shot.
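
A quick way to check would be to time both variants directly. A rough sketch (random tensors stand in for real model outputs here, so the numbers are only indicative):

import time
import torch

device = torch.device('cuda')
batches = [torch.randn(64, 10, device=device) for _ in range(100)]  # stand-in for 100 batches of outputs

# variant 1: copy each batch to the CPU as it is produced
torch.cuda.synchronize()
t0 = time.time()
per_batch = [b.cpu() for b in batches]
torch.cuda.synchronize()
print('per-batch transfer: %.4f s' % (time.time() - t0))

# variant 2: concatenate on the GPU and transfer once
torch.cuda.synchronize()
t0 = time.time()
one_shot = torch.cat(batches, 0).cpu()
torch.cuda.synchronize()
print('one-shot transfer: %.4f s' % (time.time() - t0))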

From your code, you allocate memory for the prediction tensor first; however, I think you still need to transfer the data from GPU to CPU, since your model is on the GPU and its outputs are also produced there.