[pytorch0.3.1] Forward pass takes 10x longer for every 2nd batch during inference

Say we have 30K records, and the dataloader is feeding batches of size 32 to a V100. The dataloader uses num_workers=8.
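For reference, a minimal sketch of how such a loader might be set up; DummySegDataset here is a hypothetical stand-in for the real NFS-backed dataset and its read/decode/resize logic:

import torch
from torch.utils.data import Dataset, DataLoader

class DummySegDataset(Dataset):
    # hypothetical stand-in for the real dataset of ~30K records on NFS
    def __init__(self, n_records=30000):
        self.n_records = n_records

    def __len__(self):
        return self.n_records

    def __getitem__(self, idx):
        img = torch.rand(3, 224, 224)   # the real code reads/decodes/resizes an image from NFS
        return idx, img

data_loader = DataLoader(DummySegDataset(), batch_size=32, num_workers=8, shuffle=False)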

The setup is as follows:

# model init
# dataset init
for (id_batch, img_batch) in data_loader:
    probs = mysterious_function(img_batch)

The forward pass happens inside mysterious_function:

def mysterious_function(image_batch):
    # move the batch to the GPU as a no-grad Variable (PyTorch 0.3.x style)
    image_batch_gpu_var = Variable(torch.squeeze(image_batch), requires_grad=False).type(self.dtype_float).cuda(device=self.gpu_id)
    # forward pass through the segmentation network
    prob_predict_gpu = self.seg_net(image_batch_gpu_var)['probs']
    # copy the predictions back to the CPU as a numpy array
    prob_predict_res = prob_predict_gpu.data.cpu().numpy()
    return prob_predict_res

The time tracker computes the execution time of the decorated function and logs the result right to stdout after it finishes execution. Here is the output of the time tracker:

 mysterious_function running time on batch 0 0.141 seconds
 mysterious_function running time on batch 1 2.420 seconds
 mysterious_function running time on batch 2 0.106 seconds
 mysterious_function running time on batch 3 2.391 seconds
 mysterious_function running time on batch 4 0.113 seconds
 mysterious_function running time on batch 5 2.341 seconds
 mysterious_function running time on batch 6 0.117 seconds
 mysterious_function running time on batch 7 2.393 seconds
 mysterious_function running time on batch 8 0.122 seconds
 mysterious_function running time on batch 9 2.257 seconds
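The time_tracker decorator itself is not shown in the post; a minimal sketch of what it might look like is below, where the logger argument and the exact message format are assumptions on my part:

import functools
from time import time

def time_tracker(log):
    # hypothetical timing decorator: logs the wall-clock time of each decorated call
    def decorator(fn):
        counter = {'batch': 0}
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time()
            result = fn(*args, **kwargs)
            elapsed = time() - start
            log('{} running time on batch {} {:.3f} seconds'.format(
                fn.__name__, counter['batch'], elapsed))
            counter['batch'] += 1
            return result
        return wrapper
    return decorator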

Curious whether this had anything to do with the dataloader read/decode/resize work from NFS, I ran the forward pass twice on the same batch of data:

def mysterious_function(image_batch):
    # time() below is the standard library timer: from time import time
    cp1 = time()
    image_batch_gpu_var = Variable(torch.squeeze(image_batch), requires_grad=False).type(self.dtype_float).cuda(device=self.gpu_id)
    cp2 = time()
    print('moving image batch of shape {} to GPU {} seconds'.format(image_batch_gpu_var.shape, cp2 - cp1))
    prob_predict_gpu = self.seg_net(image_batch_gpu_var)['probs']
    cp3 = time()
    print('model forward pass 1 of 2 takes {} seconds'.format(cp3 - cp2))
    prob_predict_gpu = self.seg_net(image_batch_gpu_var)['probs']
    cp4 = time()
    print('model forward pass 2 of 2 takes {} seconds'.format(cp4 - cp3))
    prob_predict_res = prob_predict_gpu.data.cpu().numpy()
    return prob_predict_res
mysterious function running time 1.303
moving image batch of shape torch.Size([32, 3, 224, 224]) to GPU 0.0034639835357666016 seconds
model forward pass 1 of 2 takes 0.12161993980407715 seconds
model forward pass 2 of 2 takes 1.1894159317016602 seconds
mysterious function running time 1.317
moving image batch of shape torch.Size([32, 3, 224, 224]) to GPU 0.003448963165283203 seconds
model forward pass 1 of 2 takes 0.1061561107635498 seconds
model forward pass 2 of 2 takes 1.1635644435882568 seconds
mysterious function running time 1.275
moving image batch of shape torch.Size([32, 3, 224, 224]) to GPU 0.0034742355346679688 seconds
model forward pass 1 of 2 takes 0.10687828063964844 seconds
model forward pass 2 of 2 takes 0.8193213939666748 seconds
mysterious function running time 0.932
moving image batch of shape torch.Size([32, 3, 224, 224]) to GPU 0.0034782886505126953 seconds
model forward pass 1 of 2 takes 0.09704208374023438 seconds
model forward pass 2 of 2 takes 1.1900320053100586 seconds
mysterious function running time 1.293
moving image batch of shape torch.Size([32, 3, 224, 224]) to GPU 0.003448486328125 seconds
model forward pass 1 of 2 takes 0.13462281227111816 seconds
model forward pass 2 of 2 takes 1.1501150131225586 seconds
mysterious function running time 1.291
moving image batch of shape torch.Size([32, 3, 224, 224]) to GPU 0.00835871696472168 seconds
model forward pass 1 of 2 takes 0.09815263748168945 seconds
model forward pass 2 of 2 takes 1.0340092182159424 seconds
mysterious function running time 1.143
moving image batch of shape torch.Size([32, 3, 224, 224]) to GPU 0.0038101673126220703 seconds
model forward pass 1 of 2 takes 0.1068427562713623 seconds
model forward pass 2 of 2 takes 1.1246507167816162 seconds
mysterious function running time 1.238
moving image batch of shape torch.Size([32, 3, 224, 224]) to GPU 0.003797292709350586 seconds
model forward pass 1 of 2 takes 0.10389494895935059 seconds
model forward pass 2 of 2 takes 1.2108190059661865 seconds

As you can see, the 2nd forward pass takes almost 10x more time than the first forward pass on the same batch.

So, back to a single forward pass: I tried to free up CUDA memory after the forward pass, but it did not help:

@time_tracker(_log)
def mysterious_function(image_batch):
    image_batch_gpu_var = Variable(torch.squeeze(image_batch), requires_grad=False).type(self.dtype_float).cuda(device=self.gpu_id)
    prob_predict_gpu = self.seg_net(image_batch_gpu_var)['probs'] 
    torch.cuda.empty_cache()
    prob_predict_res = prob_predict_gpu.data.cpu().numpy()
    return prob_predict_res
mysterious function running time on batch 0 0.729 seconds
mysterious function running time on batch 1 2.190 seconds
mysterious function running time on batch 2 0.171 seconds
mysterious function running time on batch 3 1.912 seconds
mysterious function running time on batch 4 0.152 seconds
mysterious function running time on batch 5 2.329 seconds

Something seems to be wrong with the odd-numbered batches.

Make sure you run your “inference” section inside a block like this:

with torch.set_grad_enabled(False):
    # your "inference" code here

Otherwise, it will keep building the autograd graph in the background and consume more and more memory.
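Applied to the loop from the original post, that would look roughly like this (PyTorch >= 0.4 semantics):

import torch

with torch.set_grad_enabled(False):   # or equivalently: with torch.no_grad():
    for (id_batch, img_batch) in data_loader:
        probs = mysterious_function(img_batch)   # no autograd graph is recorded here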

I should probably have mentioned this earlier: I am using PyTorch 0.3.1. Is there an equivalent?

Edit:

From the PyTorch 0.4.0 Migration Guide:

The volatile flag is now deprecated and has no effect. Previously, any computation that involves a Variable with volatile=True wouldn’t be tracked by autograd. This has now been replaced by a set of more flexible context managers including torch.no_grad(), torch.set_grad_enabled(grad_mode), and others.

Maybe I didn’t understand this fully, but how would setting volatile to True for the batch Variable disable the autograd graph build?

I think there’s some way to do it in PyTorch 0.3, but I don’t remember exactly since it has been quite a while since I used 0.3, sorry :stuck_out_tongue:

It may have been the volatile=True setting you mentioned. In any case, due to this and various other improvements, I really recommend switching to PyTorch 1.0 if you can.

EDIT: if you don’t want to make any big modifications to your 0.3 code, you could also call .backward() on the output after each forward pass during inference. It will probably slow down the inference (because you have an additional backprop step), but only by a constant factor, because backward() frees the graph after each call (so you don’t let the graph grow and grow). However, I really recommend upgrading and using the context managers.


Setting volatile=True and calling torch.cuda.empty_cache() did not help. The issue still persists.
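Concretely, the volatile attempt looked roughly like this (a sketch of the relevant lines inside the same method as before, not the exact code):

# PyTorch 0.3.x: a volatile input Variable propagates through the forward pass,
# so no autograd graph is built for it (the 0.3.x counterpart of torch.no_grad()).
image_batch_gpu_var = Variable(
    torch.squeeze(image_batch),
    volatile=True,
).type(self.dtype_float).cuda(device=self.gpu_id)
prob_predict_gpu = self.seg_net(image_batch_gpu_var)['probs']
torch.cuda.empty_cache()
prob_predict_res = prob_predict_gpu.data.cpu().numpy()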

Instead of 2, can you forward the same batch 10 times and check?
Your model might be doing something fishy. What does its forward function look like?
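Something like the following sketch would do it; the torch.cuda.synchronize() calls are there so that each measurement reflects the finished GPU work rather than just the asynchronous kernel launch (names are illustrative):

import time
import torch

def time_repeated_forward(seg_net, image_batch_gpu_var, n_repeats=10):
    # run the same forward pass n_repeats times and print each wall-clock time
    for i in range(n_repeats):
        torch.cuda.synchronize()      # wait for any pending GPU work
        start = time.time()
        _ = seg_net(image_batch_gpu_var)['probs']
        torch.cuda.synchronize()      # wait for this forward pass to finish
        print('forward pass {} of {} takes {:.4f} seconds'.format(
            i + 1, n_repeats, time.time() - start))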

I finally figured out a way around this issue. The problem was that we had a group of worker threads running on the CPU to collect the detached outputs from the GPU and perform some post-processing before saving them to disk.

Let me explain the complete setup:

  • Launch a set of worker threads which collect images from an asynchronous message queue and feed them into a torch.multiprocessing queue P.
  • Launch a process which dequeues items from P, prepares a dataloader, and feeds batches of these items to the GPU for inference.
  • Append <item_id, prob> pairs from each batch to a torch.multiprocessing queue Q.
  • Launch a set of postprocessing worker threads which collect the items from Q, do some manipulation on the CPU, and write them to disk (a condensed sketch of this pipeline follows the list).
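A condensed sketch of this pipeline, with the real model, message source, and disk I/O replaced by stubs; the queue names P and Q follow the description above, while the worker counts, sentinel handling, and stub functions are purely illustrative:

import threading
import torch
import torch.multiprocessing as mp

def run_forward(img_batch):
    # stand-in for moving the batch to the GPU and running seg_net on it
    return torch.rand(img_batch.shape[0])

def postprocess_worker(Q):
    # CPU-side postprocessing threads: this is the stage that competed with the GPU loop
    while True:
        item = Q.get()
        if item is None:
            break
        item_id, prob = item
        # ... manipulate prob on the CPU and write it to disk ...

def inference_process(P, Q, n_post_workers):
    # dequeue batches from P, run inference, push <item_id, prob> pairs onto Q
    while True:
        item = P.get()
        if item is None:
            break
        item_id, img_batch = item
        Q.put((item_id, run_forward(img_batch)))
    for _ in range(n_post_workers):
        Q.put(None)          # tell every postprocessing worker to stop

if __name__ == '__main__':
    P, Q = mp.Queue(), mp.Queue()
    n_post = 4
    post_threads = [threading.Thread(target=postprocess_worker, args=(Q,))
                    for _ in range(n_post)]
    for t in post_threads:
        t.start()
    proc = mp.Process(target=inference_process, args=(P, Q, n_post))
    proc.start()
    P.put((0, torch.rand(32, 3, 224, 224)))   # one dummy batch of images
    P.put(None)                               # sentinel: no more work
    proc.join()
    for t in post_threads:
        t.join()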

With the postprocessing worker threads, my inference rate was around 25 items per second on the V100 (10x longer for every other batch); when I removed these worker threads, the inference rate increased to 160 records per second. Please note that I then did the postprocessing manually before dequeuing the next dataset of items from P; this takes approximately 25% of the total time to process a dataset and is essentially idle time for the GPU, but I could not improve beyond that.

The story wasn't over yet :slight_smile:! After removing the postprocessing workers, I was able to get a 6x improvement in inference time, with high GPU memory usage and utilization as shown below. However, do you notice anything odd in the graph below?



The CPU usage keeps increasing over time, and before we know it the system is bound to crash, once again through no fault of the GPU. Fortunately, this can be resolved by a small trick: replacing prob_predict_gpu.data with prob_predict_gpu.detach().

.data is an old API and should not be used anymore. The detach() method constructs a new view on a tensor that is declared not to need gradients, so the subgraph involving that view is not tracked (nice example here: PyTorch .detach() method | B. Nikolic Software and Computing Blog).

def mysterious_function(image_batch):
    image_batch_gpu_var = Variable(torch.squeeze(image_batch), requires_grad=False).type(self.dtype_float).cuda(device=self.gpu_id)
    prob_predict_gpu = self.seg_net(image_batch_gpu_var)['probs']
    torch.cuda.empty_cache()
    prob_predict_res = prob_predict_gpu.detach().cpu().numpy()  # was: prob_predict_gpu.data.cpu().numpy()
    return prob_predict_res
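As a side note, a tiny standalone example of the difference between .data and detach() (PyTorch >= 0.4 semantics, independent of the code above):

import torch

x = torch.randn(3, requires_grad=True)
y = x * 2

d = y.detach()             # new view on y: shares storage, requires_grad is False
print(d.requires_grad)     # False; operations on d are not tracked by autograd

# y.data gives a similar untracked view, but in-place changes made through .data
# are not reported to autograd, so later gradients can be silently wrong;
# the same change made through detach() would be caught by autograd's checks.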

The final CPU and GPU utilization looked like this:




deleted since I started a new topic