Say we have 30K records and the dataloader feeds them to a V100 in batches of size 32, with num_workers=8.
The setup is as follows:
# model init
# dataset init
for (id_batch, img_batch) in data_loader:
    probs = mysterious_function(img_batch)
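For reference, the loader is set up along these lines; PlaceholderDataset is only a stand-in for the real NFS-backed dataset that reads, decodes and resizes the images:

import torch
from torch.utils.data import Dataset, DataLoader

class PlaceholderDataset(Dataset):
    """Stand-in for the real NFS-backed dataset (read/decode/resize)."""
    def __len__(self):
        return 30000
    def __getitem__(self, idx):
        # (id, image) pairs, matching the (id_batch, img_batch) unpacking above
        return idx, torch.randn(3, 224, 224)

data_loader = DataLoader(PlaceholderDataset(), batch_size=32, num_workers=8, shuffle=False)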
The forward pass happens inside mysterious_function:
def mysterious_function(image_batch):
    # move the batch to the GPU; inference only, so no gradients are needed
    image_batch_gpu_var = Variable(torch.squeeze(image_batch), requires_grad=False).type(self.dtype_float).cuda(device=self.gpu_id)
    # forward pass through the segmentation network
    prob_predict_gpu = self.seg_net(image_batch_gpu_var)['probs']
    # copy the predictions back to the host as a numpy array
    prob_predict_res = prob_predict_gpu.data.cpu().numpy()
    return prob_predict_res
The time tracker computes the execution time of the decorated function and logs the result to stdout as soon as the function returns. Here is the output of the time tracker:
mysterious_function running time on batch 0 0.141 seconds
mysterious_function running time on batch 1 2.420 seconds
mysterious_function running time on batch 2 0.106 seconds
mysterious_function running time on batch 3 2.391 seconds
mysterious_function running time on batch 4 0.113 seconds
mysterious_function running time on batch 5 2.341 seconds
mysterious_function running time on batch 6 0.117 seconds
mysterious_function running time on batch 7 2.393 seconds
mysterious_function running time on batch 8 0.122 seconds
mysterious_function running time on batch 9 2.257 seconds
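For context, time_tracker is just a wall-clock decorator, roughly along these lines (a minimal sketch; _log is assumed to be a standard logging.Logger wired to stdout, and the real implementation may differ):

import functools
from time import time

def time_tracker(log):
    # minimal sketch: log the wall-clock time of each call, tagged with a running batch counter
    def decorator(func):
        counter = {'batch': 0}
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time()
            result = func(*args, **kwargs)
            log.info('{} running time on batch {} {:.3f} seconds'.format(
                func.__name__, counter['batch'], time() - start))
            counter['batch'] += 1
            return result
        return wrapper
    return decorator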
Curious whether this had anything to do with the dataloader's read/decode/resize work over NFS, I ran the forward pass twice on the same batch of data:
from time import time

def mysterious_function(image_batch):
    cp1 = time()
    image_batch_gpu_var = Variable(torch.squeeze(image_batch), requires_grad=False).type(self.dtype_float).cuda(device=self.gpu_id)
    cp2 = time()
    print('moving image batch of shape {} to GPU {} seconds'.format(image_batch_gpu_var.shape, cp2 - cp1))
    prob_predict_gpu = self.seg_net(image_batch_gpu_var)['probs']
    cp3 = time()
    print('model forward pass 1 of 2 takes {} seconds'.format(cp3 - cp2))
    prob_predict_gpu = self.seg_net(image_batch_gpu_var)['probs']
    cp4 = time()
    print('model forward pass 2 of 2 takes {} seconds'.format(cp4 - cp3))
    prob_predict_res = prob_predict_gpu.data.cpu().numpy()
    return prob_predict_res
mysterious function running time 1.303
moving image batch of shape torch.Size([32, 3, 224, 224]) to GPU 0.0034639835357666016 seconds
model forward pass 1 of 2 takes 0.12161993980407715 seconds
model forward pass 2 of 2 takes 1.1894159317016602 seconds
mysterious function running time 1.317
moving image batch of shape torch.Size([32, 3, 224, 224]) to GPU 0.003448963165283203 seconds
model forward pass 1 of 2 takes 0.1061561107635498 seconds
model forward pass 2 of 2 takes 1.1635644435882568 seconds
mysterious function running time 1.275
moving image batch of shape torch.Size([32, 3, 224, 224]) to GPU 0.0034742355346679688 seconds
model forward pass 1 of 2 takes 0.10687828063964844 seconds
model forward pass 2 of 2 takes 0.8193213939666748 seconds
mysterious function running time 0.932
moving image batch of shape torch.Size([32, 3, 224, 224]) to GPU 0.0034782886505126953 seconds
model forward pass 1 of 2 takes 0.09704208374023438 seconds
model forward pass 2 of 2 takes 1.1900320053100586 seconds
mysterious function running time 1.293
moving image batch of shape torch.Size([32, 3, 224, 224]) to GPU 0.003448486328125 seconds
segnet forward pass 1 of 2 takes 0.13462281227111816 seconds
segnet forward pass 2 of 2 takes 1.1501150131225586 seconds
mysterious function running time 1.291
moving image batch of shape torch.Size([32, 3, 224, 224]) to GPU 0.00835871696472168 seconds
segnet forward pass 1 of 2 takes 0.09815263748168945 seconds
segnet forward pass 2 of 2 takes 1.0340092182159424 seconds
mysterious function running time 1.143
moving image batch of shape torch.Size([32, 3, 224, 224]) to GPU 0.0038101673126220703 seconds
segnet forward pass 1 of 2 takes 0.1068427562713623 seconds
segnet forward pass 2 of 2 takes 1.1246507167816162 seconds
mysterious function running time 1.238
moving image batch of shape torch.Size([32, 3, 224, 224]) to GPU 0.003797292709350586 seconds
segnet forward pass 1 of 2 takes 0.10389494895935059 seconds
segnet forward pass 2 of 2 takes 1.2108190059661865 seconds
As you can see, the second forward pass takes almost 10x longer than the first, on the very same batch.
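Side note: CUDA kernels launch asynchronously, so wall-clock checkpoints like the ones above can attribute GPU work to the wrong statement (the first pass may simply return before its kernels have finished, leaving the second pass to absorb the wait). A synchronized timing harness would rule this in or out; the sketch below is untested on this setup, and seg_net / image_batch_gpu stand in for the attributes used above:

from time import time
import torch

def timed_forward(seg_net, image_batch_gpu):
    torch.cuda.synchronize()   # drain any previously launched GPU work before starting the clock
    start = time()
    probs_gpu = seg_net(image_batch_gpu)['probs']
    torch.cuda.synchronize()   # wait for this forward pass's kernels to actually finish
    print('forward pass (GPU work included) {:.3f} seconds'.format(time() - start))
    return probs_gpu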
So, back to a single forward pass: I also tried freeing up CUDA memory after the forward pass, but it did not help:
@time_tracker(_log)
def mysterious_function(image_batch):
    image_batch_gpu_var = Variable(torch.squeeze(image_batch), requires_grad=False).type(self.dtype_float).cuda(device=self.gpu_id)
    prob_predict_gpu = self.seg_net(image_batch_gpu_var)['probs']
    # release cached blocks back to the driver after the forward pass
    torch.cuda.empty_cache()
    prob_predict_res = prob_predict_gpu.data.cpu().numpy()
    return prob_predict_res
mysterious function running time on batch 0 0.729 seconds
mysterious function running time on batch 1 2.190 seconds
mysterious function running time on batch 2 0.171 seconds
mysterious function running time on batch 3 1.912 seconds
mysterious function running time on batch 4 0.152 seconds
mysterious function running time on batch 5 2.329 seconds
Something seems to be wrong with the odd-numbered batches.
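One further experiment that would separate dataloader time from GPU time entirely is to cache a single batch and loop the forward pass over it with explicit synchronization, so neither NFS reads nor worker scheduling can leak into the measurement. A minimal sketch, untested here, with seg_net standing in for self.seg_net and the dtype/device handling simplified:

from time import time
import torch
from torch.autograd import Variable

# grab one batch up front so the dataloader is completely out of the loop
_, cached_batch = next(iter(data_loader))
cached_batch_gpu = Variable(torch.squeeze(cached_batch), requires_grad=False).float().cuda()

for i in range(10):
    torch.cuda.synchronize()   # drain pending GPU work before starting the clock
    start = time()
    probs = seg_net(cached_batch_gpu)['probs']
    torch.cuda.synchronize()   # wait for this forward pass's kernels to finish
    print('iteration {} forward pass {:.3f} seconds'.format(i, time() - start))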