GPU tensor access time overhead after resnet feature extraction

Hello everyone,
While using PyTorch’s fasterrcnn_resnet50_fpn I noticed that after passing a list of images through the ResNet backbone there is a time interval (e.g. ~0.22 sec for a batch of 8 images) during which any subsequent GPU tensor operation has to wait before it can complete. All tensors and models are on the GPU. Here is code to reproduce it:

from torchvision.models.detection import fasterrcnn_resnet50_fpn
import torch
import time

# load a pretrained Faster R-CNN and freeze the backbone / box head parameters
_backbone = fasterrcnn_resnet50_fpn(pretrained=True).cuda()
for name, param in _backbone.named_parameters():
    if name.startswith('backbone') or name.startswith('roi_heads.box_head'):
        param.requires_grad = False
backbone = _backbone.backbone
transform = _backbone.transform
images = [torch.rand(3, 768, 1024) for _ in range(8)]
foo = torch.tensor([0, 1, 2, 3]).cuda()

images, _ = transform(images, None)

base_features = backbone(images.tensors.cuda())
t = time.time()
# torch.max(foo)  # max shows no delay here
torch.unique(foo)  # unique is where the delay appears
print(time.time()-t)

t = time.time()
torch.unique(foo)
print(time.time()-t)

The output of this code for the setup described below is:

0.22019076347351074
0.0008745193481445312

Setup:
python: 3.7
torch: 1.7.0
torchvision: 0.8.1
cudatoolkit: 10.1
GPU: NVIDIA 2080 Ti
Ubuntu: 18.04.4 LTS

I also tried it on a 1080 Ti and a Titan X, and with previous python/pytorch/torchvision releases, with almost the same results.

Thanks in advance!

Hi,

This happens because the CUDA API is asynchronous: calls only block when the work queue on the GPU is full and you have to wait for it to drain, or when you request data on the CPU side.
You can use torch.cuda.synchronize() to force synchronization with the GPU so that it waits for the ResNet computations to finish.
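
For example, something like this (a rough, untested sketch reusing the variables from your snippet) should show where the time actually goes:

import time
import torch

# assumes `backbone`, `images` and `foo` from the snippet above, already on the GPU

torch.cuda.synchronize()                  # make sure nothing is still queued
t = time.time()
base_features = backbone(images.tensors.cuda())
torch.cuda.synchronize()                  # wait for the backbone forward pass to finish
print('backbone forward:', time.time() - t)

t = time.time()
torch.unique(foo)
torch.cuda.synchronize()                  # wait for unique() itself
print('torch.unique:', time.time() - t)

With the synchronizations in place you should see most of the ~0.22 sec move from the unique() call to the backbone line.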

Thanks a lot for your answer!
I thought that passing e.g. 8 images through ResNet would not be so much slower than passing just 1, since they are processed in parallel, but this does not seem to be the case. Do you have any idea how I could avoid this overhead?

This is not really an “overhead”; it is simply the time the GPU takes to actually perform the computation :smiley:
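
If you time the backbone forward pass itself with explicit synchronization for different batch sizes (a quick sketch below, assuming the `backbone` and `transform` objects from your first snippet), you should see that the 8-image batch genuinely takes longer than a single image, even though the images are processed in parallel on the GPU:

import time
import torch

# assumes `backbone` and `transform` from the original snippet (model on the GPU)
for batch_size in (1, 8):
    imgs = [torch.rand(3, 768, 1024) for _ in range(batch_size)]
    imgs, _ = transform(imgs, None)        # resize/normalize and batch the images
    batch = imgs.tensors.cuda()

    torch.cuda.synchronize()
    t = time.time()
    backbone(batch)
    torch.cuda.synchronize()               # wait until the forward pass has really finished
    print(batch_size, 'image(s):', time.time() - t, 'sec')

Running the loop twice and ignoring the first pass avoids counting one-off warm-up costs such as cuDNN algorithm selection.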