Tensor.cuda() takes ~2 seconds

Hello there!

I am using Captum to run some explanations of my model (vgg16). When using Occlusion and setting perturbations_per_eval=100 (which allows running 100 forward passes simultaneously), it needs ~23GB of GPU memory. In the next (and every subsequent) iteration of my loop, calling tensor.cuda() on the input takes ~2 seconds. Is it because of GPU context switching? Is there something I can do about it?
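For reference, the relevant part looks roughly like this; the sliding window shape, strides, and target below are simplified placeholders, not my exact settings:

import torch
from captum.attr import Occlusion
from torchvision.models import vgg16

model = vgg16().cuda().eval()
occ = Occlusion(model)

t = torch.rand(1, 3, 512, 512).cuda()    # batch_size=1, as in my setup
attributions = occ.attribute(
    t,
    target=0,                            # placeholder target class
    sliding_window_shapes=(3, 45, 45),   # placeholder window shape
    strides=(3, 9, 9),                   # placeholder strides
    perturbations_per_eval=100,          # 100 perturbed inputs per forward pass
)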

I am using batch_size=1 and the GPU has ~46GB.

I am happy to provide more complete code samples, if needed.

Thanks!

How are you measuring the time? CUDA calls are asynchronous, so to properly profile GPU execution time you need to synchronize the GPU before stopping the timer. Otherwise the runtime of previously queued ops can get attributed to the next blocking call.
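For example, you can either call torch.cuda.synchronize() right before reading the host timer, or use CUDA events, which time the work on the device itself. A rough sketch (the copy is just a placeholder for whatever op you want to time):

import torch

x = torch.rand(3, 512, 512)
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

start.record()
y = x.cuda()                          # placeholder for the op you want to time
end.record()

torch.cuda.synchronize()              # wait until all queued kernels have finished
print(start.elapsed_time(end), "ms")  # elapsed device time in milliseconds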

Hey!

Initially I was just printing the time difference between the two calls; I did not realize that the call is not blocking when the computation happens on the GPU. I looked it up, and this should do the job:

import time
import torch

for t, _, _ in dataloader:
    c = time.time()
    t = t.cuda()                 # copy the input to the GPU
    torch.cuda.synchronize()     # wait for the copy to finish before timing
    print(time.time() - c)
    a = time.time()
    occ.attribute(t, ..., perturbations_per_eval=100)
    torch.cuda.synchronize()     # wait for all attribution kernels to finish
    print(time.time() - a)
    ...

Now the values I am getting are:
t.cuda → 0.000 seconds - as expected, since it is a single (3, 512, 512) tensor
occ.attribute → ~1.98 seconds - here comes my confusion: I would expect that setting the value to 100 (instead of the original 1, where it took ~2.25 seconds) would make it run much faster, but it did not.

Therefore I am curious whether the lack of speed-up is related to the GPU having to free and allocate a lot of memory, or to something happening inside Captum (maybe generating the occlusion patches; I haven't looked into it yet).

PyTorch caches GPU memory via its caching allocator and does not repeatedly allocate and free it, as that would result in terrible performance. I'm not familiar enough with Captum and don't know what this method does internally, but you could profile its internals.
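One way to do that without digging through the source is the PyTorch profiler; a minimal sketch, assuming the occ and t from your snippet, with placeholder attribute arguments standing in for whatever you are already passing:

import torch
from torch.profiler import profile, ProfilerActivity

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    occ.attribute(
        t,
        target=0,                           # placeholder
        sliding_window_shapes=(3, 45, 45),  # placeholder
        strides=(3, 9, 9),                  # placeholder
        perturbations_per_eval=100,
    )

# ops sorted by total CUDA time; this should show whether the time is spent in the
# model's forward passes or elsewhere (e.g. building the perturbed inputs on the CPU)
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))

You could also compare torch.cuda.memory_allocated() and torch.cuda.memory_reserved() across iterations to confirm the cached memory is being reused rather than freed and re-allocated.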