Tensor.cuda() takes ~2 seconds

Hello there!

I am using Captum to run some explanations of my model (vgg16). When using Occlusion and setting perturbations_per_eval=100 (which allows running 100 forward passes simultaneously), it needs ~23GB of GPU memory. In the next (and every subsequent) iteration of my loop, calling tensor.cuda() on the input takes ~2 seconds. Is it because of GPU context switching? Is there something I can do about it?
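For reference, the relevant part looks roughly like this; the sliding window shape, strides, and target below are simplified placeholders, not my exact settings:

import torch
from captum.attr import Occlusion
from torchvision.models import vgg16

model = vgg16().cuda().eval()
occ = Occlusion(model)

t = torch.rand(1, 3, 512, 512).cuda()    # batch_size=1, as in my setup
attributions = occ.attribute(
    t,
    target=0,                            # placeholder target class
    sliding_window_shapes=(3, 45, 45),   # placeholder window shape
    strides=(3, 9, 9),                   # placeholder strides
    perturbations_per_eval=100,          # 100 perturbed inputs per forward pass
)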

I am using batch_size=1 and the GPU has ~46GB.

I am happy to provide more complete code samples, if needed.

Thanks!

How are you measuring the time? CUDA calls are asynchronous, so to properly profile GPU execution time you need to synchronize the GPU before stopping the timer. Otherwise the runtime of previously queued ops can get attributed to the next blocking call.
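For example, you can either call torch.cuda.synchronize() right before reading the host timer, or use CUDA events, which time the work on the device itself. A rough sketch (the copy is just a placeholder for whatever op you want to time):

import torch

x = torch.rand(3, 512, 512)
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

start.record()
y = x.cuda()                          # placeholder for the op you want to time
end.record()

torch.cuda.synchronize()              # wait until all queued kernels have finished
print(start.elapsed_time(end), "ms")  # elapsed device time in milliseconds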

Hey!

Initially I was just printing the time difference between the two calls; I did not realize that the call is not blocking when the computation happens on the GPU. I looked it up, and this should do the job:

import time
import torch

for t, _, _ in dataloader:
    c = time.time()
    t = t.cuda()                 # copy the input to the GPU
    torch.cuda.synchronize()     # wait for the copy to finish before timing
    print(time.time() - c)
    a = time.time()
    occ.attribute(t, ..., perturbations_per_eval=100)
    torch.cuda.synchronize()     # wait for all attribution kernels to finish
    print(time.time() - a)
    ...

Now the values I am getting are:
t.cuda → 0.000 seconds - as expected, since it is a single (3, 512, 512) tensor
occ.attribute → ~1.98 seconds - here comes my confusion: I would expect that setting the value to 100 (instead of the original 1, where it took ~2.25 seconds) would make it run much faster, but it did not.

Therefore I am curious whether the lack of speed-up is related to the GPU having to free and allocate a lot of memory, or to something happening inside Captum (maybe generating the occlusion patches; I haven't looked into it yet).

PyTorch caches GPU memory via its caching allocator and does not repeatedly allocate and free it, as that would result in terrible performance. I'm not familiar enough with Captum and don't know what this method does internally, but you could profile its internals.
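One way to do that without digging through the source is the PyTorch profiler; a minimal sketch, assuming the occ and t from your snippet, with placeholder attribute arguments standing in for whatever you are already passing:

import torch
from torch.profiler import profile, ProfilerActivity

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    occ.attribute(
        t,
        target=0,                           # placeholder
        sliding_window_shapes=(3, 45, 45),  # placeholder
        strides=(3, 9, 9),                  # placeholder
        perturbations_per_eval=100,
    )

# ops sorted by total CUDA time; this should show whether the time is spent in the
# model's forward passes or elsewhere (e.g. building the perturbed inputs on the CPU)
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))

You could also compare torch.cuda.memory_allocated() and torch.cuda.memory_reserved() across iterations to confirm the cached memory is being reused rather than freed and re-allocated.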