...some model
with torch.no_grad():
    result = self.model(tensor)
    t0 = time.time()
    print(result.shape)
    result = result.cpu()
    print(f"Took seconds {time.time() - t0}")

has output

torch.Size([1, 2, 1033, 1033])
Took seconds 4.281269311904907
torch.Size([1, 2, 833, 833])
Took seconds 2.7177305221557617
torch.Size([1, 2, 673, 673])
Took seconds 1.7987253665924072

I did not know that it is the .cpu() operation. Why does .cpu() take seconds for a tensor?

Calls to cuda operations can return asynchronously, but then a subsequent
call that attempts to use the resulting tensor will block until the cuda call
actually finishes.

Because of this you need to use torch.cuda.synchronize() in order to
get meaningful timings. Try this:

with torch.no_grad():
    result = self.model(tensor)
    torch.cuda.synchronize()   # wait for self.model() to actually finish
    t0 = time.time()
    print(result.shape)
    result = result.cpu()
    torch.cuda.synchronize()
    print(f"Took seconds {time.time() - t0}")   # only times result.shape and result.cpu()

Yes, you are most likely seeing the time it takes for your model to run
as synchronize() waits for the gpu to finish what it was doing. It could
also possibly include time for something the gpu was still doing before
you called result = self.model(tensor).

Try something like this:

with torch.no_grad():
    t0 = time.time()
    torch.cuda.synchronize()
    print(f"First synchronize(): {time.time() - t0}")
    t0 = time.time()
    result = self.model(tensor)
    torch.cuda.synchronize()
    print(f"Time for model(tensor): {time.time() - t0}")
    t0 = time.time()
    result = result.cpu()
    torch.cuda.synchronize()
    print(f"Time for result.cpu(): {time.time() - t0}")

The point is that hypothetically your gpu might have still been doing
something when you called model(tensor). The first synchronize()
waits for the gpu to finish whatever that might have been (if anything).
The time for the first synchronize() tells you how long this whatever
(if anything) took.

This example then calls model(tensor), after which the second
synchronize() waits for model(tensor) to finish. The second timing is
therefore just (except for some de minimis time() and print() calls)
the amount of time model(tensor) took.

(And then this example times how long it takes to move result to
the cpu, which could also be an interesting number.)
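(As a footnote, you could also consider timing gpu work with cuda events
instead of a time.time() / synchronize() pair. A sketch, assuming a cuda
gpu is available -- the matmul here is just a stand-in for your model call:

```python
import torch

# Sketch: time a gpu operation with cuda events.
# The matmul is a stand-in for self.model(tensor).
if torch.cuda.is_available():
    x = torch.randn(2048, 2048, device="cuda")

    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)

    start.record()             # marker enqueued on the cuda stream
    y = x @ x                  # the gpu work being timed
    end.record()               # marker enqueued after the work

    torch.cuda.synchronize()   # wait until both markers have been reached
    print(f"gpu time: {start.elapsed_time(end):.3f} ms")
else:
    print("no cuda gpu -- nothing to time")
```

The events are recorded on the cuda stream itself, so they measure gpu
time without your python code having to block at each step.)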

At its outermost level, a pytorch tensor is a data structure in cpu memory.
This data structure contains a “handle” to the underlying data which may
live in either cpu or gpu memory. (This tensor data structure also contains
a bunch of other useful information.)

When you execute result = self.model(tensor) you set the python
reference result to refer to a pytorch-tensor data structure (on the cpu).
At this point, the outermost shell of this tensor exists (on the cpu) and it
wraps gpu data which may or may not be ready yet.

But your python script can continue executing – for example, it could do
some cpu computation or retrieve some data from a different gpu – as
long as it doesn’t attempt to use the not-yet-ready gpu tensor.

To say it another way, you do already have the shell of the result tensor,
but you don’t yet have the full result tensor – in particular, you don’t have
its actual data.
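You can see this shell-versus-data distinction directly. A sketch,
assuming a cuda gpu -- the matmul stands in for your model call:

```python
import time
import torch

# Sketch: the matmul stands in for self.model(tensor).
if torch.cuda.is_available():
    a = torch.randn(4096, 4096, device="cuda")

    t0 = time.time()
    b = a @ a                  # kernel is queued; the call returns almost at once
    print(f"launch returned after {time.time() - t0:.6f} s")

    print(b.shape)             # shape lives in the cpu-side shell -- no wait needed

    t0 = time.time()
    b = b.cpu()                # fetching the data blocks until the matmul finishes
    print(f".cpu() took {time.time() - t0:.6f} s")
else:
    print("no cuda gpu available")
```

The shape prints immediately because it is part of the cpu-side data
structure, while .cpu() has to wait for the actual data.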

Pytorch has a lot of heavy-duty machinery under the hood, and this is
part of it. This scheme gives you the convenience of programming in
python (which runs on the cpu) while letting you manipulate gpu tensors
nearly transparently.