in all the results above, torch.backends.cudnn.benchmark is set to True.
the slowness over the first sample of a minibatch recurs for every minibatch, not just the first one, so warmup cannot be the issue.
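for reference, here is a minimal sketch of how such per-sample timings can be taken (not my exact harness; the perf_counter placement is a simplification):

import time
import torch

def time_samples(z, process):
    times = []
    for i in range(z.shape[0]):
        t0 = time.perf_counter()
        process(z[i])
        torch.cuda.synchronize()  # without this we would only time the kernel launch
        times.append(time.perf_counter() - t0)
    return times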
i did some more digging, and now i am sure that the slowness is caused by a cuda implicit synchronization. it is not torch.histc, tensor.min()/max(), or the python if by themselves.
to clarify, i work on a tensor z of size (32, 1, h, w). it is loaded from disk sample by sample using a pytorch dataloader with multiple workers, then stitched together into a minibatch the same way images are.
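the loading looks roughly like this (a sketch, not my exact code; my_dataset, num_workers, and pin_memory are placeholders/guesses):

from torch.utils.data import DataLoader

loader = DataLoader(my_dataset, batch_size=32, num_workers=8, pin_memory=True)
for z in loader:
    z = z.to('cuda', non_blocking=True)  # z: (32, 1, h, w); non_blocking is a guess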
after loading the tensor z, these are the only cuda operations i apply to it, very early in the code:
import torch
import torch.nn.functional as F

# z is loaded using the dataloader
with torch.no_grad():
    # z: (bsz, 1, h, w)
    assert z.ndim == 4
    # quick fix for invalid values
    z = torch.nan_to_num(z, nan=0.0, posinf=1., neginf=0.0)
    z = F.interpolate(z,
                      image_size,
                      mode='bilinear',
                      align_corners=False)  # (bsz, 1, h, w)
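for what it is worth, the actual gpu time of that preparation could be measured with cuda events (a sketch, not something from my run):

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

start.record()
z = torch.nan_to_num(z, nan=0.0, posinf=1., neginf=0.0)
z = F.interpolate(z, image_size, mode='bilinear', align_corners=False)
end.record()

torch.cuda.synchronize()  # elapsed_time is only valid once both events have completed
print(f'prep gpu time: {start.elapsed_time(end):.3f} ms')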
and that is all. after this, i do some other stuff that does not involve z at all, such as a forward pass through the model.
then, i call a function that iterates over each sample in z and does some cuda stuff. what i learned is that the FIRST CUDA OPERATION ON THE FIRST SAMPLE causes A CUDA IMPLICIT SYNCHRONIZATION. whether it is z.min()/max(), torch.histc, or a python if min == max does not matter. the moment cuda is involved, it synchronizes. it is as if the kernels on z are not done!!!
# prepare z
# other stuff that does not involve z...
def func(z_):
    for i in range(bsz):
        process(z_[i])

func(z)
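process is a stand-in here; a minimal version matching the ops i mentioned would look something like this (hypothetical, my real one does more):

def process(zi):
    # zi: (1, h, w) cuda tensor
    mn, mx = zi.min(), zi.max()  # first cuda op on the sample: this is where the stall shows up
    if mn == mx:  # the python if reads a gpu scalar, i.e. a device-to-host copy
        return None
    return torch.histc(zi, bins=256, min=mn.item(), max=mx.item())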
ok, let's say they are not done. if and only if i add torch.cuda.synchronize() right before starting the loop, all samples run in 0.0003s. otherwise, the first runs in 0.33s while the rest run in 0.0003s. meaning the sync works, and z is ready.
but if i ask for the sync right after preparing z, it won't work!!! later, when starting to work on z, cuda will synchronize on its own… it is so weird.
# prepare z
# torch.cuda.synchronize()  # does not work here: we get the same slow behavior.
# other stuff that does not involve z...
def func(z_):
    torch.cuda.synchronize()  # works here: z is ready and no implicit cuda sync is needed.
    for i in range(bsz):
        process(z_[i])

func(z)
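one way to see where the wait actually lands is the pytorch profiler (a sketch, not output from my run):

from torch.profiler import profile, ProfilerActivity

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    func(z)
print(prof.key_averages().table(sort_by='cuda_time_total', row_limit=10))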
this implicit cuda sync happens even if i create z as zeros right before the loop, like this:
# prepare z
# other stuff that does not involve z...
def func(z_):
    # forget about our z... let's create a new one full of 0.
    z_ = torch.zeros((bsz, 1, h, w), dtype=torch.float, device=cuda0)
    for i in range(bsz):
        process(z_[i])

func(z)
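note that even torch.zeros on the gpu is an asynchronous kernel launch on the current stream, so it queues behind whatever is still pending. a quick check (sketch; the prints are illustrative):

import time

t0 = time.perf_counter()
z_ = torch.zeros((bsz, 1, h, w), dtype=torch.float, device=cuda0)
print('launch:', time.perf_counter() - t0)      # returns almost immediately
torch.cuda.synchronize()
print('after sync:', time.perf_counter() - t0)  # includes any queued work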
the same thing happens: the first sample is slow, then the rest are fast. inside the loop, the only cuda tensor i manipulate is z_; no external tensors are required. i don't understand what is blocking cuda, and why it needs to sync. z should be ready by then, and even creating a zeros tensor should be instant.
– the creation of z and func are wrapped in torch.no_grad(). i tried with and without; i get the same pattern. here, i simplified func; it is actually a torch.nn.Module subclass, and i call its forward.
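roughly like this (a simplified sketch of the real class):

class Func(torch.nn.Module):
    def forward(self, z_):
        for i in range(z_.shape[0]):
            process(z_[i])

func = Func()
func(z)  # calls forward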
any advice?
thanks