Why is `torch.histc` slow on the first sample?

torch.histc performed on a minibatch CUDA tensor of n samples is slow on the first sample…
Is that normal?

torch.histc takes way too long on the first sample, 0.34 s,
and about 0.000163 s on the rest.

Any idea why, and how to speed this up?
The code is in the middle of a large codebase where warmup has already been done…

Could you describe how you’ve profiled the code, please?
Profiling code often misses e.g. synchronizations; the timer can then accumulate previously launched kernel times into the current measurement and return wrong timings.
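One way to avoid that pitfall is to synchronize both before starting and after stopping the host-side timer, so no previously queued kernels leak into the measured window. A minimal sketch (the `time_op` helper name and the histc workload are illustrative, not from the code in question; it also runs on the CPU):

```python
import time
import torch

def time_op(fn, *args, iters=10):
    """Average wall-clock time of fn(*args), with explicit synchronization
    so previously queued kernels don't leak into the measurement.
    (`time_op` is an illustrative helper, not a PyTorch API.)"""
    if torch.cuda.is_available():
        torch.cuda.synchronize()          # drain anything launched earlier
    t0 = time.perf_counter()
    for _ in range(iters):
        out = fn(*args)
    if torch.cuda.is_available():
        torch.cuda.synchronize()          # wait for the timed kernels to finish
    return (time.perf_counter() - t0) / iters, out

x = torch.randn(10000)
per_iter, hist = time_op(torch.histc, x)  # histc defaults to bins=100
print(f"histc: {per_iter:.6f}s per call, {hist.numel()} bins")
```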

I re-ran the timing with torch.cuda.Event, and I get roughly the same times except for the first sample: with Python time I get a much higher time than with torch.cuda.Event. I don’t know why.
I’m not sure what the Python timer was actually measuring, or why this huge time shows up at the first sample. It happens consistently on the first sample of every minibatch.

What is the true time my code takes (it can’t be two different times)? Is it the one computed with torch.cuda.Event or with the Python timer? If the Python timer takes this long, something must be happening that torch.cuda.Event didn’t take into account… it could be related to synchronization. When I serialize the execution, both timers agree.

Could torch.histc create a synchronization point at the first sample, letting all the kernels over the other samples finish? That would make it fast on the rest of the samples but not the first one.

The goal of these measurements is to find bottlenecks, speed them up, and speed up the full code.
In this case, 99% of every minibatch’s time is spent on the first sample. That would be fine if it happened only once, but it happens on the first sample of every minibatch, so it adds up.


Here are all the timings:

  • time using time.perf_counter (s)
sample 1 histc 0.0004032440483570099 
sample 2 histc 0.0005275830626487732  
sample 3 histc 0.000552598387002945 
sample 1 histc 0.3368970938026905  **WHY?** 
sample 2 histc 0.00043926388025283813 
sample 3 histc 0.0003735348582267761 
  • time using torch.cuda.Event (ms)
sample 1 histc 0.4039680063724518
sample 2 histc 0.6740800142288208
sample 3 histc 5.139200210571289
sample 1 histc 0.3036479949951172
sample 2 histc 0.47996801137924194
sample 3 histc 0.5324159860610962

general setup:

```python
def function(tensor):
    # operations ...
    # tx = time.perf_counter()
    start_event = torch.cuda.Event(enable_timing=True)
    end_event = torch.cuda.Event(enable_timing=True)

    start_event.record()
    hist = torch.histc(tensor, bins=bins)
    end_event.record()
    torch.cuda.synchronize()  # elapsed_time needs both events recorded and completed
    elapsed_time_ms = start_event.elapsed_time(end_event)
    print('histc {}'.format(elapsed_time_ms))
    # print('histc {} '.format(time.perf_counter() - tx))

for epoch in epochs:
    # prepare tensors
    l_tensors = []
    for t in range(minibatch_size):
        l_tensors.append(compute(tensor_t))
    l_tensors = torch.vstack(l_tensors)
    # process all tensors.
    for tensor in l_tensors:
        function(tensor)  # <-- where we measure time.
```

Hi, the PyTorch profiler might be beneficial to you: torch.profiler — PyTorch 1.9.0 documentation. You can also use the trace functionality to identify bottlenecks in your code: PyTorch Profiler — PyTorch Tutorials 1.9.0+cu102 documentation.

I think I arrived at timing the histc function specifically, and its parent function where it was called, through the PyTorch profiler.
It doesn’t detail all the functions, I guess, so I had to manually time some functions further.
The profiler also showed that tensor min and max take most of the time, but I think that is because I used them to evaluate the condition of an `if`, which is blocking because of the GPU-to-CPU transfer.
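As an illustration of why such an `if` blocks (a small sketch, not the actual code from this thread): the comparison itself stays on the device; it is the conversion to a Python bool that copies the result to the host and, on CUDA, waits for pending kernels first.

```python
import torch

x = torch.randn(1000)

m = x.min()        # on CUDA this only queues a kernel; `m` stays on the device
M = x.max()
same = (m == M)    # still a 0-dim device tensor, no host copy yet
if bool(same):     # bool()/item() copies to the host -> implicit synchronization on CUDA
    print("constant tensor")
else:
    print("non-constant tensor")
```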

Based on your description, the Python timer might be profiling the CUDA init in the first iteration, or the “wake up cycle” in case the GPU was idle before. I would recommend using torch.utils.benchmark to profile specific ops, as it takes care of the warmup iterations as well as the necessary synchronizations.
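In case it helps, a minimal sketch of what that looks like (the tensor shape and bin count here are arbitrary; `Timer` handles warmup and CUDA synchronization internally):

```python
import torch
import torch.utils.benchmark as benchmark

x = torch.randn(1, 256, 256)  # move to device='cuda' to benchmark the GPU path

t = benchmark.Timer(
    stmt='torch.histc(x, bins=256)',
    globals={'x': x, 'torch': torch},
)
m = t.timeit(100)   # returns a Measurement; warmup and sync are handled internally
print(m.median)     # median seconds per run
```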

In all the results above, torch.utils.benchmark is set to true.

The slowness on the first sample of a minibatch recurs in all minibatches, not just the first one,
so warmup can’t be the issue.

I did some more digging, and now I am sure the slowness is caused by a CUDA implicit synchronization. It is not torch.histc, tensor.min/max, or the Python if.

To clarify: I work on a tensor z of size (32, 1, h, w) that is loaded from disk sample by sample with a PyTorch DataLoader with multiple workers, and stitched together the same way images are stacked to form a minibatch.

After loading the tensor z, I run only the following CUDA operations on it, very early in the code:

```python
# z is loaded using the dataloader
with torch.no_grad():
    # z: (bsz, 1, h, w)
    assert z.ndim == 4
    # quick fix for invalid values
    z = torch.nan_to_num(z, nan=0.0, posinf=1., neginf=0.0)
    z = F.interpolate(z,  # size/scale_factor argument omitted here
                      align_corners=False)  # (bsz, 1, h, w)
```

and that is all.

After this, I do some other work that does not involve z at all, such as a forward pass through the model.
Then I call a function that iterates over each sample in z and does some CUDA work.
What I learned is that the FIRST CUDA OPERATION ON THE FIRST SAMPLE causes A CUDA IMPLICIT SYNCHRONIZATION. Whether it is z.min()/z.max(), torch.histc, or a Python if on min == max does not matter. The moment CUDA is involved, it synchronizes. It is as if the kernels on z were not done!!!

```python
# prepare z
# other stuff that does not involve z...

def func(z_):
    for i in range(bsz):
        ...  # per-sample cuda ops: histc, min/max, etc.
```
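For context, this matches how asynchronous execution generally behaves: kernel launches return immediately, and the cost only surfaces at the first synchronizing call. A small standalone sketch (not my actual code; on a CPU-only machine both numbers measure the same eager execution):

```python
import time
import torch

def launch_vs_sync(iters=10):
    """Return (enqueue_time, total_time) for a chain of matmuls.
    On CUDA the first number is tiny because launches are asynchronous;
    on CPU both numbers measure the same eager execution."""
    dev = 'cuda' if torch.cuda.is_available() else 'cpu'
    a = torch.randn(512, 512, device=dev)
    if dev == 'cuda':
        torch.cuda.synchronize()          # start from an empty queue
    t0 = time.perf_counter()
    for _ in range(iters):
        a = torch.relu(a @ a)             # on CUDA this only enqueues kernels
    enqueue = time.perf_counter() - t0
    if dev == 'cuda':
        torch.cuda.synchronize()          # this is where the real wait happens
    total = time.perf_counter() - t0
    return enqueue, total

enqueue, total = launch_vs_sync()
print(f"enqueue {enqueue:.5f}s  total {total:.5f}s")
```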


OK, let’s say they are not done.
If, and only if, I add torch.cuda.synchronize() right before starting the loop, all samples run in 0.0003 s; otherwise the first runs in 0.33 s while the rest run in 0.0003 s. So the synchronization works, and z is ready.
But if I ask for a synchronization right after preparing z, it doesn’t work!!!
Later, when starting to work on z, CUDA synchronizes on its own… it is so weird.

```python
# prepare z
# torch.cuda.synchronize()  # does not work here; we get the same slow behavior.
# other stuff that does not involve z...

def func(z_):
    torch.cuda.synchronize()  # works here: z is ready, no implicit cuda synch needed.
    for i in range(bsz):
        ...  # per-sample cuda ops
```


This implicit CUDA synchronization happens even if I create z as zeros right before the loop:

```python
# prepare z

# other stuff that does not involve z...

def func(z_):
    # forget about our z... let's create a new one full of 0.
    z_ = torch.zeros((bsz, 1, h, w), dtype=torch.float, device=cuda0)
    for i in range(bsz):
        ...  # per-sample cuda ops
```


The same thing happens: the first sample is slow, then the rest are fast.
In the loop, the only manipulated CUDA tensor is z; no external tensors are required.
I don’t understand what’s blocking CUDA and why it needs to synchronize.
z should be ready by then, and even creating a zero tensor should be instant.

– The creation of z and func are wrapped in torch.no_grad(). I tried with and without; I get the same pattern. I simplified func here; it is actually a torch.nn.Module subclass, and I call its forward.

Any advice?

torch.utils.benchmark is a utility for profiling operations, not a flag, so setting it to True won’t do anything besides overriding it.

Based on your description, you are getting the right profiling when you synchronize before starting and stopping the timers, and the expected wrong results otherwise.

I did more digging.
I hadn’t found the exact cause of the problem, until now.

Note: I have a network where I perform a forward pass for training over minibatch x, then I have the part of the code that processes z.

The network forward operates directly on the minibatch (in parallel).
However, the part operating on z works sequentially; only at the end do I gather what is batchable into a single CUDA operation over the entire batch.

Initially, I perform:

```python
network(x)   # training forward over the minibatch
process(z)   # sequential per-sample work on z
```

This causes the discussed issue when arriving at process(z): the first sample is very slow, 0.33 s, then the rest of the samples are fast, 0.003 s, bringing the minibatch time to 0.45 s.

Then I switched the order to:

```python
process(z)   # sequential per-sample work on z
network(x)   # training forward over the minibatch
```

Things work properly; the problem ‘disappeared’ at process(z). All samples in z are processed in 0.003 s, and all of z takes 0.11 s, without an explicit torch.cuda.synchronize().

But that’s not all.

network(x) is completely independent from process(z) in terms of data and functions, so I don’t know how the order of executing kernels affects them. Also, this doesn’t necessarily mean the problem is solved, because the overall time of one epoch didn’t improve much (from 2min14s to 2min03s; I expected more gain, like 0.33 s × 188 minibatches). CUDA is probably doing some implicit synchronization.

So I did some more digging. It turns out I was wrong about the time of network(x): it is not 0.001 s but 0.33 s. I had used the wrong tool, timing CUDA work with the Python timer.

So the extra 0.33 s was coming from the network forward, not from other operations.
The order does not matter.
CUDA needs to synchronize to get the model output, and that causes the delay.

Sorry for the inconvenience, and thanks for your help.

Sorry, I was thinking of torch.backends.cudnn.benchmark, which is a flag.