Synchronizations themselves do not take time; they wait for another process and thus accumulate its time in a profile. E.g. if your GPU is busy executing the forward pass of the model, the CPU has to synchronize and thus wait for the GPU if you try to print the output. E.g. here:
output = model(input) # executed on the GPU
print(output) # CPU synchronizes since the values of output are needed to be able to print them
the print statement would need to synchronize with the GPU, as it would otherwise just print uninitialized memory. In a profile the print operation could thus accumulate the time of the forward pass and look expensive, while in fact the forward pass itself takes the majority of the time.
Manual synchronizations via torch.cuda.synchronize() are often used to profile the code properly.
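A minimal timing sketch showing this pattern (the model and input shapes are made up for illustration; the synchronization before stopping the timer is the important part):

```python
import time
import torch

# Hypothetical small model and input, just for illustration.
model = torch.nn.Linear(1024, 1024)
x = torch.randn(64, 1024)

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
x = x.to(device)

# CUDA kernels are launched asynchronously, so without a synchronization
# the CPU timer could stop before the GPU has finished its work.
if device == "cuda":
    torch.cuda.synchronize()
start = time.perf_counter()
out = model(x)
if device == "cuda":
    torch.cuda.synchronize()  # wait for the GPU before reading the timer
elapsed = time.perf_counter() - start
print(f"forward pass took {elapsed * 1e3:.3f} ms")
```

Without the second synchronize() call, the measured time would only cover the kernel launch, and the actual GPU work would be attributed to whichever later call happens to synchronize (such as the print in the example above).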
Alternatively, you can use the PyTorch profiler or Nsight Systems, as both show the timeline and let you see how long each operation takes.
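A short sketch of the PyTorch profiler approach (again with a made-up model; the CUDA activity is only enabled if a GPU is available):

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Hypothetical model and input for illustration.
model = torch.nn.Linear(512, 512)
x = torch.randn(32, 512)

activities = [ProfilerActivity.CPU]
if torch.cuda.is_available():
    activities.append(ProfilerActivity.CUDA)
    model, x = model.cuda(), x.cuda()

with profile(activities=activities) as prof:
    out = model(x)

# The table attributes the time to the operators that actually ran,
# instead of to the next synchronizing call.
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=5))
```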
Usually the input size is small compared to the intermediate activations created during the forward pass and/or the model parameters. Take a look at this post, which explains it in more detail with some examples.
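A quick way to see this is to sum up the sizes directly; the toy CNN and shapes below are made up, but the pattern (activations dwarfing the input) is typical:

```python
import torch

# Hypothetical small CNN to compare the memory of the input, the
# parameters, and the intermediate activations.
model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 64, 3, padding=1),
    torch.nn.ReLU(),
    torch.nn.Conv2d(64, 64, 3, padding=1),
)
x = torch.randn(8, 3, 224, 224)

act_bytes = 0
def hook(module, inp, out):
    # accumulate the size of each layer's output activation
    global act_bytes
    act_bytes += out.numel() * out.element_size()

for m in model:
    m.register_forward_hook(hook)

out = model(x)

input_bytes = x.numel() * x.element_size()
param_bytes = sum(p.numel() * p.element_size() for p in model.parameters())
print(f"input:       {input_bytes / 1e6:.1f} MB")
print(f"parameters:  {param_bytes / 1e6:.1f} MB")
print(f"activations: {act_bytes / 1e6:.1f} MB")
```

For these shapes the input is a few MB while the stored activations reach hundreds of MB, since each conv layer keeps a full-resolution 64-channel output.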