Question: GPU operations are not asynchronous in my case.
Description:
I run something like:
t = time.time()
loss = model(x)
loss.backward()
cost = time.time() - t
but I get almost the same result with and without torch.cuda.synchronize().
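For reference, this is the timing pattern I expected to need (a sketch; `timed_step` is a hypothetical helper, not code from my model). Synchronizing before starting the clock and after the work is what makes the measurement cover actual GPU execution rather than just kernel launches:

```python
import time
import torch

def timed_step(model, x):
    # Hypothetical helper: times one forward/backward pass.
    # Synchronize BEFORE starting the clock and AFTER the work;
    # without the second sync, the timer mostly measures
    # asynchronous kernel-launch overhead on a GPU.
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    t = time.time()
    loss = model(x).sum()
    loss.backward()
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return time.time() - t
```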
I have called .cuda() on the model, so the model is on the GPU.
There should be no GPU-CPU transfers (i.e. no .cpu() or .cuda() calls) in the model's forward() method.
It seems that GPU operations are not asynchronous in my case.
Why?
Or how can I check whether I mistakenly synchronize during the model's forward() method?
Some operations, such as .item(), will add a synchronization point to your code.
Could you post the model definition, so that we could have a look for unwanted sync points?
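A few common operations that force an implicit device-to-host synchronization are shown below (a sketch based on general CUDA semantics; the exact set depends on the PyTorch version, and the code falls back to CPU when no GPU is available):

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(4, 4, device=device)

# Each of these moves data to the CPU, so the host must wait
# for all pending GPU work on the tensor to finish first:
val = x.sum().item()          # scalar read -> synchronizes
arr = x.cpu().numpy()         # explicit device-to-host copy
flag = bool((x > 0).any())    # Python bool conversion
```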
I checked all parts of my model by printing their execution times without torch.cuda.synchronize().
One part, containing a GRU and LayerNorm, has a 100x higher time cost than the other parts.
You cannot time separate modules without synchronization: a module's measured time will simply accumulate until a sync point is reached, e.g. before the next iteration. Such measurements are highly misleading and therefore not recommended.
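To time an individual module properly, you can use CUDA events (or wrap the call in explicit synchronize calls). A minimal sketch, with a wall-clock fallback for CPU-only machines (`time_module` is an illustrative helper name):

```python
import time
import torch

def time_module(module, inp, iters=10):
    # Times one module with CUDA events, which record on the GPU
    # stream itself and so capture real kernel execution time.
    if torch.cuda.is_available():
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        torch.cuda.synchronize()  # drain pending work first
        start.record()
        for _ in range(iters):
            module(inp)
        end.record()
        torch.cuda.synchronize()  # wait until `end` has been recorded
        return start.elapsed_time(end) / iters  # milliseconds
    # CPU fallback: ordinary wall-clock timing is already synchronous.
    t = time.time()
    for _ in range(iters):
        module(inp)
    return (time.time() - t) * 1000.0 / iters
```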
Anyway, since your execution time is apparently the same, I would like to check the model.
Could you post the model definition and the shapes for the tensors so that we could run it?
The problem does not exist any more.
I did not control the variables when comparing the time cost with and without synchronize, so things like CPU load, sample order, etc. may have affected the result.
When I run the code with and without CUDA_LAUNCH_BLOCKING=1, using the same random seed, the time costs are different.
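For anyone else debugging this: setting the environment variable makes every kernel launch synchronous, so Python-side timers reflect actual GPU execution time (here `train.py` is a placeholder for your own script):

```shell
# CUDA_LAUNCH_BLOCKING=1 must be set before CUDA is initialized,
# so pass it on the command line rather than from inside Python.
CUDA_LAUNCH_BLOCKING=1 python train.py
```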