Question: GPU operations are not asynchronous in my case.
I run something like
t = time.time()
loss = model(x)
cost = time.time() - t
but I got almost the same result with or without synchronization.
I have called .cuda() on the model (the model is on the GPU).
There should be no GPU-to-CPU transfer (i.e. .cpu()) in the model's forward pass.
It seems that GPU operations are not asynchronous in my case.
Or how can I check whether I mistakenly synchronize during the model's forward pass?
Some operations, like .item(), will add a synchronization point to your code.
Could you post the model definition, so that we could have a look for unwanted sync points?
Also, how large is your workload?
IIRC, backward is a synchronization point in PyTorch.
I checked all parts of my model by printing their execution times without synchronization.
One part, with a GRU and LayerNorm, takes about 100x longer than the other parts.
The code in this part is something like:
v1 = self._gru(self._ln1(v1 + v0))
v2 = self._gru(self._ln2(v2 + v0))
v3 = self._gru(self._ln3(v3 + v0))
self._ln1, self._ln2, and self._ln3 are instances of nn.LayerNorm.
self._gru is a residual GRU with the following code:
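import torch.nn as nn

class ResidualGRU(nn.Module):  # class name assumed; the original post omits it
    def __init__(self, hidden_size, dropout, num_layers):
        super().__init__()
        # Bidirectional GRU: outputs 2 * (hidden_size // 2) features per step
        self.enc_layer = nn.GRU(input_size=hidden_size, hidden_size=hidden_size // 2, num_layers=num_layers,
                                batch_first=True, dropout=dropout, bidirectional=True)
        self.enc_ln = nn.LayerNorm(hidden_size)

    def forward(self, input):
        output, _ = self.enc_layer(input)
        return self.enc_ln(output + input)  # residual connection, then LayerNorm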
May I ask whether the GRU causes a sync? Or what is wrong with this code?
But I still got same result even if I separately get execution time for
And it seems that backward should be asynchronous, since it is a GPU operation?
You cannot time separate modules without synchronization, as the elapsed time will simply accumulate at whichever point a sync occurs, e.g. before the next iteration.
This is highly misleading and is thus not recommended.
Anyway, since your execution time is apparently the same, I would like to check the model.
Could you post the model definition and the shapes for the tensors so that we could run it?
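For reference, here is a minimal sketch of how a single call could be timed with explicit synchronization. The helper name timed_call and its sync parameter are made up for illustration; with PyTorch one would pass torch.cuda.synchronize as sync. The demo function assumes a CUDA device and uses a placeholder nn.Linear model, not the model from this thread.

```python
import time


def timed_call(fn, *args, sync=None, **kwargs):
    """Return (result, seconds) for fn(*args, **kwargs).

    sync is an optional zero-argument callable that blocks until all queued
    device work has finished, e.g. torch.cuda.synchronize.
    """
    if sync is not None:
        sync()  # drain work queued before the timed region
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    if sync is not None:
        sync()  # wait for the work launched by fn to finish
    return result, time.perf_counter() - start


def demo():
    """Hypothetical usage; requires PyTorch and a CUDA device."""
    import torch

    model = torch.nn.Linear(1024, 1024).cuda()  # placeholder model
    x = torch.randn(64, 1024, device="cuda")
    model(x)  # warm-up so one-time initialization is not timed
    torch.cuda.synchronize()

    # Without sync, this mostly measures how long it takes to queue the kernels.
    _, launch_only = timed_call(model, x)
    # With sync, the measurement covers the actual GPU execution.
    _, full = timed_call(model, x, sync=torch.cuda.synchronize)
    print(f"no sync: {launch_only:.6f}s, with sync: {full:.6f}s")
```

For a small workload the two numbers can be close, which is one reason the size of the workload matters when interpreting such timings.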
The problem does not exist any more.
I did not control variables when comparing the time cost with and without synchronization, so things like CPU load, sample order, etc. may have affected the result.
When I run the code with and without CUDA_LAUNCH_BLOCKING=1 at the same time and with the same random seed, the time costs differ.
Anyway, thank you very much.
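A rough sketch of how the comparison described above could be scripted, assuming the training code lives in a separate script that fixes its own random seed. The helper names launch_env and compare are made up for illustration. CUDA_LAUNCH_BLOCKING has to be set in the environment before the process initializes CUDA, so each variant runs in its own subprocess.

```python
import os
import subprocess
import time


def launch_env(blocking, base=None):
    """Copy an environment and set or remove CUDA_LAUNCH_BLOCKING."""
    env = dict(os.environ if base is None else base)
    if blocking:
        env["CUDA_LAUNCH_BLOCKING"] = "1"  # force synchronous kernel launches
    else:
        env.pop("CUDA_LAUNCH_BLOCKING", None)  # default asynchronous behaviour
    return env


def compare(cmd):
    """Run cmd once per setting and return wall-clock seconds keyed by blocking flag."""
    timings = {}
    for blocking in (False, True):
        start = time.perf_counter()
        subprocess.run(cmd, env=launch_env(blocking), check=True)
        timings[blocking] = time.perf_counter() - start
    return timings
```

Usage would look like compare([sys.executable, "train.py"]), where train.py is a hypothetical script. The wall-clock numbers include process start-up and CUDA initialization, so this only gives a coarse comparison; the runs here are sequential rather than simultaneous.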