GPU operations do not seem to be asynchronous

Question: GPU operations are not asynchronous in my case.

Description:
I run something like:

import time

t = time.time()
loss = model(x)
loss.backward()
cost = time.time() - t

but I get almost the same result with and without torch.cuda.synchronize().
I have called .cuda() on the model, so the model is on the GPU.
There should be no GPU-CPU transfer (e.g. .cpu() or .cuda()) in the model’s forward() method.

It seems that GPU operations are not asynchronous in my case.
Why?
Or how can I check whether I mistakenly synchronize inside the model’s forward() method?

Some operations like .item() will add a synchronization point in your code.
Could you post the model definition, so that we could have a look for unwanted sync points?
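One way to hunt for such sync points, as a rough sketch: recent PyTorch releases provide torch.cuda.set_sync_debug_mode, which can warn whenever an operation forces a synchronization. The helper name run_with_sync_warnings and the toy forward_with_hidden_sync function below are made up for illustration and are not taken from the model in this thread. On a machine without CUDA the helper just runs the function unchanged.

```python
import torch


def run_with_sync_warnings(fn):
    """Run fn() and let PyTorch warn about synchronizing CUDA operations.

    Hypothetical helper: without CUDA there is nothing to check, so fn()
    is simply called as-is.
    """
    if not torch.cuda.is_available():
        return fn()
    torch.cuda.set_sync_debug_mode("warn")
    try:
        return fn()
    finally:
        # Restore the default so later code is not flooded with warnings.
        torch.cuda.set_sync_debug_mode("default")


device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.ones(4, device=device)


def forward_with_hidden_sync():
    total = x.sum()      # queued asynchronously when x lives on the GPU
    return total.item()  # copies the value to the host, which forces a sync


value = run_with_sync_warnings(forward_with_hidden_sync)
print(value)
```

Wrapping model(x) the same way should point at any line in forward() that synchronizes, assuming the installed PyTorch version supports this debug mode.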

Also, how large is your workload?

IIRC, backward is a synchronization point in PyTorch.

I checked every part of my model by printing its execution time without torch.cuda.synchronize().
One part, containing a GRU and LayerNorm, takes about 100x longer than the other parts.

The code in this part is something like:

v1 = self._gru(self._ln1(v1 + v0))
v2 = self._gru(self._ln2(v2 + v0))
v3 = self._gru(self._ln3(v3 + v0))

Here self._ln1, self._ln2, and self._ln3 are instances of nn.LayerNorm,
and self._gru is a Residual-GRU defined as:

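import torch.nn as nn

class ResidualGRU(nn.Module):
    def __init__(self, hidden_size, dropout, num_layers):
        super(ResidualGRU, self).__init__()
        # Bidirectional GRU: the two directions of size hidden_size // 2
        # are concatenated, so the output width is hidden_size again.
        self.enc_layer = nn.GRU(input_size=hidden_size, hidden_size=hidden_size // 2, num_layers=num_layers,
                                batch_first=True, dropout=dropout, bidirectional=True)
        self.enc_ln = nn.LayerNorm(hidden_size)

    def forward(self, input):
        output, _ = self.enc_layer(input)
        # Residual connection around the GRU, followed by LayerNorm.
        return self.enc_ln(output + input)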

May I ask whether the GRU causes a sync? Or what is wrong with this code?

But I still get the same result even when I time forward and backward separately.

And shouldn't backward be asynchronous, since it is a GPU operation?

You cannot time separate modules without synchronization: the queued GPU work is simply attributed to whichever module happens to hit the next sync, e.g. just before the next iteration.
Such timings are highly misleading and thus not recommended.
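For reference, here is a minimal sketch of timing with explicit synchronization. The helper name timed, the warmup and repeats defaults, and the small GRU with its tensor shapes are all assumptions for illustration, not the model from this thread. It falls back to the CPU when CUDA is not available.

```python
import time

import torch


def timed(fn, warmup=3, repeats=10):
    """Return the mean wall-clock seconds per call of fn().

    The timed region is bracketed by torch.cuda.synchronize() so that all
    queued GPU work is counted inside it (skipped when CUDA is absent).
    """
    use_cuda = torch.cuda.is_available()
    for _ in range(warmup):  # warm-up runs absorb one-time setup costs
        fn()
    if use_cuda:
        torch.cuda.synchronize()  # drain pending work before starting the clock
    start = time.perf_counter()
    for _ in range(repeats):
        fn()
    if use_cuda:
        torch.cuda.synchronize()  # wait for the timed kernels to finish
    return (time.perf_counter() - start) / repeats


device = "cuda" if torch.cuda.is_available() else "cpu"
# Placeholder model and input sizes, chosen only to keep the example small.
model = torch.nn.GRU(input_size=64, hidden_size=32,
                     batch_first=True, bidirectional=True).to(device)
x = torch.randn(8, 16, 64, device=device)


def step():
    model.zero_grad()
    out, _ = model(x)
    out.sum().backward()


elapsed = timed(step)
print(f"{elapsed * 1e3:.3f} ms per forward+backward on {device}")
```

Timing individual submodules would need the same bracketing around each one, which is why per-module numbers taken without synchronization cannot be trusted.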

Anyway, since your execution time is apparently the same, I would like to check the model.
Could you post the model definition and the shapes for the tensors so that we could run it?

The problem no longer exists.
I did not control the other variables when comparing the time cost with and without synchronize, so things like CPU load and sample order may have affected the result.
When I run the code with and without CUDA_LAUNCH_BLOCKING=1 at the same time and with the same random seed, the time costs differ.
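A controlled comparison like that could be scripted roughly as follows. This is only a sketch: the CHILD script is a hypothetical stand-in for the real training script (it does fixed-seed CPU work instead of running a model), and run_once is a made-up helper name. The environment variable is set for a fresh process each time, since it needs to be in place before CUDA is initialized. Unlike the runs described above, this sketch launches the two processes one after the other rather than simultaneously.

```python
import os
import subprocess
import sys

# Hypothetical stand-in for the real training script: it seeds the RNG,
# does a fixed amount of work, and prints the flag value and elapsed seconds.
CHILD = """
import os, random, time
random.seed(0)
start = time.perf_counter()
total = sum(random.random() for _ in range(100_000))
print(os.environ.get("CUDA_LAUNCH_BLOCKING", "unset"), time.perf_counter() - start)
"""


def run_once(blocking):
    """Run the child in a fresh process, with or without CUDA_LAUNCH_BLOCKING=1."""
    env = dict(os.environ)
    env.pop("CUDA_LAUNCH_BLOCKING", None)  # start from a clean state
    if blocking:
        env["CUDA_LAUNCH_BLOCKING"] = "1"
    proc = subprocess.run([sys.executable, "-c", CHILD], env=env,
                          capture_output=True, text=True, check=True)
    flag, seconds = proc.stdout.split()
    return flag, float(seconds)


results = [run_once(blocking) for blocking in (False, True)]
for flag, seconds in results:
    print(f"CUDA_LAUNCH_BLOCKING={flag}: {seconds:.4f} s")
```

Repeating each configuration several times and comparing medians would make the result less sensitive to CPU load.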

Anyway, thank you very much.