GPU operations do not seem to be asynchronous

unbreading · December 7, 2019, 1:59pm

Question: GPU operations are not asynchronous in my case.

Description:
I run something like
t = time.time()
loss = model(x)
loss.backward()
cost = time.time() - t
but I get almost the same result with and without torch.cuda.synchronize().
I have called .cuda() on the model, so the model is on the GPU.
There should be no GPU-CPU transfer (i.e. .cpu() or .cuda()) in the model’s forward() method.

It seems that GPU operations are not asynchronous in my case.
Why?
Or how can I check whether I mistakenly sync during the model’s forward() method?

ptrblck · December 7, 2019, 8:09pm

Some operations like .item() will add a synchronization point in your code.
Could you post the model definition, so that we could have a look for unwanted sync points?
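
One way to look for such sync points, as a hedged sketch: newer PyTorch releases (newer than what was available when this thread was written) expose torch.cuda.set_sync_debug_mode, which can warn whenever an operation forces the host to wait for the GPU. The helper name warn_on_sync and the toy function step_with_hidden_sync are hypothetical and only serve as an illustration. On a machine without CUDA the helper does nothing.

```python
import contextlib

import torch


@contextlib.contextmanager
def warn_on_sync():
    """Ask PyTorch to warn about synchronizing CUDA calls inside the block."""
    if not torch.cuda.is_available():
        yield  # no CUDA device: nothing to monitor
        return
    torch.cuda.set_sync_debug_mode("warn")
    try:
        yield
    finally:
        torch.cuda.set_sync_debug_mode("default")  # restore the default mode


def step_with_hidden_sync(x):
    # Hypothetical stand-in for a forward pass that hides a sync point.
    loss = (x * 2).sum()
    # .item() copies the scalar to the host, so the CPU has to wait for the GPU.
    return loss.item()


device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.ones(4, 4, device=device)
with warn_on_sync():
    value = step_with_hidden_sync(x)  # on CUDA this call should trigger a warning
```

Wrapping the real forward() call in the same context manager should point at the calls that block the host, assuming the installed PyTorch version provides this debug mode.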

Also, how large is your workload?

SimonW · December 7, 2019, 10:03pm

IIRC, backward is a synchronization point in PyTorch.

unbreading · December 8, 2019, 4:29am

I checked all parts of my model by printing their execution time without torch.cuda.synchronize().
One part, with a GRU and LayerNorm, takes about 100x longer than the other parts.

The code in this part is something like:

v1 = self._gru(self._ln1(v1 + v0))
v2 = self._gru(self._ln2(v2 + v0))
v3 = self._gru(self._ln3(v3 + v0))

Here self._ln1, self._ln2, and self._ln3 are instances of nn.LayerNorm,
and self._gru is a Residual-GRU with the following code:

import torch.nn as nn


class ResidualGRU(nn.Module):
    def __init__(self, hidden_size, dropout, num_layers):
        super(ResidualGRU, self).__init__()
        self.enc_layer = nn.GRU(input_size=hidden_size, hidden_size=hidden_size // 2, num_layers=num_layers,
                                batch_first=True, dropout=dropout, bidirectional=True)
        self.enc_ln = nn.LayerNorm(hidden_size)

    def forward(self, input):
        output, _ = self.enc_layer(input)
        return self.enc_ln(output + input)

May I ask whether the GRU causes a sync? Or what is wrong with this code?

unbreading · December 8, 2019, 4:35am

But I still get the same result even when I time forward and backward separately.

And it seems that backward should be asynchronous, since it is a GPU operation?

ptrblck · December 8, 2019, 6:01am

You cannot time separate modules without synchronization: the host timer only catches up with the GPU at the next sync point (e.g. before the next iteration), so the accumulated time is attributed to whichever module happens to hit that sync.
Timings taken this way are highly misleading and are thus not recommended.
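
A minimal sketch of per-module timing with explicit synchronization, to illustrate the point above. The helper name timed, the layer sizes, and the tensor shapes are made up for this example and are not taken from the posted model. It times the forward pass only, and falls back to plain wall-clock timing when no GPU is present.

```python
import time

import torch
import torch.nn as nn


def timed(module, *inputs):
    """Run module(*inputs) and return (output, seconds), syncing before and after."""
    on_cuda = torch.cuda.is_available()
    if on_cuda:
        torch.cuda.synchronize()  # drain work queued by earlier modules
    start = time.perf_counter()
    with torch.no_grad():  # forward pass only in this sketch
        output = module(*inputs)
    if on_cuda:
        torch.cuda.synchronize()  # wait until this module's kernels have finished
    return output, time.perf_counter() - start


device = "cuda" if torch.cuda.is_available() else "cpu"
ln = nn.LayerNorm(64).to(device)
gru = nn.GRU(input_size=64, hidden_size=32, batch_first=True,
             bidirectional=True).to(device)
x = torch.randn(8, 16, 64, device=device)  # (batch, sequence, features), made-up shape

normed, ln_seconds = timed(ln, x)
(gru_out, _), gru_seconds = timed(gru, normed)
print(f"LayerNorm: {ln_seconds:.6f}s  GRU: {gru_seconds:.6f}s")
```

Without the two synchronize() calls, the numbers would mostly reflect how long it took to queue the kernels, plus any wait at whatever sync point came next.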

Anyway, since your execution time is apparently the same, I would like to check the model.
Could you post the model definition and the shapes for the tensors so that we could run it?

unbreading · December 9, 2019, 12:44pm

The problem no longer exists.
I did not control for other variables when comparing the time cost with and without synchronize, so things like CPU load, sample order, etc. may have affected the result.
When I run the code with and without CUDA_LAUNCH_BLOCKING=1 at the same time with the same random seed, the time costs are different.
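
For anyone who wants to repeat this kind of comparison, here is a rough sketch of a more controlled measurement: fixed seed, a few warm-up iterations, and the median over several runs. The function median_step_time and the toy workload are hypothetical, not the original model. CUDA_LAUNCH_BLOCKING is assumed to be set in the environment before the process starts (e.g. CUDA_LAUNCH_BLOCKING=1 python script.py), and the script only reads it back for logging.

```python
import os
import statistics
import time

import torch


def median_step_time(step, make_input, sync, seed=0, warmup=3, repeats=10):
    """Median wall-clock seconds of step(x) over identically seeded runs."""
    on_cuda = torch.cuda.is_available()
    times = []
    for i in range(warmup + repeats):
        torch.manual_seed(seed)  # same random input on every run
        x = make_input()
        if on_cuda:
            torch.cuda.synchronize()  # start each measurement from an idle GPU
        start = time.perf_counter()
        step(x)
        if on_cuda and sync:
            torch.cuda.synchronize()  # include kernel execution in the measurement
        elapsed = time.perf_counter() - start
        if i >= warmup:  # discard warm-up iterations
            times.append(elapsed)
    return statistics.median(times)


device = "cuda" if torch.cuda.is_available() else "cpu"
weight = torch.randn(256, 256, device=device)


def step(x):
    # Toy workload standing in for a forward pass.
    return (x @ weight).relu().sum()


def make_input():
    return torch.randn(256, 256, device=device)


with_sync = median_step_time(step, make_input, sync=True)
without_sync = median_step_time(step, make_input, sync=False)
blocking = os.environ.get("CUDA_LAUNCH_BLOCKING", "0")
print(f"CUDA_LAUNCH_BLOCKING={blocking}  "
      f"with sync: {with_sync:.6f}s  without sync: {without_sync:.6f}s")
```

Using the median rather than a single run should reduce the influence of CPU load and other noise between runs.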

Anyway, thank you very much.