Model training not asynchronous as expected

Various parts of the documentation say that CUDA operations are supposed to be asynchronous, but I am not seeing such behavior. Here is simple code to reproduce it:

import torch
import torch.nn as nn


class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()

        self.net = nn.Sequential(
            nn.Linear(1000, 100),
            nn.ReLU(),
            nn.Linear(100, 100),
            nn.ReLU(),
            nn.Linear(100, 1000),
            nn.ReLU(),
        )

    def forward(self, x):
        return self.net(x)

model = Net()
model.cuda()

x = torch.randn(360, 1000).cuda()
y = torch.randn(360, 1000).cuda()

The following code runs in 1.2 s:

from time import time
t = time()
for i in range(10000):
    model(x)
    # model(y)
print(time()-t)

Running the model forward twice per iteration results in 2.3 s:

t = time()
for i in range(10000):
    model(x)
    model(y)
print(time()-t)

Since the model is quite small, its forward pass should not saturate the GPU. I also don't think there is any synchronization point in the model, so I would expect a similar run time, not a doubled one.
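
One thing to check (a sketch, not part of the timings above) is whether the loop only measures the kernel launches: synchronizing before and after the loop makes the measurement cover the actual GPU work.

torch.cuda.synchronize()
t = time()
for i in range(10000):
    model(x)
    model(y)
# Wait for all queued kernels so the elapsed time includes the GPU work,
# not only the CPU-side launches.
torch.cuda.synchronize()
print(time() - t)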

Am I missing something? Thanks for any reply.

To see the async execution you could profile the workload with the PyTorch profiler or e.g. Nsight Systems and check:

  • whether the workload on the GPU is large enough to let the CPU run ahead
  • whether the CPU is blocked by some unexpected operations
  • whether the actual kernel launches block the CPU due to a tiny workload (see the first point)

A simple example is given here:

import time
import torch

# A large batched matmul keeps the GPU busy long enough to show the effect.
x = torch.randn(1024, 1024, 1024, device='cuda')

t0 = time.perf_counter()
y = torch.matmul(x, x)
t1 = time.perf_counter()
print('no sync {}'.format(t1 - t0))
# no sync 0.00022597299539484084

torch.cuda.synchronize()
t0 = time.perf_counter()
y = torch.matmul(x, x)
torch.cuda.synchronize()
t1 = time.perf_counter()
print('sync {}'.format(t1 - t0))
# sync 0.07258635200560093
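
For reference, a minimal torch.profiler sketch (assuming the Net model and the x input from the question) could look like this; the resulting table shows CUDA kernel time next to CPU time, which makes it easy to see whether the tiny kernels or the launch overhead dominate:

import torch
from torch.profiler import profile, ProfilerActivity

# Profile a handful of forward passes on the GPU.
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for _ in range(100):
        model(x)
torch.cuda.synchronize()

# Summarize per-operator CPU vs. CUDA time.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))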