Model training not asynchronous as expected

Various pieces of documentation say that CUDA operations are supposed to be asynchronous, but I am not seeing such behavior. Simple code to reproduce:

import torch
import torch.nn as nn


class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()

        self.net = nn.Sequential(
            nn.Linear(1000, 100),
            nn.ReLU(),
            nn.Linear(100, 100),
            nn.ReLU(),
            nn.Linear(100, 1000),
            nn.ReLU(),
        )

    def forward(self, x):
        self.net(x)
        return

model = Net()
model.cuda()

x = torch.randn(360, 1000).cuda()
y = torch.randn(360, 1000).cuda()

The following code's run time is 1.2 s:

from time import time
t = time()
for i in range(10000):
    model(x)
    # model(y)
print(time()-t)

Running the model forward twice per iteration, however, results in 2.3 s:

t = time()
for i in range(10000):
    model(x)
    model(y)
print(time()-t)

Since the model is quite small, its forward pass should not saturate the GPU. I also don't think there is any synchronization point in the model, so I would expect a similar run time, not a doubled one.

Am I missing something? Thanks for any reply.

To see the async execution you could profile the workload with the PyTorch profiler or e.g. Nsight Systems and check:

  • whether the workload on the GPU is large enough to let the CPU run ahead
  • whether the CPU is blocked by some unexpected operations
  • whether the actual kernel launches block the CPU due to a tiny workload (see the first point)

A simple example is given here:

import time
import torch

x = torch.randn(1024, 1024, 1024, device='cuda')

t0 = time.perf_counter()
y = torch.matmul(x, x)
t1 = time.perf_counter()
print('no sync {}'.format(t1 - t0))
# no sync 0.00022597299539484084

torch.cuda.synchronize()
t0 = time.perf_counter()
y = torch.matmul(x, x)
torch.cuda.synchronize()
t1 = time.perf_counter()
print('sync {}'.format(t1 - t0))
# sync 0.07258635200560093
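
For a more detailed view than manual timing, the same matmul workload could be inspected with the PyTorch profiler; this is only a minimal sketch (the iteration count and sorting key are arbitrary choices), but it records both the CPU-side launches and the GPU kernel times so you can compare them:

import torch
from torch.profiler import profile, ProfilerActivity

x = torch.randn(1024, 1024, 1024, device='cuda')

# Capture CPU ops (kernel launches) and CUDA kernels for a few iterations.
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for _ in range(10):
        y = torch.matmul(x, x)
    torch.cuda.synchronize()

# Compare CPU time vs. CUDA time: if the CPU time per launch is close to the
# kernel time, the CPU cannot run ahead and execution looks synchronous.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
# prof.export_chrome_trace("trace.json")  # optional: inspect the timeline in a trace viewer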

@ptrblck

import torch
import time
from torchvision import models

device = torch.device("cuda:1")
model = models.resnet50().to(device).eval()

def async_resnet50():
    with torch.no_grad():
        img = torch.randn(1, 3, 224, 224, device=device, dtype=torch.float32)
        t1 = time.perf_counter()
        y = model(img)
        t2 = time.perf_counter()  # timer stops before the sync, so only the launch overhead is measured
        torch.cuda.synchronize()
    return t2 - t1


def sync_resnet50():
    with torch.no_grad():
        img = torch.randn(1, 3, 224, 224, device=device, dtype=torch.float32)
        t1 = time.perf_counter()
        y = model(img)
        torch.cuda.synchronize()  # wait for the GPU to finish before stopping the timer
        t2 = time.perf_counter()
    return t2 - t1


def run_test(run, rounds=200):
    t_total = 0
    for _ in range(rounds):
        t_ = run()
        t_total += t_
    return t_total / rounds

if __name__ == "__main__":
    run_test(sync_resnet50)  # warm up
    t1 = run_test(sync_resnet50)
    t2 = run_test(async_resnet50)
    print(t1, t2)
# 0.0012210424989461898 0.0012605752144008876

The torchvision model does not seem to be running asynchronously as expected. Is that normal?

Your workload might be CPU-limited and thus the kernel scheduling might not be fast enough. Also, your code will implicitly synchronize if the kernel queue is already saturated.
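
A minimal way to check this, assuming the resnet50 setup from the snippet above (the helper name here is made up), is to compare the host-side launch time with the GPU execution time measured via CUDA events:

def check_cpu_limited(rounds=200):
    # Hypothetical helper: reuses `model`, `device`, `time`, and `torch` from above.
    img = torch.randn(1, 3, 224, 224, device=device)
    start_evt = torch.cuda.Event(enable_timing=True)
    end_evt = torch.cuda.Event(enable_timing=True)
    cpu_s, gpu_s = 0.0, 0.0
    with torch.cuda.device(device), torch.no_grad():
        for _ in range(rounds):
            torch.cuda.synchronize(device)
            start_evt.record()
            t0 = time.perf_counter()
            y = model(img)
            t1 = time.perf_counter()  # host returns here; kernels may still be queued
            end_evt.record()
            torch.cuda.synchronize(device)
            cpu_s += t1 - t0
            gpu_s += start_evt.elapsed_time(end_evt) / 1000.0  # elapsed_time is in ms
    # If the average launch time is close to (or larger than) the GPU time,
    # the loop is CPU-limited and there is little left for the CPU to overlap.
    print(f"avg CPU launch: {cpu_s / rounds:.6f}s, avg GPU: {gpu_s / rounds:.6f}s")

check_cpu_limited()

If the launch time dominates, giving the GPU more work per iteration (e.g. a larger batch) is what actually restores the overlap.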