Various documentation says CUDA operations are supposed to be asynchronous, but I am not seeing such behavior. Here is a simple script to reproduce:
```python
import torch
import torch.nn as nn

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.net = nn.Sequential(
            nn.Linear(1000, 100),
            nn.ReLU(),
            nn.Linear(100, 100),
            nn.ReLU(),
            nn.Linear(100, 1000),
            nn.ReLU(),
        )

    def forward(self, x):
        return self.net(x)

model = Net()
model.cuda()
x = torch.randn(360, 1000).cuda()
y = torch.randn(360, 1000).cuda()
```
The following loop runs in 1.2 s:
```python
from time import time

t = time()
for i in range(10000):
    model(x)
    # model(y)
print(time() - t)
```
Running the forward pass twice per iteration, however, takes 2.3 s:
```python
t = time()
for i in range(10000):
    model(x)
    model(y)
print(time() - t)
```
Since the model is quite small, its forward pass presumably won't saturate the GPU. I also don't think there is any synchronization point inside the model, so I would expect the two loops to take roughly the same wall-clock time, not double.
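To rule out timer artifacts, one could also measure with an explicit `torch.cuda.synchronize()` before starting and stopping the clock, so the wall-clock time covers all queued GPU work rather than just kernel launches. This is only a sketch of such a helper (the guard makes it degrade to plain CPU timing when no GPU is present):

```python
import time
import torch

def timed(fn, iters=100):
    """Time `iters` calls of `fn`, synchronizing the GPU at both ends."""
    if torch.cuda.is_available():
        # Drain any previously queued kernels before starting the clock.
        torch.cuda.synchronize()
    t = time.time()
    for _ in range(iters):
        fn()
    if torch.cuda.is_available():
        # Wait for all kernels launched in the loop to actually finish.
        torch.cuda.synchronize()
    return time.time() - t
```

With this helper, comparing `timed(lambda: model(x))` against `timed(lambda: (model(x), model(y)))` measures completed GPU work in both cases.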
Am I missing something? Thanks for any reply.