Is there a non-blocking LSTM?

As I learned from the documentation, “By default, GPU operations are asynchronous. When you call a function that uses the GPU, the operations are enqueued to the particular device, but not necessarily executed until later. This allows us to execute more computations in parallel, including operations on CPU or other GPUs.”
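For reference, here is a minimal sketch (assuming a CUDA device is available) of how timing is usually done in the presence of this asynchrony: CUDA events are recorded on the stream, so the measured time reflects actual device execution rather than just the kernel launches.

import torch

device = torch.device("cuda")
x = torch.rand(1024, 1024, device=device)

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

start.record()            # enqueued on the stream, not executed immediately
y = x @ x                 # the launch returns to the host right away
end.record()
torch.cuda.synchronize()  # block until all queued work has finished
print(start.elapsed_time(end))  # elapsed GPU time in milliseconds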

However, I find that nn.LSTM does not seem to run asynchronously, and I don’t understand why. See the examples below.

LSTM:

import time
import torch
import torch.nn as nn

device = torch.device("cuda")

# input of shape (seq_len, batch, input_size)
a = torch.rand(1000, 20, 10000).to(device)
net = nn.LSTM(10000, 100).to(device)
torch.cuda.synchronize()
t = time.time()
with torch.no_grad():
    for i in range(10):
        c = net(a)
# measured right after the forward calls return, before synchronizing
print(time.time() - t)
torch.cuda.synchronize()
# measured after all queued kernels have finished
print(time.time() - t)

Output:

0.3161756992340088
0.3302645683288574

Linear:

# same setup as above, but with nn.Linear instead of nn.LSTM
a = torch.rand(1000, 20, 10000).to(device)
net = nn.Linear(10000, 100).to(device)
torch.cuda.synchronize()
t = time.time()
with torch.no_grad():
    for i in range(10):
        c = net(a)
# almost zero before synchronizing: the launches return immediately
print(time.time() - t)
torch.cuda.synchronize()
# the actual execution time only shows up after synchronizing
print(time.time() - t)

Output:

0.0007715225219726562
0.02486276626586914

To get more information about both workloads, I would recommend profiling them via e.g. Nsight Systems and checking the timelines for both runs. This would let you see any synchronizations as well as how many kernels are executed for each workload. One possible explanation is that the host can only queue a limited number of kernel launches before it blocks, and the LSTM launches far more (and much smaller) kernels per forward pass than the single matmul of the linear layer, so its launch queue may simply fill up.
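With Nsight Systems, that could look like the following (the script name is just a placeholder for your benchmark above):

nsys profile -t cuda -o lstm_report python lstm_test.py

If you'd rather stay inside PyTorch, torch.profiler gives a first overview of the launched kernels. A minimal sketch for the LSTM case:

import torch
from torch.profiler import profile, ProfilerActivity

device = torch.device("cuda")
a = torch.rand(1000, 20, 10000, device=device)
net = torch.nn.LSTM(10000, 100).to(device)

# record both host-side launches and device-side kernel times
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    with torch.no_grad():
        for i in range(10):
            c = net(a)

# the kernel count and per-kernel times hint at where the host is blocking
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))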