Which operations benefit from asynchronous GPU execution?

The CUDA semantics — PyTorch 2.1 documentation states that GPU operations are asynchronous:

By default, GPU operations are asynchronous. When you call a function that uses the GPU, the operations are enqueued to the particular device, but not necessarily executed until later. This allows us to execute more computations in parallel, including operations on CPU or other GPUs.
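This enqueue behaviour is easy to observe directly: a kernel launch returns to Python almost immediately, and the real execution time only shows up after a synchronization. A minimal sketch, assuming a CUDA-capable GPU is available (the matrix size is arbitrary):

import time
import torch

x = torch.randn(4000, 4000, device='cuda')

torch.cuda.synchronize()   # make sure the GPU is idle before timing
st = time.time()
y = x @ x                  # enqueued on the GPU; control returns right away
print('after launch:', time.time() - st)

torch.cuda.synchronize()   # block until the matmul has actually finished
print('after sync:', time.time() - st)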

However, when I run the following simple code, the elapsed time was the same for both loops.

So I guess only some GPU operations can benefit from asynchronous execution, not all of them.

Is that assumption right? Or should I instead make GPU operations asynchronous myself using threading or multiprocessing (as nn.DataParallel does)?

import time
import torch

def foo(idx):
    # torch.cuda.device(idx) only switches the current CUDA device;
    # the tensors still need device='cuda' to actually land on the GPU.
    with torch.cuda.device(idx):
        a = torch.randn(1000, 1000, device='cuda')
        b = torch.randn(1000, 1000, device='cuda')
        a *= b

def main():
    # 10 iterations on a single GPU
    st = time.time()
    for _ in range(10):
        foo(0)
    print(time.time() - st)

    time.sleep(10)

    # 10 iterations alternating between two GPUs
    st = time.time()
    for i in range(10):
        foo(i % 2)
    print(time.time() - st)

if __name__ == '__main__':
    main()

Since you are trying to time CUDA operations, you should synchronize before starting and stopping your timer using torch.cuda.synchronize():

torch.cuda.synchronize()   # wait for all pending kernels before starting the timer
st = time.time()
...
torch.cuda.synchronize()   # wait for the timed kernels to actually finish
print(time.time() - st)

Without these synchronizations you are just measuring the time it takes to launch the kernels, not the time they take to execute.
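Applied to your script, a minimal corrected main() might look like this (a sketch, reusing the foo from above and assuming two visible GPUs for the second loop):

def main():
    torch.cuda.synchronize()    # ensure the GPU is idle before timing
    st = time.time()
    for _ in range(10):
        foo(0)
    torch.cuda.synchronize()    # wait for the enqueued kernels to finish
    print(time.time() - st)

    torch.cuda.synchronize(0)   # both devices idle before timing
    torch.cuda.synchronize(1)
    st = time.time()
    for i in range(10):
        foo(i % 2)
    torch.cuda.synchronize(0)   # wait for the work queued on each GPU
    torch.cuda.synchronize(1)
    print(time.time() - st)

With the synchronizations in place, the second loop can come out faster than the first, since kernels queued on the two devices are free to execute concurrently.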