The "CUDA semantics" page of the PyTorch 2.1 documentation states that GPU operations are asynchronous:
By default, GPU operations are asynchronous. When you call a function that uses the GPU, the operations are enqueued to the particular device, but not necessarily executed until later. This allows us to execute more computations in parallel, including operations on CPU or other GPUs.
However, when I run the following simple code, the elapsed time is the same whether I use a single GPU or alternate between two.
So I guess only some GPU operations benefit from asynchronous execution, not all of them.
Is that assumption right? Or should I make the GPU operations asynchronous myself with threading or multiprocessing (as nn.DataParallel does)?
import time
import torch


def foo(idx):
    with torch.cuda.device(idx):
        a = torch.randn(1000, 1000)
        b = torch.randn(1000, 1000)
        a *= b


def main():
    st = time.time()
    for _ in range(10):
        foo(0)
    print(time.time() - st)

    time.sleep(10)

    st = time.time()
    for i in range(10):
        foo(i % 2)
    print(time.time() - st)


if __name__ == '__main__':
    main()
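
For comparison, here is a variant I could time instead. It is only a sketch under a few assumptions: as far as I understand, torch.randn without a device argument allocates on the CPU even inside torch.cuda.device, so this version passes device= explicitly, and it calls torch.cuda.synchronize before reading the clock so that queued work is counted. The helper names foo_on_device, timed and pick_devices are my own, not PyTorch APIs, and the script falls back to the CPU when no GPU is visible.

```python
import time
import torch


def foo_on_device(device):
    # Allocate directly on the target device so the multiply runs there.
    a = torch.randn(1000, 1000, device=device)
    b = torch.randn(1000, 1000, device=device)
    a *= b
    return a


def timed(devices, iters=10):
    # Hypothetical helper: cycles through `devices` and returns
    # (time to enqueue the work, time until all devices have finished).
    def sync():
        for d in devices:
            if d.type == "cuda":
                torch.cuda.synchronize(d)

    sync()  # drain any earlier work before starting the clock
    st = time.perf_counter()
    for i in range(iters):
        foo_on_device(devices[i % len(devices)])
    enqueue = time.perf_counter() - st
    sync()  # wait for the queued kernels to complete
    total = time.perf_counter() - st
    return enqueue, total


def pick_devices():
    # Use up to two GPUs if present, otherwise fall back to the CPU.
    n = torch.cuda.device_count() if torch.cuda.is_available() else 0
    if n == 0:
        return [torch.device("cpu")]
    return [torch.device("cuda", i) for i in range(min(n, 2))]


if __name__ == "__main__":
    devices = pick_devices()
    print("one device:  ", timed(devices[:1]))
    print("alternating: ", timed(devices))
```

If the enqueue time comes out noticeably smaller than the total time on a GPU, that would suggest the calls do return before the kernels finish. The first call on each device may also include one-time CUDA initialization, so a warm-up run before timing is probably worthwhile.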