How to parallelize loading and computation

I wonder whether PyTorch will automatically overlap loading and computation.
Here is a sample of the code:

import torch

A = torch.randn(1000, 1000)
B = torch.randn(1000, 1000, device="cuda")
C = torch.randn(1000, 1000, device="cuda")

A = A.to("cuda", non_blocking=True)  # host-to-device copy
D = B.matmul(C)                      # GPU matmul

Will A = A.to("cuda", non_blocking=True) and D = B.matmul(C) be executed in parallel?

No, both will be added to the queue of the default CUDA stream and executed sequentially. You could use custom CUDA streams to execute kernels in parallel (if possible), but you would need to take care of the required synchronizations.
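
For reference, here is a minimal sketch of that pattern (the stream name is illustrative; note that a host-to-device copy can only run asynchronously if the source tensor is in pinned memory):

import torch

A = torch.randn(1000, 1000).pin_memory()  # pinned memory allows a truly asynchronous copy
B = torch.randn(1000, 1000, device="cuda")
C = torch.randn(1000, 1000, device="cuda")

copy_stream = torch.cuda.Stream()

# Issue the copy on a side stream; the matmul goes to the default stream.
with torch.cuda.stream(copy_stream):
    A_gpu = A.to("cuda", non_blocking=True)
D = B.matmul(C)

# Synchronize before the default stream consumes the copied tensor.
torch.cuda.current_stream().wait_stream(copy_stream)
E = A_gpu.matmul(D)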


Hi, thanks for the answer. However, I tried to use CUDA streams to parallelize the two operations, and I don't think they run in parallel at all.
Here is the code for the synchronous (single-stream) version:

import time
import torch

time_counter = 0.0
A = torch.randn(10000, 10000).half()
B = torch.randn(10000, 10000).half().to("cuda", non_blocking=False)
D = torch.randn(10000, 10000).half().to("cuda", non_blocking=False)

for i in range(10):
    torch.cuda.synchronize()
    start_combine = time.time()

    A = A.to("cuda", non_blocking=True)  # host-to-device copy
    F = B.matmul(D)                      # GPU matmul on the default stream
    torch.cuda.synchronize()
    end_combine = time.time()
    time_counter += end_combine - start_combine
    print("combine time: ", end_combine - start_combine)
    A = A.to("cpu")  # move back so the next iteration copies again
print("avg combine time: ", time_counter / 10)

The average combine time is 0.06434 s.
The code for the asynchronous version with two streams is:

import time
import torch

time_counter = 0.0
A = torch.randn(10000, 10000).half()
B = torch.randn(10000, 10000).half().to("cuda", non_blocking=True)
D = torch.randn(10000, 10000).half().to("cuda", non_blocking=True)

# Create CUDA streams
stream1 = torch.cuda.Stream()
stream2 = torch.cuda.Stream()

for i in range(10):
    torch.cuda.synchronize()
    start_combine = time.time()

    with torch.cuda.stream(stream1):
        A = A.to("cuda", non_blocking=True)

    with torch.cuda.stream(stream2):
        F = B.matmul(D)

    torch.cuda.synchronize()
    end_combine = time.time()
    time_counter += end_combine - start_combine  # accumulate for the average
    print("combine time: ", end_combine - start_combine)
    A = A.to("cpu")
print("avg combine time: ", time_counter / 10)

The average time is 0.06712 s.
I do not think there is any parallelism on my GPU.

I have also timed loading and computation separately.
Code for computation:

import time
import torch

time_counter = 0.0

for i in range(10):
    B = torch.randn(10000, 10000).half().to("cuda")
    C = torch.randn(10000, 10000).half().to("cuda")

    torch.cuda.synchronize()
    start_compute = time.time()
    E = B.matmul(C)
    torch.cuda.synchronize()
    end_compute = time.time()
    time_counter += end_compute - start_compute
    print("compute time: ", end_compute - start_compute)
    del E
print("avg compute time: ", time_counter / 10)

The average time is 0.03173 s.

Code for loading:

import time
import torch

time_counter = 0.0
for i in range(10):
    A = torch.randn(10000, 10000).half()
    torch.cuda.synchronize()
    start_load = time.time()
    A = A.to("cuda")  # blocking host-to-device copy
    torch.cuda.synchronize()
    end_load = time.time()
    print("load time: ", end_load - start_load)
    time_counter += end_load - start_load
    del A
print("average load time: ", time_counter / 10)

The average time is 0.03534 s.

Profile the code via Nsight Systems and check if kernels are overlapping and by how much.
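
As a quick check before reaching for Nsight Systems, torch.profiler can also show whether the copy and the matmul overlap. Here is a minimal sketch, reusing the tensors from the snippets above; the pin_memory() call is a suggestion worth testing, since non_blocking=True only overlaps when the source CPU tensor is pinned (with pageable memory the copy is effectively synchronous):

import torch
from torch.profiler import profile, ProfilerActivity

A = torch.randn(10000, 10000).half().pin_memory()  # pinned, so the copy can be async
B = torch.randn(10000, 10000).half().to("cuda")
D = torch.randn(10000, 10000).half().to("cuda")

stream1 = torch.cuda.Stream()
stream2 = torch.cuda.Stream()

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    with torch.cuda.stream(stream1):
        A_gpu = A.to("cuda", non_blocking=True)
    with torch.cuda.stream(stream2):
        F = B.matmul(D)
    torch.cuda.synchronize()

prof.export_chrome_trace("overlap_trace.json")

The exported trace can be opened in chrome://tracing or Perfetto to see whether the copy and the matmul occupy overlapping time ranges.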
