I wonder whether PyTorch automatically overlaps data loading (host-to-device copies) with computation. Here is a sample of the code.
import torch

A = torch.randn(1000, 1000)
B = torch.randn(1000, 1000).to("cuda")
C = torch.randn(1000, 1000).to("cuda")

A = A.to("cuda", non_blocking=True)  # host-to-device copy
D = B.matmul(C)                      # GPU matmul
Will A = A.to("cuda", non_blocking=True) and D = B.matmul(C) be executed concurrently?
No, both will be added to the queue of the default CUDAStream and executed sequentially. You could use custom CUDAStreams to execute kernels in parallel (if possible), but you would then need to take care of the required synchronizations, as in the sketch below.
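For example, here is a minimal sketch of launching the copy and an independent matmul on two streams, with an explicit dependency before the copied tensor is used again (the tensor names and the final sum are illustrative; torch.cuda.Stream and wait_stream are the standard synchronization primitives):

import torch

x = torch.randn(1000, 1000)
w = torch.randn(1000, 1000, device="cuda")

copy_stream = torch.cuda.Stream()
compute_stream = torch.cuda.Stream()

with torch.cuda.stream(copy_stream):
    x = x.to("cuda", non_blocking=True)  # enqueue the copy on copy_stream

with torch.cuda.stream(compute_stream):
    y = w.matmul(w)  # independent work that may run concurrently

# Make the default stream wait for the copy before x is used on it
torch.cuda.current_stream().wait_stream(copy_stream)
z = x.sum()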
Hi, thanks for the answer. However, I tried to use torch.cuda.stream to parallelize the two operations, and I don't think they run in parallel at all. Here is the code for the synchronous single-stream version:
import time

import torch

time_counter = 0.0
A = torch.randn(10000, 10000).half()
B = torch.randn(10000, 10000).half().to("cuda", non_blocking=False)
D = torch.randn(10000, 10000).half().to("cuda", non_blocking=False)
for i in range(10):
    torch.cuda.synchronize()
    start_combine = time.time()
    A = A.to("cuda", non_blocking=True)  # host-to-device copy
    F = B.matmul(D)                      # GPU matmul
    torch.cuda.synchronize()
    end_combine = time.time()
    time_counter += end_combine - start_combine
    print("combine time: ", end_combine - start_combine)
    A = A.to("cpu")  # move A back so the next iteration copies it again
print("avg combine time: ", time_counter / 10)
The avg combine time is 0.06434 s.
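As a side note, torch.cuda.Event with enable_timing=True measures the elapsed time on the GPU itself, which avoids some host-timer noise; a minimal sketch, assuming B and D from the snippet above:

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

start.record()
F = B.matmul(D)
end.record()
torch.cuda.synchronize()  # wait until both events have been recorded
print("matmul time (ms): ", start.elapsed_time(end))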
The code for the asynchronous two-stream version is:
import time

import torch

time_counter = 0.0
A = torch.randn(10000, 10000).half()
B = torch.randn(10000, 10000).half().to("cuda", non_blocking=True)
D = torch.randn(10000, 10000).half().to("cuda", non_blocking=True)

# Create CUDA streams
stream1 = torch.cuda.Stream()
stream2 = torch.cuda.Stream()

for i in range(10):
    torch.cuda.synchronize()
    start_combine = time.time()
    with torch.cuda.stream(stream1):
        A = A.to("cuda", non_blocking=True)  # copy on stream1
    with torch.cuda.stream(stream2):
        F = B.matmul(D)  # matmul on stream2
    torch.cuda.synchronize()
    end_combine = time.time()
    time_counter += end_combine - start_combine
    print("combine time: ", end_combine - start_combine)
    A = A.to("cpu")  # move A back so the next iteration copies it again
print("avg combine time: ", time_counter / 10)
The avg time is 0.06712 s, so I do not think there is any parallelism on my GPU.
I have also timed the loading and the computation separately. Code for the computation:
import time

import torch

time_counter = 0.0
for i in range(10):
    B = torch.randn(10000, 10000).half().to("cuda")
    C = torch.randn(10000, 10000).half().to("cuda")
    # torch.cuda.synchronize()
    start_compute = time.time()
    E = B.matmul(C)
    torch.cuda.synchronize()
    end_compute = time.time()
    time_counter += end_compute - start_compute
    print("compute time: ", end_compute - start_compute)
    del E
print("avg compute time: ", time_counter / 10)
The avg time is 0.03173 s. Code for the loading:
import time

import torch

time_counter = 0.0
for i in range(10):
    A = torch.randn(10000, 10000).half()
    torch.cuda.synchronize()
    start_load = time.time()
    A = A.to("cuda")
    torch.cuda.synchronize()
    end_load = time.time()
    print("load time: ", end_load - start_load)
    A = A.to("cpu")
    del A
    time_counter += end_load - start_load
print("average load time: ", time_counter / 10)
The avg time is 0.03534 s.
Profile the code via Nsight Systems and check whether the copies and kernels actually overlap, and by how much.
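One thing to check first: a host-to-device copy can only overlap with compute when the source tensor is in pinned (page-locked) host memory; torch.randn allocates pageable memory, so the non_blocking=True copies above may not actually run asynchronously. A minimal sketch using the standard pin_memory() call:

import torch

A = torch.randn(10000, 10000).half().pin_memory()  # page-locked host memory
stream = torch.cuda.Stream()
with torch.cuda.stream(stream):
    A = A.to("cuda", non_blocking=True)  # can now overlap with work on other streams
torch.cuda.synchronize()

For the profiling itself, something like nsys profile -o report python script.py (where script.py is whatever file holds the benchmark) produces a timeline you can open in the Nsight Systems GUI to inspect the copy and compute streams.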