Assume two functions both depend on the tensor x: f1(x) and f2(x).
In f1, the computation looks like this (f2 is analogous):

def f1(x):
    for layer in layers:
        x = layer(x)
    return x
I found that GPU utilization is low when executing f1 and f2 sequentially.
Since there is no data dependency between f1 and f2, can they run in parallel to fully leverage the GPU?
I know PyTorch runs asynchronously when there is no dataflow dependency, e.g. a1 = linear(x); a2 = linear(x). These could be executed asynchronously by pushing the operations into the CUDA stream, right?
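For example, a quick check like the following (the layer sizes and shapes are arbitrary placeholders) shows that the host returns from the launches before the GPU has finished:

import time
import torch

linear = torch.nn.Linear(4096, 4096).cuda()
x = torch.randn(64, 4096, device="cuda")

torch.cuda.synchronize()
t0 = time.perf_counter()
a1 = linear(x)
a2 = linear(x)
t1 = time.perf_counter()   # returns almost immediately: the kernels are only queued
torch.cuda.synchronize()   # wait until the GPU has actually finished
t2 = time.perf_counter()
print(f"launch time: {t1 - t0:.6f}s, total time: {t2 - t0:.6f}s")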
But since there are data dependencies inside the for loops of f1 and f2, can f1(x) and f2(x) still be executed asynchronously? At least the profiler tells me they are executed sequentially.
I also tried running one of the functions in a separate thread, but the total execution time increased.
So is there any way, like torch.jit.fork, to run two functions in parallel in eager mode during training?
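Roughly the pattern I have in mind, just as an illustration (f1, f2, and x are the same placeholders as above):

fut = torch.jit.fork(f1, x)   # ideally runs f1 in the background
out2 = f2(x)                  # f2 runs in the current thread meanwhile
out1 = torch.jit.wait(fut)    # block until f1's result is ready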
Yes, potentially:
- in different CUDAStreams
- if the kernel execution time is longer than the time the CPU needs to schedule the kernel (which might not be the case if the GPU workload is tiny or if your CPU is slow)
- if enough compute resources are available on the GPU.
PyTorch schedules the kernels and executes them asynchronously w.r.t. the host. Both layers will still be executed sequentially by default on the GPU unless you use custom CUDAStreams as described before.
Since the same default CUDAStream is used, there are no issues with data dependencies.
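A minimal sketch of the two-stream pattern could look like this (f1, f2, and the shapes are placeholders, not your actual model):

import torch

# Placeholders standing in for the real functions and input.
f1 = torch.nn.Sequential(torch.nn.Linear(1024, 1024), torch.nn.ReLU()).cuda()
f2 = torch.nn.Sequential(torch.nn.Linear(1024, 1024), torch.nn.ReLU()).cuda()
x = torch.randn(32, 1024, device="cuda")

s1 = torch.cuda.Stream()
s2 = torch.cuda.Stream()

# The side streams wait for work already queued on the default stream (e.g. the op producing x).
s1.wait_stream(torch.cuda.current_stream())
s2.wait_stream(torch.cuda.current_stream())

with torch.cuda.stream(s1):
    out1 = f1(x)
with torch.cuda.stream(s2):
    out2 = f2(x)

# The default stream waits for both side streams before consuming the results.
torch.cuda.current_stream().wait_stream(s1)
torch.cuda.current_stream().wait_stream(s2)
loss = out1.sum() + out2.sum()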
Multithreading in Python is often a bad idea due to the GIL.
Many thanks for your reply! I’ll give it a try~
Sorry to bother you again. I tried your suggestion, but I found the total training time increased. I instantiated a CUDA stream as follows:
class A(nn.Module):
    def __init__(self):
        ...
        self.parallel_stream = torch.cuda.Stream()

    def forward(self, x):
        # func1 runs on the side stream, func2 on the default stream
        with torch.cuda.stream(self.parallel_stream):
            loss1 = func1(x)
        loss2 = func2(x)
        torch.cuda.synchronize()
        return loss1 + loss2
Is it possible that the time the CPU needs to schedule the kernels is close to the kernel execution time? The model logic inside func1 is pretty complicated, as it works in a dynamic-programming style, but there are no operations that need to synchronize with the GPU, such as tensor.to(device). Is it possible to launch func1 in another stream in a real new thread without the GIL (e.g. with the help of C++ or Cython)?
My environment: A100 + Intel(R) Xeon(R) Platinum 8369B CPU @ 2.90GHz
Profile your approach via Nsight Systems and check the kernel execution and overlap on the timeline view.
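For example, you could wrap both branches in NVTX ranges so they are easy to spot on the timeline and then capture a trace with e.g. nsys profile -o report python train.py (a rough sketch; the names, shapes, and script name are placeholders):

import torch

# func1/func2/x are stand-ins for the modules and input from the previous snippet.
func1 = torch.nn.Linear(1024, 1024).cuda()
func2 = torch.nn.Linear(1024, 1024).cuda()
x = torch.randn(32, 1024, device="cuda")

# Mark each branch with an NVTX range so it is easy to locate on the Nsight Systems timeline.
torch.cuda.nvtx.range_push("func1")
loss1 = func1(x).sum()
torch.cuda.nvtx.range_pop()

torch.cuda.nvtx.range_push("func2")
loss2 = func2(x).sum()
torch.cuda.nvtx.range_pop()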