The two modules x1 and x2 in the example will run sequentially on the same CUDA stream.
Kernels launched on the same stream are not pipelined. All blocks from the first kernel must complete before any blocks from the second kernel can be issued. This is the most basic level of synchronization provided by CUDA.
In order to pipeline two modules, PyTorch would need to run the underlying CUDA kernels on different streams. Then blocks from the second module could begin issuing as soon as all blocks from the first module have issued.
There is a PyTorch API for this: torch.cuda.Stream objects, selected with the torch.cuda.stream context manager. Apparently the backward pass respects the stream assignments, too. See https://github.com/pytorch/pytorch/pull/8354
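As a rough illustration of using streams directly, here is a minimal sketch. The helper name run_on_two_streams is made up for this post, and it assumes each block takes one tensor and returns one tensor. On a CPU tensor it just runs the blocks one after the other, since there are no streams to use.

```python
import torch
import torch.nn as nn


def run_on_two_streams(block1, block2, x):
    """Hypothetical helper: run two modules on separate CUDA streams."""
    if not x.is_cuda:
        # No CUDA streams on CPU: plain sequential execution.
        return block1(x), block2(x)

    current = torch.cuda.current_stream(x.device)
    s1 = torch.cuda.Stream(device=x.device)
    s2 = torch.cuda.Stream(device=x.device)

    # The side streams must not start before x has been produced.
    s1.wait_stream(current)
    s2.wait_stream(current)

    with torch.cuda.stream(s1):
        x1 = block1(x)
    with torch.cuda.stream(s2):
        x2 = block2(x)

    # The original stream waits for both branches before using the results.
    current.wait_stream(s1)
    current.wait_stream(s2)

    # x1 and x2 were allocated on the side streams but will be consumed on
    # `current`; tell the caching allocator so it does not reuse them early.
    x1.record_stream(current)
    x2.record_stream(current)
    return x1, x2
```

Whether the two kernels actually overlap depends on how much of the GPU each one occupies; separate streams only make overlap possible.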
One could imagine a simpler API: create a module called Parallel that takes a list of modules, runs each on a separate stream, and synchronizes at the end. It would look like this:
def __init__(self, x):
    ...
    self.branches = nn.Parallel(block1, block2)
    ...

def forward(self, x):
    x1, x2 = self.branches(x, x)
    x = torch.mm(x1, x2)
    return x
You could build large parallel branches by composing Parallel with Sequential.
I think this API might be simpler than using streams directly, because we usually don’t care which streams are actually used for different modules. We just want to indicate the parallelism in the model and have the framework exploit it automatically.
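To make the idea concrete, here is a rough sketch of what such a module could look like. Parallel is not part of torch.nn; this version is my own guess at an implementation. It assumes one input tensor per branch, one output tensor per branch, and all inputs on the same device. It also creates fresh streams on every call, which a real implementation would probably cache.

```python
import torch
import torch.nn as nn


class Parallel(nn.Module):
    """Hypothetical module: runs each branch on its own CUDA stream."""

    def __init__(self, *branches):
        super().__init__()
        self.branches = nn.ModuleList(branches)

    def forward(self, *inputs):
        if len(inputs) != len(self.branches):
            raise ValueError("expected one input per branch")
        if not all(t.is_cuda for t in inputs):
            # CPU fallback: no streams, just run the branches in order.
            return tuple(b(t) for b, t in zip(self.branches, inputs))

        device = inputs[0].device
        current = torch.cuda.current_stream(device)
        streams, outputs = [], []
        for branch, t in zip(self.branches, inputs):
            s = torch.cuda.Stream(device=device)
            s.wait_stream(current)  # inputs must be ready first
            with torch.cuda.stream(s):
                outputs.append(branch(t))
            streams.append(s)

        # Synchronize at the end: the caller's stream waits for every branch.
        for s, out in zip(streams, outputs):
            current.wait_stream(s)
            out.record_stream(current)  # output is consumed on `current`
        return tuple(outputs)


# Composing with Sequential, as in the forward() sketch above.
block1 = nn.Sequential(nn.Linear(8, 8), nn.ReLU())
block2 = nn.Sequential(nn.Linear(8, 8), nn.ReLU())
branches = Parallel(block1, block2)
x = torch.randn(8, 8)
x1, x2 = branches(x, x)
y = torch.mm(x1, x2)
```

The caller never names a stream here, which is the point: the model only says which branches are independent.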
Anyway hope this helps somebody.