Yes, it is possible. Just put some of the layers on GPU 0 (.cuda(0)) and the rest on GPU 1 (.cuda(1)). Then, in the forward function, once the part on the first GPU finishes processing, call .cuda(1) on its output before feeding it to the part on the second GPU. Of course this extends to as many GPUs as you want. See an example below.
No. Calling .cuda(i) on a CUDA tensor that’s on GPU j (j != i) is purely a peer-to-peer copy; the host doesn’t have to do anything.
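A minimal sketch of that transfer, assuming a machine with at least two visible GPUs (the tensor shape is arbitrary):

import torch

if torch.cuda.device_count() >= 2:
    x = torch.randn(1024, 1024, device="cuda:0")
    y = x.cuda(1)              # device-to-device copy; the host does not process the data
    # equivalent spelling: y = x.to("cuda:1")
    print(x.device, y.device)  # cuda:0 cuda:1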
import torch.nn as nn

class MyModel(nn.Module):
    def __init__(self, split_gpus):
        super().__init__()
        self.large_submodule1 = ...
        self.large_submodule2 = ...
        self.split_gpus = split_gpus
        if self.split_gpus:
            self.large_submodule1.cuda(0)  # first half lives on GPU 0
            self.large_submodule2.cuda(1)  # second half lives on GPU 1

    def forward(self, x):
        x = self.large_submodule1(x)
        if self.split_gpus:
            x = x.cuda(1)  # P2P GPU transfer
        return self.large_submodule2(x)
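For completeness, a self-contained sketch of the same pattern with small, purely illustrative layers standing in for the large submodules (the class name, layer sizes, and shapes below are made up; assumes two visible GPUs):

import torch
import torch.nn as nn

class SplitNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.part1 = nn.Linear(128, 256).cuda(0)  # first half on GPU 0
        self.part2 = nn.Linear(256, 10).cuda(1)   # second half on GPU 1

    def forward(self, x):
        x = self.part1(x)     # computed on cuda:0
        x = x.cuda(1)         # move the activations to cuda:1
        return self.part2(x)  # computed on cuda:1

model = SplitNet()
out = model(torch.randn(8, 128).cuda(0))  # input starts on the same GPU as part1
print(out.device)  # cuda:1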