Hi! I have a model that is too large to fit inside a single TITAN X (even with 1 batch size). I want to split it over several GPUs such that the memory cost is shared between GPUs. That is, place different parts of the same model on different GPUs and train it end-to-end.
Is this possible in PyTorch? If not, is it possible in Torch?
Would inter-GPU communication (say, for transferring activations to later layers) involve GPU->host->GPU type transfers?
Yes, it's possible. Just put some of the layers on GPU0 (.cuda(0)) and others on GPU1 (.cuda(1)). Then, in the forward function, once the part on the first GPU finishes processing, call .cuda(1) on its output. Of course this can be extended to as many GPUs as you want. See the example below.

No. Calling .cuda(i) on a CUDA tensor that's on GPU j (j != i) is purely a peer-to-peer copy. The host doesn't have to do anything.
```python
import torch.nn as nn

class Network(nn.Module):
    def __init__(self, split_gpus):
        super().__init__()
        self.large_submodule1 = ...
        self.large_submodule2 = ...
        self.split_gpus = split_gpus
        if self.split_gpus:
            self.large_submodule1.cuda(0)
            self.large_submodule2.cuda(1)

    def forward(self, x):
        x = self.large_submodule1(x)
        if self.split_gpus:
            x = x.cuda(1)  # P2P GPU transfer
        return self.large_submodule2(x)
```
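To check that such a split really trains end-to-end, here is a minimal sketch (not from the thread; it assumes at least two GPUs and uses two toy `nn.Linear` layers as stand-ins for the large submodules). Autograd copies gradients back across the device boundary on its own, so a single `.backward()` on the final output is enough:

```python
import torch
import torch.nn as nn

# Sketch, assuming >= 2 GPUs: autograd moves gradients back across the
# device boundary by itself, so the split model trains end-to-end.
if torch.cuda.device_count() >= 2:
    part1 = nn.Linear(16, 16).cuda(0)
    part2 = nn.Linear(16, 4).cuda(1)

    x = torch.randn(8, 16).cuda(0)
    out = part2(part1(x).cuda(1))    # forward crosses GPU0 -> GPU1
    out.sum().backward()             # gradients flow back GPU1 -> GPU0
    print(part1.weight.grad.device)  # grads for part1 live on cuda:0
```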
I notice that when I split the whole model across 4 GPUs and do a forward/backward pass, the first GPU uses much more memory than it should. For example, if the whole model costs 12GB on a single GPU, when split across four GPUs the first GPU uses 11GB and the other three together use about 11GB.
Is there an explanation of how GPU memory gets allocated when using multiple GPUs for model parallelism?
Another question: during the forward pass with model parallelism, only one GPU shows 100% Volatile GPU-Util; the others stay at 0%.
Is there any way to keep all four GPUs utilized?
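One way to narrow down where the memory actually goes is to query the CUDA caching allocator per device (a small diagnostic sketch, assuming a CUDA build of PyTorch; `report_gpu_memory` is a hypothetical helper name):

```python
import torch

# Diagnostic sketch: print what each GPU currently holds, e.g. once
# after the forward pass and again after backward.
def report_gpu_memory():
    for i in range(torch.cuda.device_count()):
        alloc = torch.cuda.memory_allocated(i) / 1024 ** 2
        reserved = torch.cuda.memory_reserved(i) / 1024 ** 2
        print(f'GPU {i}: {alloc:.0f} MiB allocated, {reserved:.0f} MiB reserved')

report_gpu_memory()
```

As for utilization: with a plain layer-wise split, each GPU can only start once the previous one finishes, so at any instant only one device is busy; that serialization, not a configuration error, is the usual reason only one GPU shows 100%.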
I am also interested in late fusion and running particular submodels on different GPUs. Has anyone figured out how to do it? I am currently doing something similar to what @Hengck suggested, but it is not working.
I am trying to implement inter-GPU communication using PyTorch + MPI + GPU.
Below is the code I tested, which is intended to make sure that process 0 runs on GPU0 and process 1 runs on GPU1. However, it does not run successfully. Do you know why?
```python
import platform
import torch
import torch.distributed as dist

def run(rank, size):
    if rank == 0:
        tensor = torch.zeros(1).cuda(0)
        tensor += 1
        # Send the tensor to process 1
        dist.send(tensor=tensor, dst=1)
    else:
        tensor = torch.zeros(1).cuda(1)
        # Receive tensor from process 0
        dist.recv(tensor=tensor, src=0)
    print('Rank ', rank, ' has data ', tensor)

def init_process(fn, backend='mpi'):
    """ Initialize the distributed environment. """
    dist.init_process_group(backend)
    rank = dist.get_rank()
    size = dist.get_world_size()
    print('I am rank ', rank, ' on ', platform.node())
    fn(rank, size)

if __name__ == '__main__':
    init_process(run)  # launch with: mpirun -n 2 python this_script.py
```
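Before debugging the MPI+GPU setup, it can help to verify the send/recv logic in isolation. Here is a self-contained CPU-only sketch (my own, not from the thread) that uses the `gloo` backend and spawns the two ranks itself, so it runs on any machine without `mpirun` or GPUs; the address/port values are arbitrary placeholders:

```python
import os
import torch
import torch.distributed as dist
from torch.multiprocessing import Process

def run(rank, size):
    # CPU tensors + gloo backend, so this runs without any GPUs
    tensor = torch.zeros(1)
    if rank == 0:
        tensor += 1
        dist.send(tensor=tensor, dst=1)  # blocking send to rank 1
    else:
        dist.recv(tensor=tensor, src=0)  # blocks until rank 0 sends
    print('Rank', rank, 'has data', tensor)

def init_process(rank, size, fn, backend='gloo'):
    os.environ['MASTER_ADDR'] = '127.0.0.1'  # placeholder rendezvous address
    os.environ['MASTER_PORT'] = '29501'      # placeholder port
    dist.init_process_group(backend, rank=rank, world_size=size)
    fn(rank, size)

if __name__ == '__main__':
    world_size = 2
    processes = []
    for rank in range(world_size):
        p = Process(target=init_process, args=(rank, world_size, run))
        p.start()
        processes.append(p)
    for p in processes:
        p.join()
```

Once this prints the received tensor on rank 1, switching the backend to `'mpi'` and the tensors to `.cuda(rank)` isolates the MPI/GPU part of the problem.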
Would this also work on a single GPU with two sequential steps somehow? If my model is too large to fit on one GPU, can I do the forward/backward pass sequentially, where I only keep one part in GPU memory and cache the other part for the backward pass later?
Something like this:
```python
x = submodule1(x)
# somehow unload the intermediate results of submodule1 from the GPU here
# and cache them for the later backward pass
# (then load them onto the GPU again when needed in submodule1's backward)
x = submodule2(x)
```
I could imagine how this works, but I don't know how I would pass the gradients coming from submodule2 back to submodule1 and initiate the backward pass on submodule1.
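One way to get part of this effect (a sketch, not from the thread) is `torch.utils.checkpoint`: it drops a submodule's intermediate activations after forward and recomputes them during backward, and autograd handles passing the gradients from submodule2 back into submodule1 automatically. A toy CPU example, with two small `nn.Sequential` stacks standing in for the large submodules:

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

# Hypothetical small submodules standing in for the large ones.
submodule1 = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 8))
submodule2 = nn.Sequential(nn.Linear(8, 4), nn.ReLU(), nn.Linear(4, 1))

x = torch.randn(2, 8, requires_grad=True)

# checkpoint() does not store submodule1's intermediate activations;
# it re-runs submodule1's forward during backward to recompute them.
h = checkpoint(submodule1, x)
loss = submodule2(h).sum()
loss.backward()  # autograd chains the backward through both submodules

print(x.grad.shape)  # gradients reached the input through the checkpoint
```

Note this trades compute for memory (the checkpointed forward runs twice) and the parameters of both submodules still live on the GPU; it saves activation memory, not weight memory.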