I am trying to build a network in PyTorch with a very large fully connected top layer, on the order of 80,000 inputs and 15,000 outputs (more layers follow it).
The top layer alone requires too much CUDA memory to fit on one GPU during training (even with batch size 1).
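For a rough sense of scale, here is a back-of-the-envelope estimate. It assumes float32 parameters and counts only weights, gradients and optimizer buffers, not activations or CUDA overhead. The helper name `linear_memory_gb` and the Adam case are just my own illustration.

```python
def linear_memory_gb(size_in, size_out, bytes_per_param=4, optimizer_buffers=0):
    """Rough training memory (in GB) for one fully connected layer.

    Counts one copy for the weights, one for the gradients, and
    `optimizer_buffers` extra copies for optimizer state.
    Activations and framework overhead are ignored.
    """
    n_params = size_in * size_out + size_out  # weight matrix plus bias
    copies = 2 + optimizer_buffers            # weights + gradients + optimizer state
    return n_params * bytes_per_param * copies / 1e9


# Plain SGD keeps no extra per-parameter state (assumption: no momentum).
sgd_gb = linear_memory_gb(80000, 15000)
# Adam keeps two extra buffers per parameter (first and second moments).
adam_gb = linear_memory_gb(80000, 15000, optimizer_buffers=2)

print(f"weights + gradients: {sgd_gb:.1f} GB")
print(f"with Adam state:     {adam_gb:.1f} GB")
```

So the 1.2 billion weights alone are about 4.8 GB in float32, and this cost does not depend on the batch size.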
PyTorch's DataParallel still replicates the whole model on every GPU (as far as I know) and gathers the outputs on one device at the end, so it doesn't help with the memory issue.
Other posts discuss splitting a large model onto several GPUs (e.g. Split single model in multiple gpus). From what I understand, using something like (from linked post):
    self.large_submodule1.cuda(0)
    self.large_submodule2.cuda(1)
    ...
only seems to be relevant when you can split whole modules, e.g. the first whole FC layer, the second FC layer, etc. I don't know whether, or how, it can be used to split a single module, here one FC layer. In the MWE below, that would mean splitting fc across multiple GPUs.
    import torch.nn as nn

    class network(nn.Module):
        def __init__(self, sizeIn, sizeOut):
            super(network, self).__init__()
            # Linear FC layers
            self.fc = nn.Sequential(
                nn.Linear(sizeIn, sizeOut),
                nn.Tanh(),
            )

        def forward(self, x):
            x = self.fc(x)
            return x
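To make the question concrete, the kind of split I have in mind might look roughly like the sketch below. It is not taken from the linked post. `ShardedLinear` is a hypothetical helper name, and it assumes the layer can be cut along the output dimension into one smaller `nn.Linear` per device, with the partial outputs concatenated on the first device. The sizes are tiny and it falls back to CPU so that it runs anywhere.

```python
import torch
import torch.nn as nn


class ShardedLinear(nn.Module):
    """Hypothetical helper: one logical Linear layer split by output features."""

    def __init__(self, size_in, size_out, devices):
        super().__init__()
        self.devices = [torch.device(d) for d in devices]
        # Divide the output features as evenly as possible across the devices.
        base, extra = divmod(size_out, len(self.devices))
        chunk_sizes = [base + (1 if i < extra else 0) for i in range(len(self.devices))]
        # Each shard holds only its own slice of the weight matrix and bias.
        self.shards = nn.ModuleList(
            [nn.Linear(size_in, n).to(dev) for n, dev in zip(chunk_sizes, self.devices)]
        )

    def forward(self, x):
        # Every shard needs the full input, copied to its own device.
        outputs = [shard(x.to(dev)) for shard, dev in zip(self.shards, self.devices)]
        # Move the partial outputs to the first device and join the feature axis.
        return torch.cat([out.to(self.devices[0]) for out in outputs], dim=-1)


# Use two GPUs if present; otherwise fall back to CPU so the sketch still runs.
if torch.cuda.device_count() >= 2:
    devices = ["cuda:0", "cuda:1"]
else:
    devices = ["cpu", "cpu"]

# Small stand-in sizes; the real layer would be 80000 -> 15000.
layer = nn.Sequential(ShardedLinear(80, 15, devices), nn.Tanh())
out = layer(torch.randn(4, 80))
out.sum().backward()  # gradients flow back to each shard on its own device
print(out.shape)
```

With a split like this, each device would hold only its share of the weights, gradients and optimizer state, while the input has to be copied to every device and any later layers would need to live on the first device.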
Any tips are much appreciated! Thanks.