I am facing a very strange problem with torch.nn.DataParallel(). I have a system with 8 GPUs and want to use multiple GPUs for training my model. When I wrap the model with nn.DataParallel, it works only for batch_size 10! This is very odd, because for any other batch size (even smaller ones) the execution just gets stuck. When I am not using parallelism and run on a single GPU, it works properly, but for batch sizes above 16 CUDA runs out of memory because my input vectors are very large and the model is very big. So I am unable to take advantage of multiple GPUs. Any solution out there? Thank you in advance…
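For context, the setup is essentially this (a simplified sketch with a small stand-in model and dummy input, not my actual code):

```python
import torch
import torch.nn as nn

# Stand-in for the real 12-layer encoder model (dimensions are illustrative)
encoder_layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)
model = nn.TransformerEncoder(encoder_layer, num_layers=12)

model = nn.DataParallel(model)  # replicates the model on every visible GPU
model = model.cuda()

batch_size = 10  # only this value runs for me; other values hang
inputs = torch.randn(batch_size, 128, 512).cuda()  # (batch, seq_len, d_model)

outputs = model(inputs)  # the batch is split along dim 0 across the GPUs
loss = outputs.sum()
loss.backward()
```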
Batch size 10 is odd indeed. I would have expected it to only work with a multiple of 8, if you’re using 8 devices. What kind of model are you trying to parallelize?
A transformer model with 12 encoder blocks.
If the per-example inputs vary in size, the shards sent to each device can be asymmetric, so some devices may be able to handle 2 examples while others can't. There's not much that can be done about this, save for memory profiling to prove that this is what's happening.
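Something along these lines should be enough to check it (just a sketch; run it right after one forward/backward step):

```python
import torch

# Peak allocated memory per GPU since the last reset; large differences
# between devices suggest unevenly sized shards of the batch.
for i in range(torch.cuda.device_count()):
    peak_mib = torch.cuda.max_memory_allocated(i) / 1024**2
    print(f"cuda:{i} peak allocated: {peak_mib:.0f} MiB")

# Reset the counters before measuring the next step
for i in range(torch.cuda.device_count()):
    torch.cuda.reset_peak_memory_stats(i)
```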