The tensors might not be torch.cuda.FloatTensor at first, so to be sure I call .cuda() on them. Why does this cause issues with DataParallel?
Is it better if I convert the tensors like this?
device = torch.device('cuda')
input = torch.tensor(input, device=device)
Oh, I will give it a try and see if that works for me. Thanks!
The data should be pushed onto the same GPU that your nn.DataParallel model was pushed to. However, this is usually done before feeding the data into the model, since DataParallel will scatter the data onto each specified GPU.
Currently you are using .cuda() inside your loss() method (which seems to be similar to the
Could you remove this .cuda() call and use it outside of your model?
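To make the scatter step concrete, here is a minimal sketch, assuming a batch of 30 samples and three GPUs (both numbers are illustrative). DataParallel splits its input along dim 0 in roughly the way torch.chunk does, so the example uses torch.chunk on the CPU as a stand-in for the real scatter and needs no GPU:

```python
import torch

# Hypothetical sizes: 30 samples with 5 features each, split over 3 devices.
batch = torch.randn(30, 5)
num_devices = 3

# torch.chunk is used here only as a CPU stand-in for DataParallel's scatter,
# which also splits along dim 0 into one piece per device.
chunks = batch.chunk(num_devices, dim=0)
chunk_shapes = [tuple(c.shape) for c in chunks]
print(chunk_shapes)  # [(10, 5), (10, 5), (10, 5)]
```

This is why the whole batch is first placed on one device outside the model: the wrapper takes care of handing each replica its own slice.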
Ok, I have a function in the model class which splits the batch into inputs and labels, then converts them to cuda tensors.
def loss(self, batch, alpha):
    input, labels = collate(*batch)  # Previously called like this
    loss1 = self.b1_forward(input)
    loss2 = self.b2_forward(input, alpha)
    return loss1, loss2

def collate(self, inputs, labels):
    # zero pad concatenate the inputs in the batch
    inputs = torch.tensor(inputs, device=device)
    labels = torch.tensor(labels, device=device)
    return inputs, labels
Now I am calling it like this:
model = nn.DataParallel(model)
model = model.cuda()
x, y = model.module.collate(*batch)
loss1, loss2 = model.module.loss(x, y, alpha)
I changed the signature of the loss function to accept already collated input pairs, but I still get an out-of-memory error when GPU0 fills up.
The second GPU is not being used.
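One likely reason the second GPU stays idle: model.module is the original, unwrapped model, so calling model.module.loss(...) runs on a single device and skips the scatter that nn.DataParallel performs only when the wrapper itself is called. Below is a minimal sketch of the usual workaround, assuming the loss computation can be moved into forward(). The class TwoBranchModel, its layers, and the squared-error losses are hypothetical stand-ins for the model in the post:

```python
import torch
import torch.nn as nn

class TwoBranchModel(nn.Module):
    """Hypothetical stand-in for a model with two loss branches."""

    def __init__(self, in_features=8):
        super().__init__()
        self.b1 = nn.Linear(in_features, 1)
        self.b2 = nn.Linear(in_features, 1)

    def forward(self, inputs, labels, alpha):
        # Computing the losses inside forward() lets DataParallel run this
        # code on every replica with its own chunk of inputs and labels.
        loss1 = ((self.b1(inputs).squeeze(1) - labels) ** 2).mean()
        loss2 = alpha * ((self.b2(inputs).squeeze(1) - labels) ** 2).mean()
        # Return shape (1,) tensors so per-replica results can be gathered.
        return loss1.unsqueeze(0), loss2.unsqueeze(0)

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model = nn.DataParallel(TwoBranchModel()).to(device)

# Build the batch outside the model and place it on the wrapper's device.
inputs = torch.randn(16, 8, device=device)
labels = torch.randn(16, device=device)

# Call the wrapper itself, not model.module, so the batch is scattered.
loss1, loss2 = model(inputs, labels, alpha=0.5)
total = loss1.mean() + loss2.mean()  # average the per-replica losses
total.backward()
```

On a machine without multiple GPUs the wrapper simply calls the inner module, so the same code also runs on a single GPU or on the CPU.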
Thanks @ptrblck for taking the time to answer this. It is the first Google result and really helpful.
I am also having some issues understanding a few details here.
- Is torch.device('cuda') == torch.device('cuda:0')?
- Assuming you are using an iterator (i.e. from torchtext.data import Iterator), should you specify in the iterator that the device is cuda (ie.
- When you move your data to the GPU before sending it to the model (which will use multiple GPUs) with input.to(torch.device("cuda:0")), aren’t you overloading the first GPU? By overloading I mean the first GPU using more memory than the others, which reduces how big the batch size could be compared to sending each chunk of the batch directly to its own GPU.
- The documentation recommends using Multi-Process Single-GPU instead of nn.DataParallel for better performance. However, there isn’t any example of how to do it. Could you show me how you would do it on the simple example I am adding at the end?
- Should the learning rate be adapted to the number of GPUs as well?
This is the highly recommended way to use DistributedDataParallel, with multiple processes, each of which operates on a single GPU. This is currently the fastest approach to do data parallel training using PyTorch and applies to both single-node(multi-GPU) and multi-node data parallel training. It is proven to be significantly faster than torch.nn.DataParallel for single-node multi-GPU data parallel training.
Here is the example, I’d love if someone could refactor it for question 5 and I think it might help a few people:
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader

# Parameters and DataLoaders
input_size = 5
output_size = 2
batch_size = 30
data_size = 100

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

class RandomDataset(Dataset):
    def __init__(self, size, length):
        self.len = length
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len

rand_loader = DataLoader(dataset=RandomDataset(input_size, data_size),
                         batch_size=batch_size, shuffle=True)

# Our model
class Model(nn.Module):
    def __init__(self, input_size, output_size):
        super().__init__()
        self.fc = nn.Linear(input_size, output_size)

    def forward(self, input):
        output = self.fc(input)
        print("\tIn Model: input size", input.size(),
              "output size", output.size())
        return output

model = Model(input_size, output_size)
if torch.cuda.device_count() > 1:
    print("Let's use", torch.cuda.device_count(), "GPUs!")
    # dim = 0 [30, xxx] -> [10, ...], [10, ...], [10, ...] on 3 GPUs
    model = nn.DataParallel(model)
model.to(device)

for data in rand_loader:
    input = data.to(device)
    # It seems to me that we are only pushing the data to the first cuda (cuda:0). How does this run in multigpu?
    # I am guessing it works but I find this really not intuitive since you push the data to one gpu to get it trained on all gpu?
    output = model(input)
    print("Outside: input size", input.size(),
          "output_size", output.size())
@ptrblck is it an absolute requirement to have num_workers=0 for multi-GPU training?
No, it’s not a requirement. Do you see any issues using multiple workers?
It is probably not the source of my problem. Thanks for the quick reply. I’ll post a code snippet here if I don’t solve this in the next hour.
@ptrblck from what I understand as of now, after some trial and error and reading this quote:
Data Parallelism is when we split the mini-batch of samples into multiple smaller mini-batches and run the computation for each of the smaller mini-batches in parallel.
When you wrap your model in nn.DataParallel, the big idea is that you can increase your batch size without increasing your training time per batch. Say you have one GPU training a batch size of 16, it will approximately take the same time for 8 similar GPUs to train a batch size of 128 (16*8).
Is that line of reasoning correct?
It also seems that the number of DataLoader workers can affect the data-loading bottleneck, and thus the training time. When I was using 20 workers on a machine with 20 CPUs and 8 V100s on GCP/Paperspace, training was slower (but I can’t tell the exact reason). Once I reduced the workers to 15, the training time per epoch dropped by 4x.
That would be the ideal linear scaling you could achieve, thus reducing the epoch time by number of GPUs.
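As a back-of-the-envelope illustration of that ideal case, here is a small sketch assuming the time per step stays constant while each GPU keeps its batch of 16. The dataset size of 12,800 samples and the step time of 0.5 s are made-up numbers:

```python
import math

dataset_size = 12_800   # hypothetical number of training samples
per_gpu_batch = 16      # batch size that fits on one GPU
step_time_s = 0.5       # assumed constant time per training step


def epoch_time(num_gpus):
    # With data parallelism the global batch grows with the number of GPUs,
    # so fewer steps are needed to cover the dataset once.
    global_batch = per_gpu_batch * num_gpus
    steps = math.ceil(dataset_size / global_batch)
    return steps * step_time_s


speedup = epoch_time(1) / epoch_time(8)
print(epoch_time(1), epoch_time(8), speedup)  # 400.0 50.0 8.0
```

In practice, communication and data-loading overhead usually keep the measured speedup below this ideal figure.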
Too many CPU workers might slow down the data loading. I’m not an expert on this topic, but always refer to @rwightman’s post.
I am working on video recognition, and each of my batches has a shape of roughly (150, 3, 224, 224). I have 4 GPUs; if I use DataParallel, it will split the batch. How can I solve the problem when a single batch is too big?
If you need this batch size, you could try to trade compute for memory using checkpoint.
I haven’t tried it with nn.DataParallel yet, but it should work.
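A minimal sketch of what that could look like, assuming "checkpoint" refers to torch.utils.checkpoint.checkpoint. The small convolutional block, the tensor sizes, and the use_reentrant=False argument (available in recent PyTorch releases) are illustrative choices, not taken from the posts above:

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint


class CheckpointedNet(nn.Module):
    """Hypothetical model that recomputes one block during backward."""

    def __init__(self):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(3, 8, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(8, 8, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        self.head = nn.Linear(8, 2)

    def forward(self, x):
        # Activations inside self.block are not stored; they are recomputed
        # in the backward pass, trading extra compute for lower memory use.
        x = checkpoint(self.block, x, use_reentrant=False)
        return self.head(x.mean(dim=(2, 3)))


model = CheckpointedNet()
frames = torch.randn(4, 3, 32, 32)  # small stand-in for a (150, 3, 224, 224) batch
out = model(frames)
out.sum().backward()
print(out.shape)  # torch.Size([4, 2])
```

The memory saved depends on how much of the network sits inside checkpointed blocks, so it is worth measuring on the real model.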
Thank you for your nice answers, but I still have a problem when using multiple GPUs with PyTorch.
I get very imbalanced GPU memory usage. When I want to use a larger batch_size, I get an “OUT OF MEMORY” error.
And I am very sure my code is right (I followed the instructions of the PyTorch tutorial for multiple GPUs).
What can I do to fully utilize the GPU memory?
The usage seems to be way too imbalanced for a typical nn.DataParallel use case.
In my previous post I mentioned the blog post in point 4, which explains the imbalance in memory usage; however, in your current setup it looks like device1-3 are also creating the CUDA context.
Are you seeing any usage in the GPU-Util section of
I’m having exactly the same issue. I’m trying to parallelize across 2 GPUs, but only one shows high memory usage (say 23000MiB) while the other shows 11MiB (basically nothing).
I’m also correctly implementing nn.DataParallel(model) from the tutorial.
Were you able to find a workaround for this?
When I use DataParallel to run my model on two GPUs, my model changes: its children structure is different. With a single GPU I could see two children, but after using DataParallel I see only one child of the model.
Can someone please clarify this?
nn.DataParallel wraps the model into model.module. Could this explain the observed change?
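A minimal sketch of that wrapping, using a hypothetical two-part model (the submodule names features and classifier are chosen only to resemble a vgg16-style layout):

```python
import torch.nn as nn

# Hypothetical two-part model standing in for something like vgg16.
model = nn.Sequential()
model.add_module("features", nn.Linear(4, 4))
model.add_module("classifier", nn.Linear(4, 2))

wrapped = nn.DataParallel(model)

plain_children = [name for name, _ in model.named_children()]
wrapped_children = [name for name, _ in wrapped.named_children()]
inner_children = [name for name, _ in wrapped.module.named_children()]
print(plain_children)    # ['features', 'classifier']
print(wrapped_children)  # ['module']
print(inner_children)    # ['features', 'classifier']

# Reaching a submodule, for example to freeze it, goes through .module.
for param in wrapped.module.features.parameters():
    param.requires_grad = False
```

The wrapper has a single child named module, and the original children are still there one level down.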
I tried to change the number of frozen layers of a vgg16 model. With one GPU, I could see that the model has two children, and I could fine-tune only certain layers. But with nn.DataParallel, I don't see the same structure and could not fine-tune some layers. Please let me know the solution, if any.
Could you post some code showing, how you are freezing the layers and what doesn’t work in