Hey guys,
is it generally possible to use the nn.DataParallel wrapper if you have two different GPUs?
Thanks, Tobi
Yes, that’s possible.
However, you will get a warning if there is an imbalance in GPU memory (one device has less memory than the other).
Also, your performance will be limited by the slowest GPU you are using, so it might not be recommended if your GPUs have very different performance profiles.
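To see how mismatched the cards actually are before wrapping the model, you can list what PyTorch sees. This is only a minimal sketch; `describe_gpus` is an illustrative helper name, not part of PyTorch, and it simply returns an empty list if no CUDA device is visible:

```python
import torch

def describe_gpus():
    """Return (index, name, total memory in GiB) for every visible CUDA device."""
    gpus = []
    for i in range(torch.cuda.device_count()):  # 0 if CUDA is unavailable
        props = torch.cuda.get_device_properties(i)
        gpus.append((i, props.name, props.total_memory / 1024**3))
    return gpus

for idx, name, mem_gib in describe_gpus():
    print(f"cuda:{idx}: {name}, {mem_gib:.1f} GiB")
```

The reported memory tells you which card will run out first; it says nothing about relative speed, which you would still have to measure on your own workload.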
Thanks for the reply.
I have another question: it seems that the two GPUs get the same distribution of the data. If the dataset is too large, I get a CUDA out-of-memory error, but for a small dataset this is no problem. Do you have any idea how to fix this?
Best, Tobi
Have a look at this blog post to see how nn.DataParallel works internally and how to counter some effects of an imbalanced memory usage.
PS: you could also try out nn.DistributedDataParallel, which shouldn’t introduce the imbalance.
Apologies for the old thread resurrection here.
If I have a single machine with 2 mismatched GPUs (1 fast with big memory, 1 slow with small memory), DataParallel will only go as fast as the slow GPU with less memory, making it not worth using - as was covered above.
Does DistributedDataParallel on a single machine with 2 mismatched GPUs make sense? Or am I better off simply using the single fast GPU?
I’m trying to understand the usages of DistributedDataParallel (SPSD, SPMD, etc.)
I’ve been trying to read this and getting a bit confused: [POLL][RFC] Can we retire Single-Process Multi-Device Mode from DistributedDataParallel? · Issue #47012 · pytorch/pytorch · GitHub
Both parallel approaches would suffer from the slow GPU, so you would have to check the actual performance using the fast GPU only vs. DistributedDataParallel.
I’ve got a similar issue, but in my case it’s not so much the speed I’m worried about.
I’ve got a machine with a single 3090 and would ideally purchase another, but they seem hard to come by since the release of the 40xx series.
If I buy a 4090 and put it in the same machine, am I right that I’ll be limited by the speed of the 3090, but will be able to take advantage of the extra memory capacity from having two GPUs?
This could be the case as the slowest part of your pipeline would create the bottleneck.
It’s unclear if this would be the 3090 (or e.g. the data loading) and depends on your actual use case.
Sorry I should have been clearer. What I meant was, can I assume that if I’m using a 3090 and 4090 in the same machine, I won’t get the speed of the 4090 because I’m using it in conjunction with a slower GPU, but I will be able to take advantage of having the additional GPU memory. So I can use larger models, batch sizes, etc.
Yes, as mentioned before, you might not get the full performance of the 4090, as it would have to wait for the slowest part of your entire training pipeline. That could be the 3090, but it could also be any other part, such as the data loading, and even your 3090 alone could already be hitting that bottleneck. I would recommend profiling the workload to see how the overall training behaves.
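As a rough first check before reaching for a full profiler, you could split the wall-clock time of each iteration into "waiting for data" and "compute". The helper below is a hypothetical sketch (`time_pipeline`, `train_step`, and `sync` are names made up for illustration); since CUDA kernels run asynchronously, pass something like `torch.cuda.synchronize` as `sync` when timing GPU work:

```python
import time

def time_pipeline(batches, train_step, sync=None):
    """Split wall-clock time into 'waiting for data' and 'compute'.

    batches:    any iterable yielding batches (e.g. a DataLoader)
    train_step: callable running forward/backward/step on one batch
    sync:       optional callable (e.g. torch.cuda.synchronize) that blocks
                until pending GPU work is finished before the clock is read
    """
    data_time = compute_time = 0.0
    it = iter(batches)
    while True:
        t0 = time.perf_counter()
        try:
            batch = next(it)  # time spent here is data loading
        except StopIteration:
            break
        t1 = time.perf_counter()
        train_step(batch)
        if sync is not None:
            sync()
        t2 = time.perf_counter()
        data_time += t1 - t0
        compute_time += t2 - t1
    return data_time, compute_time
```

If the data time dominates, a faster second GPU would not help much, and neither would rebalancing the batches between the two cards.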
@ptrblck Would it make sense to distribute batches in proportions corresponding to the performance of each GPU? For example, we could run 24 samples on the 4090 and 8 on the 3090, or find another optimum distribution that would result in a speedup.
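To make the proportional split concrete, here is a small sketch in plain Python. The throughput numbers are made up for illustration (you would measure samples per second on your own workload), and `split_batch` and `step_time` are hypothetical helper names. It ignores memory limits and communication cost and only models the fact that a synchronized step takes as long as the slowest rank:

```python
def split_batch(global_batch_size, throughputs):
    """Split a global batch across GPUs in proportion to measured throughput."""
    total = sum(throughputs)
    sizes = [int(global_batch_size * t / total) for t in throughputs]
    # hand out samples lost to rounding down, fastest GPU first
    leftover = global_batch_size - sum(sizes)
    order = sorted(range(len(sizes)), key=lambda i: throughputs[i], reverse=True)
    for i in order[:leftover]:
        sizes[i] += 1
    return sizes

def step_time(sizes, throughputs):
    """Time of one synchronized step: every rank waits for the slowest one."""
    return max(s / t for s, t in zip(sizes, throughputs))

throughputs = [300.0, 100.0]  # hypothetical samples/s for the fast and slow GPU
even = [16, 16]
proportional = split_batch(32, throughputs)
print(proportional)  # [24, 8]
print(step_time(even, throughputs), step_time(proportional, throughputs))
```

With these assumed numbers the even split is bound by the slow GPU, while the 24/8 split lets both ranks finish at about the same time.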
Yes, this could be a valid idea to experiment with. However, note that DDP divides the gradients by the world_size, as seen here, so you might need to scale the gradients manually since your batch size differs on each rank.
Here is a simple example showing the different outputs using balanced vs. imbalanced batches:
import torch

# balanced DDP workload
world_size = 2
batch_size = 16
global_batch_size = world_size * batch_size
g1 = torch.arange(batch_size).float()
g2 = torch.arange(batch_size, batch_size*2).float()
g_allreduced = (g1.sum() / batch_size + g2.sum() / batch_size) / world_size
print(g_allreduced)
# tensor(15.5000)
print(g_allreduced == torch.cat((g1, g2)).sum() / global_batch_size)
# tensor(True)
# imbalanced DDP workload
batch_size1 = 24
batch_size2 = 8
g1 = torch.arange(batch_size1).float()
g2 = torch.arange(batch_size1, batch_size1+batch_size2).float()
g_allreduced = (g1.sum() / batch_size1 + g2.sum() / batch_size2) / world_size
print(g_allreduced)
# tensor(19.5000)
# scaling the imbalanced DDP workload
g_allreduced = (g1.sum() / batch_size1 * batch_size1/batch_size + g2.sum() / batch_size2 * batch_size2/batch_size) / world_size
print(g_allreduced)
# tensor(15.5000)
You could also use other approaches to create the right average gradients e.g. by using reduction="sum" in the criterion and then normalizing afterwards.
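A minimal CPU-only sketch of the `reduction="sum"` idea, simulating the two ranks as slices of one batch rather than running real DDP. The model, data, and 24/8 split are assumptions for illustration; the division by `world_size` stands in for DDP's gradient averaging:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(4, 1)
criterion = nn.MSELoss(reduction="sum")  # sum, not mean, over the local batch

x, y = torch.randn(32, 4), torch.randn(32, 1)
global_batch_size = x.size(0)

# reference: gradient of the mean loss over the whole global batch
model.zero_grad()
(criterion(model(x), y) / global_batch_size).backward()
ref_grad = model.weight.grad.clone()

# two simulated "ranks" with 24 and 8 samples
grads = []
for xs, ys in ((x[:24], y[:24]), (x[24:], y[24:])):
    model.zero_grad()
    criterion(model(xs), ys).backward()
    grads.append(model.weight.grad.clone())

world_size = 2
ddp_grad = sum(grads) / world_size  # stand-in for DDP's averaging
fixed_grad = ddp_grad * world_size / global_batch_size  # manual normalization

print(torch.allclose(fixed_grad, ref_grad, atol=1e-5))
```

In an actual DDP run, the same correction could probably be applied by multiplying each rank's summed loss by `world_size / global_batch_size` before calling `backward()`, so that the averaged gradients already match the global mean.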
I have a specific research case where I’m trying to manually delegate tasks to an Intel A770 and an RTX 3090. I can’t seem to create an environment with both XPU and CUDA support using pip. Is there a path to this if I build from source?
I haven’t heard of any builds supporting multiple different accelerators and thus don’t believe it’s possible at the moment.