I have a server equipped with three Quadro P2000s and one GTX 1050. The Quadros have 5 GB of video memory each and the GTX has 4 GB. I use DataParallel for multi-GPU training, but I noticed that the usable batch size is effectively limited by the 4 GB card: when I try to increase it, I get an OOM on the GTX.
For example, with a batch size of 50 the GTX memory is full, but each Quadro uses only about 4 GB of its 5 GB, so roughly 3 GB of GPU RAM stays unused in total. When I increase the batch size to 56 I get an OOM because the GTX runs out of memory.
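For reference, per-GPU capacity and current usage can be checked with something like the following (a rough diagnostic sketch using standard torch.cuda calls, run from the training process):

import torch

# list total capacity and currently allocated memory for every visible GPU
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    total_gib = props.total_memory / 1024**3
    alloc_gib = torch.cuda.memory_allocated(i) / 1024**3
    print(f"cuda:{i} {props.name}: {alloc_gib:.2f} / {total_gib:.2f} GiB")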
So the question is: what is the correct way to use all of the available GPU memory?
I use this code to parallelize training:
import torch

net = model()  # model() builds the network
if torch.cuda.device_count() > 1:
    # replicate the model and split each input batch across all visible GPUs
    net = torch.nn.DataParallel(net, device_ids=list(range(torch.cuda.device_count())))
net.to('cuda')
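As far as I know, DataParallel always splits each input batch into (roughly) equal chunks across device_ids, so the 4 GB card caps every GPU's share. Uneven splits are possible at the tensor level through torch.cuda.comm.scatter and its chunk_sizes argument, but DataParallel does not expose that option, so using it would mean writing a custom wrapper around the model. A minimal sketch of the uneven split itself, where the device order and the 15/15/15/11 split are assumptions for illustration:

import torch

# split a batch of 56 samples unevenly across four GPUs,
# giving the 4 GB card (assumed to be device 3 here) the smallest chunk
batch = torch.randn(56, 3, 224, 224)
chunks = torch.cuda.comm.scatter(batch, devices=[0, 1, 2, 3], chunk_sizes=[15, 15, 15, 11])
for chunk in chunks:
    print(chunk.device, chunk.shape[0])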
However, you may want to check whether using the 1050 is beneficial from a load-balancing perspective in the first place, since it may bottleneck training compared to the P2000s even after accounting for the difference in available memory.
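If the 1050 turns out to hurt more than it helps, restricting DataParallel to the P2000s is straightforward. The sketch below assumes the P2000s are devices 0-2, which should be verified first (for example with the memory listing above or nvidia-smi):

import torch

p2000_ids = [0, 1, 2]  # assumed indices of the three P2000s; adjust to the actual ordering
net = torch.nn.DataParallel(model(), device_ids=p2000_ids)
# DataParallel expects the module's parameters on the first device in device_ids
net.to(f'cuda:{p2000_ids[0]}')

Setting CUDA_VISIBLE_DEVICES=0,1,2 before launching the script achieves the same thing without code changes.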