Uneven GPU utilization during training backpropagation

I actually just found that someone implemented a solution to this problem. It sounds like exactly what you were looking for, but I haven't tried it myself yet and am not sure how well it works; in theory, though, it addresses the problem of unbalanced loads in DataParallel. They describe it in the context of semantic segmentation, but I assume it should generalize to other objective functions:

https://hangzhang.org/PyTorch-Encoding/parallel.html
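Going by the examples on that page (again, untested on my end), the idea is to keep the per-GPU outputs scattered instead of gathering them all on GPU 0, and to compute the loss in parallel on each GPU. Usage looks roughly like the sketch below; the toy model, device IDs, and tensor shapes are just placeholders:

```python
import torch
import torch.nn as nn
from encoding.parallel import DataParallelModel, DataParallelCriterion

model = nn.Linear(512, 10)         # stand-in for your actual network
criterion = nn.CrossEntropyLoss()  # or whatever objective you use

# DataParallelModel leaves outputs on their respective GPUs instead of
# gathering them on device 0; DataParallelCriterion then evaluates the
# loss on each GPU in parallel, so GPU 0 isn't the bottleneck anymore.
model = DataParallelModel(model.cuda(), device_ids=[0, 1, 2])
criterion = DataParallelCriterion(criterion, device_ids=[0, 1, 2])

inputs = torch.randn(96, 512).cuda()
targets = torch.randint(0, 10, (96,)).cuda()

outputs = model(inputs)             # list of per-GPU outputs, not gathered
loss = criterion(outputs, targets)  # loss computed in parallel per GPU
loss.backward()
```

So it's meant as a near drop-in replacement for `nn.DataParallel`: you wrap the model and the criterion separately rather than wrapping only the model.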