Training on Multiple GPUs Runs Out of Memory

This is commonly observed in multi-GPU setups, because some kind of aggregation has to be performed on one selected GPU (by default, cuda:0). See this answer, which explains the problem in a bit more detail, and check the answer just after it for a possible solution (i.e. split the loss computation across multiple GPUs as well, not just the network).
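
A minimal sketch of that idea, assuming PyTorch's `nn.DataParallel`: wrap the model and the criterion together so each replica returns only its scalar loss, instead of gathering all full-size outputs onto cuda:0. The `ModelWithLoss` wrapper and the module/shapes below are illustrative, not taken from the linked answers.

```python
import torch
import torch.nn as nn

class ModelWithLoss(nn.Module):
    """Hypothetical wrapper: computing the loss inside forward() means each
    GPU replica returns only a (near-)scalar loss, so the gather step on the
    default device stays small compared to gathering full output tensors."""

    def __init__(self, model, criterion):
        super().__init__()
        self.model = model
        self.criterion = criterion

    def forward(self, inputs, targets):
        outputs = self.model(inputs)
        # Per-replica loss; DataParallel gathers these on cuda:0.
        return self.criterion(outputs, targets)

# Illustrative model and data (assumed shapes).
model = nn.Linear(512, 10)
criterion = nn.CrossEntropyLoss()
wrapped = nn.DataParallel(ModelWithLoss(model, criterion)).cuda()

inputs = torch.randn(64, 512).cuda()
targets = torch.randint(0, 10, (64,)).cuda()

# Average the per-GPU losses after the gather, then backprop as usual.
loss = wrapped(inputs, targets).mean()
loss.backward()
```

The `.mean()` call is needed because the gathered result is one loss value per replica; averaging them reproduces the usual batch-level loss before calling `backward()`.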