Model parallelism in Multi-GPUs: forward/backward graph

I just realized that my answer was written in a different context … regarding your question, I thought you were referring to "Uneven GPU utilization during training backpropagation", so my answer may be a bit out of context given the problem discussed in this thread :stuck_out_tongue: