Backpropagation on separate GPUs

My backpropagation step is very expensive in both GPU memory and compute. Data loading is not the bottleneck for my network architecture. How do I make each GPU compute the loss for its own portion of the batch and then accumulate the gradients on the default GPU, rather than computing the loss on the default GPU alone?
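
For concreteness, here is a minimal sketch of one commonly suggested pattern. It assumes the setup is PyTorch with `torch.nn.DataParallel`, which the post does not actually state. The idea is to move the loss into the module's `forward`, so each replica computes the loss on its own GPU and only the small loss tensors are gathered on the default device. `ModelWithLoss`, `train_step`, the `nn.Linear` model, and the tensor shapes are all hypothetical placeholders, not part of the original setup.

```python
import torch
import torch.nn as nn


class ModelWithLoss(nn.Module):
    """Hypothetical wrapper: runs the model and the loss in one forward pass."""

    def __init__(self, model, criterion):
        super().__init__()
        self.model = model
        self.criterion = criterion

    def forward(self, inputs, targets):
        outputs = self.model(inputs)
        # The loss is computed on whichever device holds this replica.
        loss = self.criterion(outputs, targets)
        # Return shape (1,) so DataParallel can gather one value per replica.
        return loss.unsqueeze(0)


def train_step(wrapped, optimizer, inputs, targets):
    optimizer.zero_grad()
    per_replica_loss = wrapped(inputs, targets)  # one entry per replica
    loss = per_replica_loss.mean()               # cheap reduction on the default device
    loss.backward()                              # replica grads are summed into the original parameters
    optimizer.step()
    return loss.item()


# Placeholder model and data (assumptions for illustration only).
torch.manual_seed(0)
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model = nn.Linear(8, 3)
wrapped = nn.DataParallel(ModelWithLoss(model, nn.CrossEntropyLoss())).to(device)
optimizer = torch.optim.SGD(wrapped.parameters(), lr=0.1)

inputs = torch.randn(16, 8, device=device)
targets = torch.randint(0, 3, (16,), device=device)
loss_value = train_step(wrapped, optimizer, inputs, targets)
```

Without a GPU, `nn.DataParallel` simply calls the wrapped module directly, so the sketch also runs on CPU. Note that averaging the per-replica losses matches the full-batch mean only when the batch splits evenly across devices.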