CUDA: Out of memory error when using multi-gpu

DDP is not only used for multi-node training; it also speeds up single-node multi-GPU workloads.
The current proposal is to deprecate DataParallel and, accordingly, to ramp up the documentation on DDP.
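
For reference, here is a minimal sketch of single-node multi-GPU training with DistributedDataParallel, assuming a toy `nn.Linear` model, random tensor data, and a `torchrun` launch (the model, data, and script name are placeholders, not part of the original report):

```python
# Minimal single-node multi-GPU DDP sketch (hypothetical toy model and random data).
# Launch with: torchrun --nproc_per_node=<num_gpus> ddp_example.py
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset


def main():
    # torchrun sets LOCAL_RANK, RANK, and WORLD_SIZE for each spawned process.
    local_rank = int(os.environ["LOCAL_RANK"])
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(local_rank)

    # One model replica per process, each pinned to its own GPU.
    model = nn.Linear(32, 2).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])

    # DistributedSampler gives each process a disjoint shard of the dataset.
    dataset = TensorDataset(torch.randn(1024, 32), torch.randint(0, 2, (1024,)))
    sampler = DistributedSampler(dataset)
    loader = DataLoader(dataset, batch_size=64, sampler=sampler)

    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    for epoch in range(2):
        sampler.set_epoch(epoch)  # reshuffle the shards each epoch
        for inputs, targets in loader:
            inputs = inputs.cuda(local_rank, non_blocking=True)
            targets = targets.cuda(local_rank, non_blocking=True)
            optimizer.zero_grad()
            loss = criterion(model(inputs), targets)
            loss.backward()  # gradients are all-reduced across GPUs here
            optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

Unlike DataParallel, which replicates the model from a single process and can concentrate memory pressure on GPU 0, each DDP process holds only its own replica and batch shard, which is one reason it tends to avoid the out-of-memory behavior described above.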