I got the DataParallel() example for multi-GPU working successfully for my model.
However, I happen to have a 64-core Xeon Phi CPU, and I can’t stand looking at it sitting idle. How can I saturate the CPU by assigning it some work? In other words, can I split the workload across GPU0, GPU1, and the Phi’s hardware threads (CPU0-255), and sum the gradients on the CPU?
I don’t know very much about Xeon Phi, but I believe Intel ships an MKL build that parallelizes BLAS calls across the whole chip, so it behaves roughly like a single device. I would run some experiments comparing the speed of each part of your network on each device, then use model parallelism to place each submodule on the device where it runs best. Alternatively, you could subclass or modify DataParallel to allow the CPU (Phi) to be one of the included devices.
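The model-parallel approach above might look something like this minimal sketch. The module names and layer sizes are illustrative assumptions, not from the original post: one submodule is kept on the CPU (where MKL parallelizes the matrix math), the other is moved to a GPU, and activations are shipped between them in `forward`:

```python
import torch
import torch.nn as nn

class SplitModel(nn.Module):
    """Hypothetical split: first stage on CPU, second stage on GPU."""

    def __init__(self):
        super().__init__()
        self.cpu_part = nn.Linear(1024, 1024)   # stays on CPU; MKL parallelizes the GEMM
        self.gpu_part = nn.Linear(1024, 10)
        if torch.cuda.is_available():
            self.gpu_part = self.gpu_part.cuda()  # only this stage lives on the GPU

    def forward(self, x):
        h = self.cpu_part(x)          # computed on CPU
        if torch.cuda.is_available():
            h = h.cuda()              # ship activations across to the GPU stage
        return self.gpu_part(h)

model = SplitModel()
out = model(torch.randn(8, 1024))
```

With timings per submodule in hand, you would assign each stage to whichever device it runs fastest on; the transfer in `forward` is the price you pay for crossing devices.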
CPUs don’t have device IDs; there is just one kind of CPU tensor, and operations on them are implemented in TH and farmed out to MKL, which then has its own strategy for parallelizing over the Phi.
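In other words, you don’t address CPU threads the way you address GPUs; you just let the BLAS backend use its thread pool, which you can inspect and cap from PyTorch. A small sketch (the thread count of 4 is an arbitrary example):

```python
import torch

# CPU tensors carry no device index; parallelism comes from the
# backend's intra-op thread pool (MKL/OpenMP).
torch.set_num_threads(4)          # cap the intra-op thread count
n = torch.get_num_threads()       # read back the current setting

a = torch.randn(512, 512)
b = torch.randn(512, 512)
c = a @ b                         # this GEMM is parallelized by MKL across CPU threads
```

On a Phi, MKL decides how to spread that GEMM over the chip's cores; there is no per-thread device to target from the PyTorch side.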