Single-node multi-GPU + multi-CPU parallelism?

I got the DataParallel() example for multi-GPU working successfully for my model.
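For reference, the kind of DataParallel setup described above can be sketched roughly like this (the `nn.Linear` model and batch size here are placeholders, not the actual model from the post):

```python
import torch
import torch.nn as nn

# Placeholder model; substitute your own module.
model = nn.Linear(10, 2)

if torch.cuda.device_count() > 1:
    # Replicates the module onto each listed GPU and splits the
    # batch dimension of the input across them.
    model = nn.DataParallel(model, device_ids=[0, 1]).cuda()

x = torch.randn(8, 10)
if torch.cuda.is_available():
    x = x.cuda()

# Gradients are reduced back onto device_ids[0] during backward.
out = model(x)
```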

However, I happen to have a 64-core Xeon Phi CPU, and I can’t stand watching it sit idle. How can I saturate the CPU by assigning it some of the work? In other words, can I split the workload across GPU0, GPU1, and CPU0-255 and sum the gradients on the CPU?

Thank you!

I don’t know very much about Xeon Phi, but I believe that Intel has an MKL version that will parallelize BLAS calls over the whole chip, making it work like a single GPU device. I would do some experiments to compare speed for each part of your network, then use model parallelism to put submodules on the device they work best on. Or you could subclass/modify the code for DataParallel to allow the CPU (Phi) to be one of the included devices.
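The model-parallelism approach suggested above can be sketched as follows. The split point and layer sizes are purely illustrative; the idea is just that one submodule lives on the GPU while another stays on the CPU, where MKL handles the BLAS parallelism:

```python
import torch
import torch.nn as nn

class SplitModel(nn.Module):
    """Hypothetical split: a front end on the GPU (if available),
    with the classifier head left on the CPU for MKL to parallelize."""

    def __init__(self, use_gpu):
        super().__init__()
        self.front = nn.Sequential(nn.Linear(100, 50), nn.ReLU())
        self.head = nn.Linear(50, 10)
        self.use_gpu = use_gpu
        if self.use_gpu:
            self.front = self.front.cuda()

    def forward(self, x):
        if self.use_gpu:
            # Move activations back to the CPU for the head.
            x = self.front(x.cuda()).cpu()
        else:
            x = self.front(x)
        return self.head(x)

model = SplitModel(torch.cuda.is_available())
out = model(torch.randn(4, 100))
```

The device-to-device copies at the split point are the main cost of this layout, so it only pays off when each submodule runs markedly faster on its assigned device.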


Thanks James.

I had a subclassing script for Keras based on Kuza55’s script: it replicated models to /gpu0, /gpu1, and /cpu0-255.

I will look into subclassing to use both CPU and GPU. What are the device IDs for CPUs inside PyTorch?

Keras kuza55 script:

CPUs don’t have device IDs; there’s just one kind of CPU tensor and operations with them are implemented in TH and farmed out to MKL, which would then have its own strategy for parallelizing over the Phi.
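A short sketch of what that means in practice: a CPU tensor has no index like `cuda:0`/`cuda:1`, and the only knob for how many cores the CPU backend uses is the global thread count (the `8` below is an arbitrary example, not a recommendation for the Phi):

```python
import torch

# CPU tensors all live on the single "cpu" device; there is no
# per-core or per-socket device ID to pass to DataParallel.
t = torch.randn(4, 4)
print(t.device)  # cpu

# Intra-op parallelism (MKL/OpenMP) is set globally instead:
torch.set_num_threads(8)
print(torch.get_num_threads())
```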
