Scaling data parallel training on a single machine with multiple CPUs (no GPUs)

What is the best way to scale data parallel training on a single machine with multiple CPUs (no GPUs)?

  • We need to speed up training for a customer, because the training dataset recently grew substantially
  • We can’t use GPUs, but we can increase the number of CPU cores and the amount of memory on a dedicated machine

I researched the usual options for accelerating PyTorch, but I can’t figure out what the “right” approach is for a single-machine, multi-CPU scenario:

1. DataParallel and DistributedDataParallel
From reading the documentation, I got the impression that DataParallel and DistributedDataParallel are designed to work only with GPUs.

Is that assumption correct?
If not, could you point me to a code sample for correctly setting those up for CPUs?
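
To make the question concrete, here is a minimal sketch of what we imagine a CPU-only DistributedDataParallel setup might look like. It assumes the "gloo" backend handles CPU tensors and that the model is wrapped without device_ids. The names train_worker and launch, the toy model, the random data, and port 29500 are all placeholders of ours, not taken from any documentation:

```python
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def train_worker(rank: int, world_size: int, port: int = 29500) -> float:
    """Run a few DDP steps on CPU in one process and return the last loss."""
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = str(port)
    # Assumption: "gloo" is the backend to use for CPU tensors.
    dist.init_process_group(backend="gloo", rank=rank, world_size=world_size)

    # Split the cores between workers so the processes do not oversubscribe.
    torch.set_num_threads(max(1, (os.cpu_count() or 1) // world_size))

    torch.manual_seed(0)  # same initial weights on every rank
    model = DDP(nn.Linear(10, 1))  # no device_ids for a CPU module
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.MSELoss()

    torch.manual_seed(rank + 1)  # different toy data on every rank
    loss = torch.tensor(0.0)
    for _ in range(5):
        inputs, targets = torch.randn(32, 10), torch.randn(32, 1)
        optimizer.zero_grad()
        loss = loss_fn(model(inputs), targets)
        loss.backward()  # gradients are averaged across ranks during backward
        optimizer.step()

    dist.destroy_process_group()
    return loss.item()


def launch(world_size: int = 2) -> None:
    """Start one training process per worker on the local machine."""
    mp.spawn(train_worker, args=(world_size,), nprocs=world_size, join=True)
```

We would call launch(world_size=...) from under an `if __name__ == "__main__":` guard. A real training loop would presumably also need a DistributedSampler so that each rank reads a different shard of the dataset.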

2. Vanilla PyTorch on CPUs
We tested our vanilla PyTorch training loop on a single 8-core CPU machine.
All cores were used during training, which implies

  • that PyTorch is somehow parallelizing across CPUs already
  • and that we could add cores to speed things up.

But what is happening under the hood exactly?
We are presumably not using data parallelism (or model parallelism, for that matter), because we are running our unaltered training code.

Is naively adding cores and memory in that setup the right approach?

3. Ray Train
We set up Ray Train (based on this) and it worked fine on a single machine with multiple CPUs. It even has an explicit use_gpu flag to disable GPU usage.

However, according to the documentation, Ray Train uses DistributedDataParallel under the hood. Given the questions raised in points 1 and 2 above, is Ray Train the recommended way to scale on a single CPU-only machine?

4. Did we miss anything?

Thank you!