Using multiple CPU cores for training

bbrighttaer · June 26, 2019, 5:55am

Hi @all,
I’m new to PyTorch and currently trying my hand at an MNIST model. I do not have a GPU, but I have 24 CPU cores (as reported by torch.get_num_threads()) and >100GB RAM. However, I do not observe any significant improvement in training speed when I use torch.set_num_threads(10); it seems to me that there is no difference between setting the number of threads and not setting it at all.
I would like to know how I can take advantage of the multiple CPU cores available during model training. I have also tried setting num_workers of the data loader but to no avail.

pietern · June 26, 2019, 6:49am

Setting the number of threads will only make some individual operations faster (e.g. big matrix multiplication or convolution), if they work on big tensors. For your example, this might accelerate some of the big fully connected layers, if you use a batch size that’s big enough. Alternatively, you can explore running more processes, and using torch.nn.parallel.DistributedDataParallel to parallelize across processes.
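
A minimal sketch of the point about tensor size, assuming a CPU-only PyTorch install. The helper name time_matmul and the sizes used are illustrative choices, not something from this thread. It times a square matrix multiplication under different torch.set_num_threads settings, so you can check on your own machine whether extra threads help for small versus large tensors.

```python
import time

import torch


def time_matmul(n: int, num_threads: int, repeats: int = 5) -> float:
    """Average seconds for an n x n matmul with the given intra-op thread count.

    Hypothetical helper for illustration only.
    """
    torch.set_num_threads(num_threads)
    a = torch.randn(n, n)
    b = torch.randn(n, n)
    torch.mm(a, b)  # warm-up so one-time initialization is not measured
    start = time.perf_counter()
    for _ in range(repeats):
        torch.mm(a, b)
    return (time.perf_counter() - start) / repeats


if __name__ == "__main__":
    # Small tensors usually gain little from threading; large ones may gain more.
    for n in (64, 1024):
        for threads in (1, 4):
            seconds = time_matmul(n, threads)
            print(f"n={n:5d} threads={threads}: {seconds * 1e3:.3f} ms")
```

The actual numbers depend on the machine and on the BLAS backend PyTorch was built with, so treat the output as a local measurement rather than a general result.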

bbrighttaer · June 26, 2019, 4:00pm

Thanks for your direction. I have tried using torch.nn.parallel.DistributedDataParallelCPU and the forward pass is able to utilize the number of processes I set (I assume that’s the same as cpu cores in my case). I followed the tutorial here. However, there’s a lengthy block, for what I think is the backward pass, before any forward pass is observed.
Any suggestion on how to address this?

pietern · June 27, 2019, 11:27am

What do you mean by a “length block”?

bbrighttaer · June 27, 2019, 12:58pm

Sorry, it’s a typo. I mean a ‘lengthy block’ of all forward pass ops before the spawned processes do the next forward pass.

pietern · June 27, 2019, 1:29pm

Do you mean a lengthy block of time? That you observe upon starting the processes?

It is possible that the first forward pass takes a bit longer than subsequent ones due to memory allocation and general initialization of all the operators/backends.

bbrighttaer · June 27, 2019, 1:49pm

Yes please, lengthy block of time.

pietern · June 27, 2019, 2:09pm

If this only happens in the first iteration, it’s likely memory allocation and initialization stuff. If subsequent iterations also take longer than you expect, it is possible you have started too many processes and are overloading your system.
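
One way to reduce the risk of overloading the system is to keep (number of processes) x (threads per process) at or below the number of cores. The sketch below uses only the standard library; threads_per_process is a hypothetical helper name, and the even split is an assumption, not a rule from PyTorch.

```python
import os
from typing import Optional


def threads_per_process(num_processes: int, total_cores: Optional[int] = None) -> int:
    """Evenly divide the available cores among processes (at least 1 thread each).

    Hypothetical helper: total_cores defaults to os.cpu_count().
    """
    if num_processes < 1:
        raise ValueError("num_processes must be at least 1")
    if total_cores is None:
        total_cores = os.cpu_count() or 1  # cpu_count() can return None
    return max(1, total_cores // num_processes)


if __name__ == "__main__":
    # With 24 cores and 4 training processes, each process gets 6 threads.
    print(threads_per_process(4, total_cores=24))
```

Each worker process could then call torch.set_num_threads with this value before training starts. If the returned value is 1 for every process, there are probably more processes than cores.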

Murtaza_Basu · March 20, 2020, 11:27am

Is torch.nn.parallel.DistributedDataParallel only applicable to GPU and not to CPU with multi cores?

mrshenli · March 20, 2020, 2:33pm

It works with multi-core CPUs. From the DistributedDataParallel doc:

For multi-device modules and CPU modules, device_ids must be None or an empty list, and input data for the forward pass must be placed on the correct device.

The thing is that, since there is only one “cpu” device in PyTorch, you cannot use the device_ids arg of the DistributedDataParallel constructor to specify which cores a DDP process runs on. However, you should still be able to set the CPU affinity for each process independently.
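
A rough sketch of how this could look, under several assumptions: the gloo backend, two processes on one machine, a free port 29500, and Linux for os.sched_setaffinity. The helper cores_for_rank, the worker function, and the toy linear model on random data are all illustrative and not taken from this thread or the tutorial.

```python
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def cores_for_rank(rank, world_size, cores):
    """Pick a contiguous chunk of core ids for one rank (hypothetical helper)."""
    per_rank = max(1, len(cores) // world_size)
    start = (rank * per_rank) % len(cores)  # wrap if ranks outnumber cores
    return set(cores[start:start + per_rank])


def worker(rank, world_size):
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")  # assumed to be free

    if hasattr(os, "sched_setaffinity"):  # only available on Linux
        allowed = sorted(os.sched_getaffinity(0))
        mine = cores_for_rank(rank, world_size, allowed)
        # Affects the calling thread and threads it starts afterwards.
        os.sched_setaffinity(0, mine)
        torch.set_num_threads(len(mine))

    dist.init_process_group("gloo", rank=rank, world_size=world_size)
    model = DDP(nn.Linear(784, 10))  # CPU module: device_ids stays None
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    for _ in range(3):  # a few steps on random data, just to exercise DDP
        inputs = torch.randn(64, 784)
        targets = torch.randint(0, 10, (64,))
        loss = nn.functional.cross_entropy(model(inputs), targets)
        optimizer.zero_grad()
        loss.backward()  # gradients are averaged across processes here
        optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    world_size = 2
    mp.spawn(worker, args=(world_size,), nprocs=world_size)
```

On platforms without os.sched_setaffinity the affinity step is skipped and the processes share all cores, so in that case it may be worth setting the thread count per process explicitly.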