I’m new to PyTorch and am currently trying my hand at an MNIST model. I do not have a GPU, but I have 24 CPU cores (as reported by torch.get_num_threads()) and more than 100 GB of RAM. However, I do not observe any significant improvement in training speed when I call torch.set_num_threads(10); there seems to be no difference between setting the number of threads and not setting it at all.
I would like to know how I can take advantage of the multiple CPU cores available during model training. I have also tried setting num_workers of the data loader but to no avail.
Setting the number of threads only speeds up individual operations (e.g. a large matrix multiplication or convolution), and only when they work on large tensors. For your example, this might accelerate some of the larger fully connected layers if you use a big enough batch size. Alternatively, you can explore running more processes and using torch.nn.parallel.DistributedDataParallel to parallelize across them.
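To see whether the thread setting matters for a given operation size, one option is to time a matrix multiplication under different thread counts. The sketch below is only illustrative: the helper name time_matmul and the sizes and thread counts are assumptions chosen for the example, not values from this thread, and the actual numbers will depend on the machine and the PyTorch build.

```python
import time

import torch


def time_matmul(size, num_threads, repeats=5):
    """Return the average seconds for a size x size matmul with num_threads threads."""
    previous = torch.get_num_threads()
    torch.set_num_threads(num_threads)
    try:
        a = torch.randn(size, size)
        b = torch.randn(size, size)
        torch.mm(a, b)  # warm-up so one-time initialization is not timed
        start = time.perf_counter()
        for _ in range(repeats):
            torch.mm(a, b)
        return (time.perf_counter() - start) / repeats
    finally:
        torch.set_num_threads(previous)  # restore the earlier setting


if __name__ == "__main__":
    # Illustrative sizes: a small and a larger square matrix.
    for size in (64, 1024):
        for threads in (1, 10):
            elapsed = time_matmul(size, threads)
            print(f"size={size} threads={threads}: {elapsed:.6f} s")
```

If the small case shows little or no gain from extra threads while the larger one does, that would be consistent with the point above about tensor size.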
Thanks for the pointer. I have tried using torch.nn.parallel.DistributedDataParallelCPU, and the forward pass is able to use the number of processes I set (I assume that’s the same as CPU cores in my case). I followed the tutorial here. However, there’s a lengthy block, for what I think is the backward pass, before any forward pass is observed.
Any suggestion on how to address this?
What do you mean by a “length block”?
Sorry, it’s a typo. I mean a ‘lengthy block’ of all forward pass ops before the spawned processes do the next forward pass.
Do you mean a lengthy block of time that you observe upon starting the processes?
It is possible that the first forward pass takes a bit longer than subsequent ones due to memory allocation and general initialization of the operators/backends.
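One way to check this is to time each iteration separately and compare the first with the rest. The following is a minimal sketch on random MNIST-shaped data; the model, batch size, and the helper name time_iterations are assumptions made for illustration and are not taken from the original training script.

```python
import time

import torch
import torch.nn as nn


def time_iterations(num_iters=5, batch_size=64):
    """Return wall-clock seconds for each training iteration on random data."""
    model = nn.Sequential(
        nn.Flatten(),
        nn.Linear(28 * 28, 256),
        nn.ReLU(),
        nn.Linear(256, 10),
    )
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.CrossEntropyLoss()
    timings = []
    for _ in range(num_iters):
        # Random stand-ins for a batch of MNIST images and labels.
        inputs = torch.randn(batch_size, 1, 28, 28)
        targets = torch.randint(0, 10, (batch_size,))
        start = time.perf_counter()
        optimizer.zero_grad()
        loss = loss_fn(model(inputs), targets)
        loss.backward()
        optimizer.step()
        timings.append(time.perf_counter() - start)
    return timings


if __name__ == "__main__":
    for i, seconds in enumerate(time_iterations()):
        print(f"iteration {i}: {seconds:.6f} s")
```

If only iteration 0 stands out, the delay is probably one-time setup; if every iteration is slow, the cause is more likely elsewhere.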
Yes, a lengthy block of time.
If this only happens in the first iteration, it’s likely memory allocation and initialization stuff. If subsequent iterations also take longer than you expect, it is possible you have started too many processes and are overloading your system.
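A common way to avoid overloading the machine is to keep processes × threads per process at or below the number of cores. The sketch below shows one possible way to do that; the helper names threads_per_process and limit_threads are hypothetical, and an even split is just one reasonable policy.

```python
import os

import torch


def threads_per_process(total_cores, num_processes):
    """Split cores evenly across processes, giving each at least one thread."""
    if num_processes < 1:
        raise ValueError("num_processes must be at least 1")
    return max(1, total_cores // num_processes)


def limit_threads(num_processes):
    """Cap this process's intra-op threads; intended to run at the start of each worker."""
    total = os.cpu_count() or 1
    budget = threads_per_process(total, num_processes)
    torch.set_num_threads(budget)
    return budget


if __name__ == "__main__":
    # With 24 cores and 4 processes, each process would get 6 threads.
    print(threads_per_process(24, 4))
```

With 24 cores, for example, 4 processes would each get 6 threads under this scheme, rather than every process trying to use all 24.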
Is torch.nn.parallel.DistributedDataParallel only applicable to GPUs and not to multi-core CPUs?
It works with multi-core CPUs. From the documentation:
For multi-device modules and CPU modules, device_ids must be None or an empty list, and input data for the forward pass must be placed on the correct device.
The catch is that, because there is only one “cpu” device in PyTorch, you cannot use the device_ids argument of the DistributedDataParallel constructor to specify which cores a DDP process runs on. However, you should still be able to set the CPU affinity for each process independently.
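As a rough sketch of what that could look like: each process pins itself to its own slice of cores and then wraps a CPU model in DistributedDataParallel without passing device_ids. The helper names cores_for_rank and run_rank, the contiguous core split, and the tiny linear model are all assumptions for illustration. os.sched_setaffinity is Linux-only, and as far as I know it applies to the calling thread and is inherited by threads created afterwards, so it is probably best called early in each process.

```python
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def cores_for_rank(rank, world_size, all_cores):
    """Give each rank a contiguous slice of cores (disjoint when cores >= ranks)."""
    per_rank = max(1, len(all_cores) // world_size)
    start = (rank * per_rank) % len(all_cores)
    return set(all_cores[start:start + per_rank])


def run_rank(rank, world_size, init_method):
    """Run one DDP training step on CPU; meant to be called once per process."""
    if hasattr(os, "sched_setaffinity"):  # Linux-only API
        available = sorted(os.sched_getaffinity(0))
        os.sched_setaffinity(0, cores_for_rank(rank, world_size, available))
    dist.init_process_group(
        "gloo", init_method=init_method, rank=rank, world_size=world_size
    )
    try:
        # CPU module: device_ids is left at its default of None.
        model = DDP(nn.Linear(28 * 28, 10))
        optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
        # Random stand-ins for a batch of flattened MNIST images and labels.
        inputs = torch.randn(32, 28 * 28)
        targets = torch.randint(0, 10, (32,))
        loss = nn.functional.cross_entropy(model(inputs), targets)
        loss.backward()  # gradients are averaged across ranks during backward
        optimizer.step()
        return loss.item()
    finally:
        dist.destroy_process_group()
```

In a real run, run_rank would be started once per process (for example through torch.multiprocessing.spawn or a launcher) with the same init_method, such as a tcp:// or file:// URL, passed to every rank. An alternative that needs no code changes is to pin each process from the outside with a tool like taskset.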