How does PyTorch parallelize across multiple CPU cores?

We can set the number of cores used for CPU training with torch.set_num_threads(). But how exactly does PyTorch parallelize across multiple cores (no batching is involved here)? I've seen a custom training loop (not in PyTorch) that distributes the data across multiple cores with a copy of the model on each core: the loss is evaluated on each core, the results are returned to the main process, the optimizer is called, and the updated model is sent back to the cores, and the process repeats. How does this compare to what PyTorch does? Thanks!
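For reference, the custom scheme I mean (model replica per core, per-shard gradients gathered and averaged in the main process, then one optimizer step) looks roughly like this. This is just a toy pure-Python illustration with a made-up one-parameter linear model `y = w * x` and a thread pool standing in for the worker cores, not code from the actual loop:

```python
from concurrent.futures import ThreadPoolExecutor

def shard_grad(w, shard):
    # Gradient of mean squared error of y = w * x over one data shard.
    return sum(2.0 * (w * x - t) * x for x, t in shard) / len(shard)

def train(shards, w=0.0, lr=0.05, steps=20):
    # One worker per shard; each holds the current "model" (just w here).
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        for _ in range(steps):
            # Workers evaluate gradients on their shards in parallel.
            grads = list(pool.map(lambda s: shard_grad(w, s), shards))
            # Main process averages gradients and takes the optimizer step,
            # then the updated w is implicitly "sent back" on the next call.
            w -= lr * sum(grads) / len(grads)
    return w

# Data with true w = 3, split across two "cores".
shards = [[(1, 3), (2, 6)], [(3, 9), (4, 12)]]
print(train(shards))  # converges toward 3.0
```

A real implementation would use processes (or `torch.multiprocessing`) rather than threads to actually use multiple cores from Python, but the data flow is the same.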


The low-level CPU compute libraries that we use, like MKL, parallelize a single operator across multiple threads if that single task is big enough (intra-op parallelism).
torch.set_num_threads() selects the number of threads available to each of these libraries; there is no replication of the model or data as in the loop you describe.
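So on CPU the parallelism happens inside individual ops rather than across model copies. A small sketch of what that means in practice (the matrix sizes are arbitrary, chosen just so the matmul is large enough to be worth multithreading):

```python
import torch

# Limit intra-op parallelism: the number of threads used *inside*
# a single op (e.g. a matmul dispatched to MKL/OpenMP).
torch.set_num_threads(4)

a = torch.randn(2000, 2000)
b = torch.randn(2000, 2000)

# This one matmul is internally split across up to 4 threads;
# small ops may still run single-threaded if splitting isn't worth it.
c = a @ b
```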