Multi-processing expected performance

I’m experimenting with PyTorch for simple regression computations.

It's a simple model using nn.Linear:

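import time

import torch
import torch.nn as nn

# x_train and y_actual are float tensors built from the (private) dataset.
model = nn.Linear(x_train.size(1), 1)
criterion = nn.MSELoss()
optimiser = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.95)

epochs = 30000
last_loss = 0
t = time.time()
for epoch in range(1, epochs + 1):
    optimiser.zero_grad()  # clear gradients left over from the previous step
    outputs = model(x_train)
    loss = criterion(outputs, y_actual)
    loss.backward()  # backpropagate
    optimiser.step()  # update the parameters
    if abs(last_loss - loss.item()) < 1e-5:
        break  # loss has stopped changing
    last_loss = loss.item()
print('epoch {}, loss {}'.format(epoch, loss.item()))
print(time.time() - t)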

I first tried on my MacBook, and it used 8 cores. It trains in ~50 sec on my dataset.
I then tried on a dual Xeon machine with 32 cores and 2 GPUs. I first compared using just the CPU, and to my surprise there was barely any difference between the MacBook and the 32 cores.

Reading about the multi-processing settings, I found that torch.set_num_threads() doesn’t seem to have any effect.
Some other posts suggested setting the OMP_NUM_THREADS and MKL_NUM_THREADS env variables.

So I tried that.
What I observe is that OMP_NUM_THREADS and MKL_NUM_THREADS do work (unlike set_num_threads, which made no difference), but on the dual Xeon, setting them to 1 runs 1 thread, while 2 runs 4, 3 runs 6, 4 runs 8, 5 runs 10, etc…

That seems like a bug.

But what confuses me more is that the timing doesn’t scale with the number of threads:

4 threads -> 1:35
6 threads -> 1:07
8 threads -> 52sec (equivalent to the MacBook's 8 cores)
10 threads -> 45sec
12 threads -> 41sec
14 threads and over -> oscillates between 37 and 39sec, with no further improvement

So, I can understand that multiprocessing has some overhead, but I was expecting less overhead and a more drastic improvement…

Can anyone explain or point me to some explanation of the expected scaling?

PS: my dataset is proprietary, sorry I cannot share. It’s about 65k rows, 26 features, so it’s not huge.


With multiprocessing, each batch is split across all of the GPUs or CPUs you specify. This is especially helpful when training large networks with an effective batch size > 1.

Since your network consists of only one Linear layer, the forward pass is really fast. The problem with parallel training is that it has to synchronize at some points (e.g. to accumulate gradients). Synchronization usually happens at data input and at prediction output/loss calculation.
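
To put a rough number on "really fast", here is a back-of-the-envelope sketch using only the figures from the question (about 65k rows, 26 features, 30000 epochs, ~50 sec on the MacBook). It assumes all 30000 epochs actually ran and counts one multiply and one add per input value, so treat the results as order-of-magnitude estimates rather than measurements.

```python
# Figures taken from the question; "about 65k rows" is rounded.
rows, features = 65_000, 26
epochs = 30_000        # assumption: the run did not stop early
wall_seconds = 50.0    # approximate MacBook timing from the question

# Forward pass of a single Linear layer with one output:
# one multiply and one add per (row, feature) pair.
forward_flops = 2 * rows * features

# Average wall time available for one full forward/backward/update step.
seconds_per_epoch = wall_seconds / epochs

print(f"forward pass:   {forward_flops / 1e6:.2f} MFLOP")
print(f"time per epoch: {seconds_per_epoch * 1e3:.2f} ms")
```

Each step is only a few million floating-point operations and takes a couple of milliseconds in total, so the cost of starting and synchronizing the threads is not small next to the work they are handed.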

So your model (at least the really fast forward pass) is indeed executed in parallel, but the synchronization needed to accumulate gradients serializes part of every step, and that serial part does not shrink as you add threads. Compare this with a huge network made up of many convolutional layers: the forward pass takes much longer, that time is spent in the parallel part, and so the gains from parallelism are much higher.
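
As a rough illustration of the expected scaling, the sketch below fits a simple Amdahl-style model, T(n) = serial + parallel / n, to two of the timings from the question (4 and 8 threads) and compares it with the others. The helper names fit_two_points and predict are made up for this example, and the model is an assumption: it treats the serial cost as constant, which real threading overhead need not be.

```python
def fit_two_points(n_a, t_a, n_b, t_b):
    """Solve T(n) = serial + parallel / n through two measurements."""
    parallel = (t_a - t_b) / (1.0 / n_a - 1.0 / n_b)
    serial = t_a - parallel / n_a
    return serial, parallel


def predict(serial, parallel, n):
    """Predicted wall time in seconds on n threads."""
    return serial + parallel / n


# Timings reported in the question, in seconds, keyed by thread count
# (1:35 -> 95, 1:07 -> 67).
observed = {4: 95, 6: 67, 8: 52, 10: 45, 12: 41}

# Fit on the 4- and 8-thread measurements only.
serial, parallel = fit_two_points(4, observed[4], 8, observed[8])

for n, t in observed.items():
    model_t = predict(serial, parallel, n)
    print(f"{n:2d} threads: measured {t:3d}s, model {model_t:5.1f}s")
```

Under this model the speedup flattens as the parallel / n term shrinks and the fixed part takes over. The measurements in the question level off at 37 to 39 sec, which is earlier than this simple model would predict, and that is consistent with the synchronization cost growing as threads are added rather than staying fixed.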