I’m experimenting with PyTorch for simple regression computations.
A simple model, using nn.Linear:
import torch
import torch.nn as nn
from datetime import datetime

model = nn.Linear(x_train.size(1), 1)  # x_train and y_actual come from my (proprietary) dataset
criterion = nn.MSELoss()
optimiser = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.95)

epochs = 30000
last_loss = 0.0
t = datetime.now()
for epoch in range(1, epochs + 1):
    optimiser.zero_grad()
    outputs = model(x_train)              # calling the module invokes forward()
    loss = criterion(outputs, y_actual)
    loss.backward()                       # back-propagation
    optimiser.step()                      # update the parameters
    current_loss = loss.item()
    if abs(last_loss - current_loss) < 1e-5:  # stop once the loss plateaus
        break
    last_loss = current_loss
print('epoch {}, loss {}'.format(epoch, current_loss))
print(datetime.now() - t)
I first tried it on my MacBook, where it used 8 cores and trained in ~50 sec on my dataset.
I then tried a dual-Xeon machine with 32 cores and 2 GPUs. Comparing CPU-only runs first, to my surprise there was barely any difference between the 4 cores and the 32 cores.
Reading about the multiprocessing variables, I found that torch.set_num_threads() doesn’t seem to have any effect. Some other posts suggested setting the OMP_NUM_THREADS and MKL_NUM_THREADS environment variables instead, so I tried that.
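Concretely, something like this, since as far as I understand the OMP/MKL variables have to be read before torch is imported (setting them from Python here just for illustration; I normally export them in the shell):

import os
os.environ["OMP_NUM_THREADS"] = "8"
os.environ["MKL_NUM_THREADS"] = "8"   # must be set before importing torch

import torch
torch.set_num_threads(8)              # tried this too, for completeness
print(torch.get_num_threads())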
What I observe is that OMP_NUM_THREADS and MKL_NUM_THREADS do work (unlike set_num_threads, which didn’t make a difference), but on the dual Xeon the thread count doubles: setting them to 1 runs 1 thread, but 2 runs 4 threads, 3 runs 6, 4 runs 8, 5 runs 10, etc.
That seems like a bug.
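For what it’s worth, here is a quick way to compare what PyTorch reports with how many threads the process actually spawns (psutil is an extra dependency, used only for the OS-level count):

import torch
import psutil

print(torch.get_num_threads())           # intra-op thread setting
print(torch.__config__.parallel_info())  # OpenMP / MKL details as PyTorch sees them
print(psutil.Process().num_threads())    # actual OS-level thread count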
But more confusing to me is that the timing doesn’t scale with the number of threads:

4 threads -> 1:35
6 threads -> 1:07
8 threads -> 52 sec (equivalent to the MacBook’s 8 cores)
10 threads -> 45 sec
12 threads -> 41 sec
14 threads and over -> oscillates between 37 and 39 sec, with no further improvement
So, I can understand that multithreading has some overhead, but I was expecting less of it and a more drastic improvement.
Can anyone explain, or point me to an explanation of, the expected scaling?
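In case it helps narrow this down, here is a minimal standalone benchmark (random tensors shaped roughly like my dataset, since I can’t share the real one) to check whether the plateau comes from the underlying matmul rather than from my training loop:

import torch
from datetime import datetime

torch.set_num_threads(8)      # or control it via OMP_NUM_THREADS / MKL_NUM_THREADS

x = torch.randn(65000, 26)    # roughly the shape of my real data
w = torch.randn(26, 1)

t = datetime.now()
for _ in range(30000):
    y = x @ w
print(datetime.now() - t)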
PS: my dataset is proprietary, sorry I can’t share it. It’s about 65k rows, 26 features, so it’s not huge.
Thanks