Gpu slower than cpu for some operations -- pytorch 0.3.0

Hello Forum!

I have a simple model whose training slows down when moved to
the gpu. Details follow, but first, here are the timings:

20,000 batch training iterations:
cpu: 23.93 secs.
gpu: 37.19 secs.

However, the gpu is not slower for all operations:

20,000 batch training iterations + 2,000 test evaluations:
cpu: 111.94 secs.
gpu: 61.47 secs.

Please note: this is with pytorch 0.3.0 on an old laptop
gpu: “torch.cuda.get_device_name (0) = Quadro K1100M”.

I don’t necessarily expect the gpu to be significantly faster
than the cpu, but I was surprised that it was this much slower.

Is this something I should expect? Are there known pitfalls
that can slow down the gpu that I should be looking for?
Is this likely to be a 0.3.0 or old gpu issue?

Now some details: This is a simple, single-hidden-layer
model that performs digit classification (not MNIST, but
similar). The training and test inputs are in a single
9298 x 256 FloatTensor, and the training and test labels
are in a single 9298 LongTensor. (The first 5000 items
are used for training and the remaining 4298 for test.)

The training batch size is 25; thus 20,000 predict-loss-
backward-optimize (with a batch accuracy thrown in) training
steps correspond to 100 epochs. When testing is turned
on, after every ten training iterations, a predict-loss-
accuracy computation is performed on the full 4298-sample
test set, processed as a single batch. (The accuracy
computations do not contribute significantly to the timings.)

The full training / test tensors are moved to the gpu before
the training loop. Every epoch, a new random permutation of
indices into the training set, used to select random batches,
is generated on the cpu and moved to the gpu. (The repeated
random-index permutations do not contribute significantly to
the timings.)
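
For concreteness, here is roughly what that setup looks like (just a
sketch -- the variable names are hypothetical, and the actual
predict-loss-backward-optimize step is omitted):

import torch

# sketch of the batching scheme described above (names are hypothetical)
nTrain, batchSize, nEpochs = 5000, 25, 100

inputs = torch.randn (9298, 256)                  # stand-in for the real inputs
labels = torch.LongTensor (9298).random_ (0, 10)  # stand-in for the real labels

inputs = inputs.cuda()   # full dataset moved to the gpu once, before the loop
labels = labels.cuda()

for epoch in range (nEpochs):
    # new random permutation of the training indices, generated on the cpu ...
    perm = torch.randperm (nTrain)
    # ... and moved to the gpu once per epoch
    perm = perm.cuda()
    for b in range (nTrain // batchSize):
        batchInd = perm[b * batchSize : (b + 1) * batchSize]
        batchInputs = inputs[batchInd]   # batch selection happens on the gpu
        batchLabels = labels[batchInd]
        # predict-loss-backward-optimize step (and batch accuracy) goes here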

Naively, the time to run the training and test should be simply
additive. Thus:

20,000 training iterations (batch size = 25):
cpu: 23.93 secs.
gpu: 37.19 secs.

2,000 test evaluations (batch size = 4298):
cpu: 88.01 secs.
gpu: 24.28 secs.
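
(That is, subtracting the training-only timings from the combined
timings: 111.94 - 23.93 = 88.01 secs. for the cpu test evaluations,
and 61.47 - 37.19 = 24.28 secs. for the gpu.)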

It makes sense to me that the large batch size might benefit
significantly from the gpu, but I’m surprised that the training
iterations are slower on the gpu, even though the batch size
is “only” 25. Remember, the entire dataset was moved to the
gpu before starting the training loop.

For completeness, here is my model:

model = Sequential(
  (0): Linear(in_features=256, out_features=512)
  (1): Tanh()
  (2): Linear(in_features=512, out_features=10)
)

and I use nn.CrossEntropyLoss() for the loss.
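
For reference, a single training step with this model looks roughly
like the following (a sketch; the SGD optimizer and learning rate
here are stand-ins, not necessarily what I actually used):

import torch
import torch.nn as nn

# sketch of one predict-loss-backward-optimize step
# (the optimizer and learning rate are stand-ins)
model = nn.Sequential (
    nn.Linear (256, 512),
    nn.Tanh(),
    nn.Linear (512, 10),
).cuda()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD (model.parameters(), lr = 0.1)

batchInputs = torch.autograd.Variable (torch.randn (25, 256).cuda())
batchLabels = torch.autograd.Variable (torch.LongTensor (25).random_ (0, 10).cuda())

preds = model (batchInputs)             # forward pass
loss = criterion (preds, batchLabels)   # cross-entropy loss
optimizer.zero_grad()
loss.backward()                         # backward pass
optimizer.step()                        # parameter update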

Have I missed something somewhere that is slowing down the
gpu, or is this to be expected?

Thanks.

K. Frank

Are you creating some synchronization points in your training loop, e.g. via print statements?
I assume I could profile the code via:

data = torch.randn(9298, 256)
target = torch.randint(0, 10, (9298,))
model = # your definition

If I remember correctly, you cannot install a newer PyTorch version, so I would at least want to make sure we don’t have a regression in the latest version.
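
i.e., filling in the model definition from your post, something roughly like this (the batch slicing at the end is just an example):

import torch
import torch.nn as nn

# rough profiling setup, filling in the model definition from the post above
data = torch.randn(9298, 256)
target = torch.randint(0, 10, (9298,))
model = nn.Sequential(
    nn.Linear(256, 512),
    nn.Tanh(),
    nn.Linear(512, 10),
)
criterion = nn.CrossEntropyLoss()

# e.g. one forward/backward pass over a 25-sample batch
out = model(data[:25])
loss = criterion(out, target[:25])
loss.backward()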

Hi @ptrblck!

It looks like this is a batch-size issue.

I ran some timings with dummy data. Note, this is just a loop
over the forward pass – no loss, backward, or optimization.

For large batch sizes (>~ 1000) the gpu runs much faster than
the cpu; for the batch size of 25 I’ve been using with my toy
model, the gpu runs rather more slowly than the cpu.

I’m running a toy model, and using a batch size of 25 for no
particular reason, so I don’t really care. But I would be
curious whether this behavior persists in newer versions of
pytorch and on other cpus, and if there is a way to run “small”
batch sizes efficiently on the gpu.

Here is the timing script:

import time

import torch
print  (torch.__version__)

batchSize = 25
nEpoch = 100  # notional epoch of 5000 samples
nBatch = (nEpoch * 5000) // batchSize

data = torch.autograd.Variable (torch.randn (9298, 256))
batchInd = torch.LongTensor (range (batchSize))
model = torch.nn.Sequential(
    torch.nn.Linear (256, 512),
    torch.nn.Tanh(),
    torch.nn.Linear (512, 10),
)

print ('run cpu ...')
cpuTime = time.time()
for  i in range (nBatch):
    preds = model (data[batchInd])
cpuTime = time.time() - cpuTime

print ('torch.cuda.get_device_name (0) =', torch.cuda.get_device_name (0))
data = data.cuda()
batchInd = batchInd.cuda()
model.cuda()

print ('run gpu ...')
gpuTime = time.time()
for  i in range (nBatch):
    preds = model (data[batchInd])
gpuTime = time.time() - gpuTime

print  ('batchSize =', batchSize, ', nBatch =', nBatch, ', cpuTime =', cpuTime, ', gpuTime =', gpuTime)

Here is the complete output for batchSize = 25:

0.3.0b0+591e73e
run cpu ...
torch.cuda.get_device_name (0) = Quadro K1100M
run gpu ...
batchSize = 25 , nBatch = 20000 , cpuTime = 5.642305850982666 , gpuTime = 7.473216533660889

And here are timings for a series of batch sizes:

batchSize = 5 , nBatch = 100000 , cpuTime = 11.414417505264282 , gpuTime = 28.121765613555908
batchSize = 10 , nBatch = 50000 , cpuTime = 7.832420349121094 , gpuTime = 14.703708171844482
batchSize = 25 , nBatch = 20000 , cpuTime = 5.642305850982666 , gpuTime = 7.473216533660889
batchSize = 50 , nBatch = 10000 , cpuTime = 4.890258312225342 , gpuTime = 3.968475341796875
batchSize = 100 , nBatch = 5000 , cpuTime = 4.532914876937866 , gpuTime = 2.2654550075531006
batchSize = 250 , nBatch = 2000 , cpuTime = 4.889425277709961 , gpuTime = 1.4718339443206787
batchSize = 500 , nBatch = 1000 , cpuTime = 5.050147533416748 , gpuTime = 1.1878798007965088
batchSize = 1000 , nBatch = 500 , cpuTime = 5.072073698043823 , gpuTime = 0.7794077396392822
batchSize = 2500 , nBatch = 200 , cpuTime = 4.958431720733643 , gpuTime = 0.1718151569366455
batchSize = 5000 , nBatch = 100 , cpuTime = 4.986212730407715 , gpuTime = 0.1562213897705078

(The cpu timings of roughly 5 seconds and below are all “about the
same”, with the differences within the run-to-run noise.)

Thanks.

K. Frank

Edit: I also tried taking the indexing out, i.e.,

data = torch.autograd.Variable (torch.randn (batchSize, 256))
...
for  i in range (nBatch):
    preds = model (data)

The timings became modestly smaller, but they showed the same
pattern with respect to the batch size and gpu vs. cpu.

Thanks for the code!
Note that CUDA operations are asynchronous, so you would need to synchronize before starting and stopping the timer for the GPU operations via torch.cuda.synchronize().

Currently you might mostly be timing the kernel launch overhead, which is higher for a smaller batch size (since more launches are needed).

Could you add the synchronization and profile the code again?
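
For reference, a minimal timing pattern using CUDA events would look something like this (a sketch, not your exact script):

import torch

# minimal sketch of gpu timing with CUDA events
# (an alternative to wrapping the timed region in torch.cuda.synchronize() calls)
model = torch.nn.Linear(256, 512).cuda()
x = torch.autograd.Variable(torch.randn(25, 256).cuda())

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

start.record()
for _ in range(1000):
    out = model(x)
end.record()

torch.cuda.synchronize()                # wait for all queued kernels to finish
print(start.elapsed_time(end), 'ms')    # elapsed gpu time in milliseconds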

Hello @ptrblck!

Here I add the synchronization:

torch.cuda.synchronize()
gpuTime = time.time()
for  i in range (nBatch):
    preds = model (data[batchInd])
torch.cuda.synchronize()
gpuTime = time.time() - gpuTime

And here are the synchronized timings for the batch sizes:

batchSize = 5 , nBatch = 100000 , cpuTime = 11.282513856887817 , gpuTime = 27.322113752365112
batchSize = 10 , nBatch = 50000 , cpuTime = 7.730835914611816 , gpuTime = 14.358086585998535
batchSize = 25 , nBatch = 20000 , cpuTime = 5.687099933624268 , gpuTime = 7.724848031997681
batchSize = 50 , nBatch = 10000 , cpuTime = 4.814324617385864 , gpuTime = 3.956281900405884
batchSize = 100 , nBatch = 5000 , cpuTime = 4.489711046218872 , gpuTime = 2.366971969604492
batchSize = 250 , nBatch = 2000 , cpuTime = 4.938727378845215 , gpuTime = 1.6444478034973145
batchSize = 500 , nBatch = 1000 , cpuTime = 5.077881097793579 , gpuTime = 1.5218729972839355
batchSize = 1000 , nBatch = 500 , cpuTime = 4.937913656234741 , gpuTime = 1.419776201248169
batchSize = 2500 , nBatch = 200 , cpuTime = 5.141258001327515 , gpuTime = 1.3623504638671875
batchSize = 5000 , nBatch = 100 , cpuTime = 4.970717668533325 , gpuTime = 1.3436284065246582

Thanks.

K. Frank