Hello Forum!
I have a simple model whose training slows down when moved to
the gpu. Details follow, but first, here are the timings:
20,000 batch training iterations:
cpu: 23.93 secs.
gpu: 37.19 secs.
However, the gpu is not slower for all operations:
20,000 batch training iterations + 2,000 test evaluations:
cpu: 111.94 secs.
gpu: 61.47 secs.
Please note: this is with pytorch 0.3.0 on an old laptop
gpu (torch.cuda.get_device_name(0) = "Quadro K1100M").
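For reference, the timings are wall-clock measurements around
the whole loop, along the lines of this sketch (illustrative,
not my exact harness; the torch.cuda.synchronize() calls are
there because cuda kernels launch asynchronously, so the clock
should only be read after all queued gpu work has finished):
import time
import torch

torch.cuda.synchronize()   # drain any pending gpu work before starting the clock
start = time.time()
# ... run the 20,000 training iterations here ...
torch.cuda.synchronize()   # wait for all queued gpu kernels to finish
print('elapsed: %.2f secs.' % (time.time() - start))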
I don’t necessarily expect the gpu to be significantly faster
than the cpu, but I was surprised that it was this much slower.
Is this something I should expect? Are there known pitfalls
that can slow down the gpu that I should be looking for?
Is this likely to be a 0.3.0 or old gpu issue?
Now some details: This is a simple, single-hidden-layer
model that performs digit classification (not MNIST, but
similar). The training and test inputs are in a single
9298 x 256 FloatTensor, and the training and test labels
are in a single 9298 LongTensor. (The first 5000 items
are used for training and the remaining 4298 for test.)
The training batch size is 25, so one epoch is 5000 / 25 = 200
iterations; thus 20,000 predict-loss-backward-optimize training
steps (with a batch accuracy thrown in) correspond to 100
epochs. When testing is turned
on, after every ten training iterations, a predict-loss-
accuracy computation is performed on the full 4298-sample
test set, processed as a single batch. (The accuracy
computations do not contribute significantly to the timings.)
The full training / test tensors are moved to the gpu before
the training loop. Every epoch, a new random permutation of
indices into the training set, used to select random batches,
is generated on the cpu and moved to the gpu. (The repeated
random-index permutations do not contribute significantly to
the timings.)
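To make that concrete, here is a minimal, self-contained sketch
of the loop structure just described (written in current pytorch
idiom for brevity; the random data, learning rate, and SGD
optimizer are illustrative stand-ins, not my actual dataset or
settings):
import torch
import torch.nn as nn

inputs = torch.randn(9298, 256).cuda()          # stand-in for the real 9298 x 256 data
labels = torch.randint(0, 10, (9298,)).cuda()   # stand-in integer labels (LongTensor)
train_x, train_y = inputs[:5000], labels[:5000]
test_x, test_y = inputs[5000:], labels[5000:]

model = nn.Sequential(nn.Linear(256, 512), nn.Tanh(), nn.Linear(512, 10)).cuda()
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)   # optimizer is a stand-in

batch_size = 25
iters_per_epoch = 5000 // batch_size   # 200 iterations, so 100 epochs = 20,000 steps

for epoch in range(100):
    # new permutation each epoch, generated on the cpu and moved to the gpu
    perm = torch.randperm(5000).cuda()
    for i in range(iters_per_epoch):
        idx = perm[i * batch_size : (i + 1) * batch_size]
        loss = loss_fn(model(train_x[idx]), train_y[idx])
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        # when testing is turned on: the full 4298-sample test set,
        # processed as a single batch, after every ten iterations
        if (epoch * iters_per_epoch + i) % 10 == 9:
            with torch.no_grad():   # modern idiom; 0.3.0 used volatile Variables
                test_loss = loss_fn(model(test_x), test_y)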
Naively, the time to run the training and test should be simply
additive, so the test-only times can be backed out by
subtracting the training-only times from the combined timings.
Thus:
20,000 training iterations (batch size = 25):
cpu: 23.93 secs.
gpu: 37.19 secs.
2,000 test evaluations (batch size = 4298):
cpu: 88.01 secs.
gpu: 24.28 secs.
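(That is, for the cpu: 111.94 - 23.93 = 88.01 secs., and for
the gpu: 61.47 - 37.19 = 24.28 secs.)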
It makes sense to me that the large batch size might benefit
significantly from the gpu, but I’m surprised that the training
iterations are slower on the gpu, even though the batch size
is “only” 25. Remember, the entire dataset was moved to the
gpu before starting the training loop.
For completeness, here is my model:
model = nn.Sequential(
    nn.Linear(256, 512),   # (0): in_features=256, out_features=512
    nn.Tanh(),             # (1)
    nn.Linear(512, 10),    # (2): in_features=512, out_features=10
)
and I use nn.CrossEntropyLoss() for the loss.
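So the model outputs raw, unnormalized scores for the 10 digit
classes; nn.CrossEntropyLoss applies log-softmax internally and
takes the integer class labels directly. A one-step example
with illustrative data:
x = torch.randn(25, 256)                     # one batch of 25 inputs
y = torch.randint(0, 10, (25,))              # 25 integer class labels
loss = nn.CrossEntropyLoss()(model(x), y)    # logits in; no softmax layer needed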
Have I missed something somewhere that is slowing down the
gpu, or is this to be expected?
Thanks.
K. Frank