I’m also not seeing much of a speed difference between the small MXNet model and the small PyTorch model. I made a few changes to your script:

- set `torch.backends.cudnn.benchmark = True` at the top
- removed the call to `test()`
- set the data loader to `num_workers=6` and `pin_memory=False`

I ran `python3 script-small.py --epochs 30` and `time python mnist.py --train --no-lsoftmax --gpu 0 --batch-size 256 --num-epoch 30`.
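For reference, the PyTorch-side changes above would look roughly like this (a minimal sketch — `train_dataset` is a random stand-in here so the snippet is self-contained; the real script loads MNIST):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# let cuDNN benchmark and pick the fastest conv kernels for the fixed input size
torch.backends.cudnn.benchmark = True

# stand-in for the MNIST training set (1024 fake 28x28 grayscale images)
train_dataset = TensorDataset(torch.randn(1024, 1, 28, 28),
                              torch.randint(0, 10, (1024,)))

train_loader = DataLoader(train_dataset,
                          batch_size=256,
                          shuffle=True,
                          num_workers=6,     # 6 parallel data-loading workers
                          pin_memory=False)  # pinned memory disabled for this run
```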
MXNet took 39 seconds to train 30 epochs; PyTorch took 47 seconds. The difference is largely because the PyTorch dataloader re-creates the 6 worker processes each epoch, which takes about 200ms per epoch. That’s not significant for most datasets, but MNIST is tiny. I’m running on an NVIDIA GP100 with `mxnet-cu90mkl --pre` and PyTorch master (which should have the same perf as 0.4).
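As a side note, newer PyTorch releases (1.7+) added a `persistent_workers` flag to `DataLoader` that keeps the worker processes alive between epochs instead of re-spawning them, which removes exactly this per-epoch startup cost. A minimal sketch (this flag did not exist in the versions benchmarked above; a small random dataset and 2 workers are used here to keep the demo light):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# small stand-in dataset: 256 fake 28x28 grayscale images
dataset = TensorDataset(torch.randn(256, 1, 28, 28),
                        torch.randint(0, 10, (256,)))

loader = DataLoader(dataset,
                    batch_size=128,
                    num_workers=2,
                    persistent_workers=True)  # workers survive across epochs

for epoch in range(2):             # both epochs reuse the same worker processes
    n_batches = sum(1 for _ in loader)
```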