Any reason why the following PyTorch (3s/epoch) code is so much slower than MXNet's version (~0.6s/epoch)?

I’m also not seeing much of a speed difference between the small MXNet model and the small PyTorch model. I made a few changes to your script (roughly sketched after the list):

  1. added torch.backends.cudnn.benchmark = True at the top
  2. removed the call to test()
  3. set the data loader to num_workers=6 and pin_memory=False
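
Roughly, those edits look like this. This is a minimal sketch; Net, train(), and the exact MNIST transform are stand-ins for whatever your script actually defines:

```python
import torch
import torch.backends.cudnn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# 1. let cuDNN benchmark and cache the fastest conv algorithms
#    (helps when input sizes are fixed, as with MNIST)
torch.backends.cudnn.benchmark = True

# 3. six loader workers, pinned memory off
train_loader = DataLoader(
    datasets.MNIST('./data', train=True, download=True,
                   transform=transforms.ToTensor()),
    batch_size=256,
    shuffle=True,
    num_workers=6,
    pin_memory=False,
)

for epoch in range(30):
    train(epoch)  # 2. test() is no longer called between epochs
```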

I ran python3 script-small.py --epochs 30 for PyTorch and time python mnist.py --train --no-lsoftmax --gpu 0 --batch-size 256 --num-epoch 30 for MXNet.

MXNet took 39 seconds to train 30 epochs; PyTorch took 47 seconds. The difference is largely because the PyTorch dataloader re-creates the 6 worker processes at the start of each epoch, which costs about 200 ms per epoch. That’s not significant for most datasets, but MNIST is tiny. I’m running on an NVIDIA GP100 with mxnet-cu90mkl --pre and PyTorch master (which should have the same performance as 0.4).
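
As an aside, on newer PyTorch releases (1.7 and later; this option did not exist in the 0.4 days) the worker re-creation overhead can be avoided with persistent_workers=True, which keeps the worker processes alive across epochs. A minimal sketch:

```python
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

train_loader = DataLoader(
    datasets.MNIST('./data', train=True, download=True,
                   transform=transforms.ToTensor()),
    batch_size=256,
    shuffle=True,
    num_workers=6,            # persistent_workers requires num_workers > 0
    persistent_workers=True,  # PyTorch 1.7+: reuse workers across epochs
)
```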

I am confused here. You say “I’m seeing ~4 seconds per epoch”, then in the next post you say 30 epochs take 47 seconds for PyTorch. Aren’t these two statements contradictory? I am also getting ~4s per epoch, which gives ~120s in total for 30 epochs (using the same script you use for profiling, to ensure there is no code difference). How are you getting 47s?
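
One thing worth ruling out before comparing wall-clock numbers: CUDA kernels launch asynchronously, so timing an epoch without synchronizing can misattribute where the time goes. A minimal sketch of how I am timing each epoch (train() here is a stand-in for the training function in the script):

```python
import time
import torch

for epoch in range(30):
    start = time.time()
    train(epoch)              # stand-in for the script's per-epoch training loop
    torch.cuda.synchronize()  # wait for queued GPU work before stopping the clock
    print('epoch %d: %.2fs' % (epoch, time.time() - start))
```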