Standard Vision models not retraining properly?

Hey all,

I’m running some models from the model zoo, using these examples: https://github.com/pytorch/examples/tree/master/imagenet

Take alexnet as an example: I try to retrain it, but performance tanks dramatically. Using just this cloned repo, I run: `-a alexnet --lr 0.01 --pretrained [image_net location]`
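
As far as I can tell, those flags boil down to roughly this setup (a sketch of how I read the example's main.py, not the script itself; the momentum/weight-decay values are what I believe its defaults are):

```python
# Rough sketch of what I understand the flags map to inside the example:
# --pretrained loads the torchvision weights, --lr sets the SGD starting
# learning rate; momentum/weight decay are the script defaults as I read it.
import torch
import torchvision.models as models

model = models.alexnet(pretrained=True).cuda()
criterion = torch.nn.CrossEntropyLoss().cuda()
optimizer = torch.optim.SGD(model.parameters(),
                            lr=0.01,            # --lr 0.01
                            momentum=0.9,       # script default, I think
                            weight_decay=1e-4)  # script default, I think
```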

Performance starts off strong, but within a few iterations it already drops from ~80% (Prec@5) to around 63 and stays there. First few iterations of epoch 0:
```
Epoch: [0][0/5005] Time 8.089 (8.089) Data 3.601 (3.601) Loss 2.0473 (2.0473) Prec@1 54.297 (54.297) Prec@5 78.906 (78.906)
Epoch: [0][10/5005] Time 0.856 (1.183) Data 0.787 (0.611) Loss 2.4228 (2.2277) Prec@1 49.219 (50.639) Prec@5 71.484 (73.864)
Epoch: [0][20/5005] Time 1.752 (1.089) Data 1.675 (0.747) Loss 2.3623 (2.3440) Prec@1 47.266 (48.400) Prec@5 71.484 (71.540)
Epoch: [0][30/5005] Time 0.175 (1.029) Data 0.000 (0.773) Loss 2.9376 (2.4701) Prec@1 36.328 (45.867) Prec@5 62.891 (69.619)
Epoch: [0][40/5005] Time 1.708 (1.010) Data 1.632 (0.797) Loss 2.8086 (2.5262) Prec@1 39.453 (44.741) Prec@5 66.016 (68.807)
Epoch: [0][50/5005] Time 1.624 (0.993) Data 1.543 (0.805) Loss 2.6798 (2.5577) Prec@1 42.969 (44.118) Prec@5 62.109 (68.252)
Epoch: [0][60/5005] Time 0.312 (0.984) Data 0.249 (0.811) Loss 2.4672 (2.5890) Prec@1 47.266 (43.654) Prec@5 70.312 (67.802)
Epoch: [0][70/5005] Time 1.153 (0.974) Data 1.082 (0.812) Loss 2.5792 (2.6097) Prec@1 44.141 (43.277) Prec@5 70.703 (67.463)
Epoch: [0][80/5005] Time 0.200 (0.979) Data 0.018 (0.823) Loss 2.5768 (2.6335) Prec@1 44.141 (42.670) Prec@5 68.359 (67.086)
```
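
(For reference, the Prec@1 54.3 / Prec@5 78.9 at iteration 0 looks consistent with what the pretrained AlexNet weights score on their own. An eval-only check along these lines is what I mean by the starting accuracy; this is a sketch against a recent torchvision API, on 0.2.0 you'd use the Variable/volatile idiom instead of torch.no_grad(), and the val path is a placeholder.)

```python
# Sketch: score the untouched pretrained weights on the val set, in eval
# mode (dropout off, no weight updates). Recent API; placeholder data path.
import torch
import torchvision.datasets as datasets
import torchvision.models as models
import torchvision.transforms as transforms

normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                 std=[0.229, 0.224, 0.225])
val_loader = torch.utils.data.DataLoader(
    datasets.ImageFolder("/path/to/imagenet/val", transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        normalize,
    ])),
    batch_size=256, shuffle=False, num_workers=4)

model = models.alexnet(pretrained=True).cuda()
model.eval()

top1 = top5 = total = 0
with torch.no_grad():
    for images, target in val_loader:
        output = model(images.cuda())
        _, pred = output.topk(5, dim=1)               # top-5 predictions
        correct = pred.eq(target.cuda().view(-1, 1))  # [N, 5] hit mask
        top1 += correct[:, 0].sum().item()
        top5 += correct.any(dim=1).sum().item()
        total += images.size(0)
print("Prec@1 {:.3f}  Prec@5 {:.3f}".format(100.0 * top1 / total,
                                            100.0 * top5 / total))
```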

Things I tried:

  • Made sure the dataset is correct; it matches someone else’s instance on comparison.
  • Retrained alexnet from scratch; it’s currently at epoch 14 and Prec@5 is around 58. That’s lower than the ~63 from the retrained network, but I guess it should keep going up over the next 90 epochs (which takes a while, of course).
  • Changed learning rates. Lower learning rates give more stable learning, of course, but unless I drop the learning rate all the way to 1e-5, performance still falls a lot (see the sketch after this list for the kind of setup I mean).
  • Changed the optimization method; Adam seems to degrade results even more dramatically in the first few iterations.
  • I also tried this for inception v3 (my initial goal), which has the same problems.
  • Interestingly, the test scores end up higher than the train scores: at epoch 12 of the retraining run, the net scores around Prec@1 47.254 / Prec@5 72.312 on the test data, as opposed to Prec@1 37.891 (38.853) / Prec@5 64.062 (63.824) on the training data. That seems weird to me.
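
What I mean by changing learning rates for the fine-tuning is roughly the following (a sketch; the feature/classifier split into parameter groups is just one option, not something the example script does, and the values are illustrative):

```python
# Sketch of the kind of learning-rate changes I mean (illustrative values):
# either a single lower LR, or a smaller LR on the pretrained conv layers
# than on the classifier head via parameter groups.
import torch
import torchvision.models as models

model = models.alexnet(pretrained=True).cuda()

optimizer = torch.optim.SGD(
    [
        {"params": model.features.parameters(), "lr": 1e-4},    # pretrained convs
        {"params": model.classifier.parameters(), "lr": 1e-3},  # FC head
    ],
    momentum=0.9, weight_decay=1e-4)

# The Adam attempt was just a drop-in swap, e.g.:
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
```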

Anyone have an idea of what I’m missing here? I figured retraining/fine-tuning would just work out of the box. Perhaps there’s some small difference in things like the data preprocessing in this code, but then I wouldn’t expect performance to drop this hard in the first few iterations…
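
For completeness, this is the preprocessing I believe the example applies, so it’s clear what I mean by data preprocessing (transform names from current torchvision; older releases called these RandomSizedCrop/Scale):

```python
# ImageNet preprocessing as I understand the example does it (train side;
# validation uses Resize(256) + CenterCrop(224) + the same normalization).
import torchvision.transforms as transforms

normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                 std=[0.229, 0.224, 0.225])
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    normalize,
])
```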

Specs: Python 3.5.2, PyTorch 0.2.0_3.