Standard Vision models not retraining properly?

Hey all,

I’m running some models from the model zoo, using these examples: https://github.com/pytorch/examples/tree/master/imagenet

Take alexnet as an example: I try to retrain it, but performance tanks dramatically. Using just this cloned repo, I run: `-a alexnet --lr 0.01 --pretrained [image_net location]`
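
As far as I can tell, those flags boil down to roughly this setup (a sketch of how I read the example's main.py, not the script itself; the momentum/weight-decay values are what I believe its defaults are):

```python
# Rough sketch of what I understand the flags map to inside the example:
# --pretrained loads the torchvision weights, --lr sets the SGD starting
# learning rate; momentum/weight decay are the script defaults as I read it.
import torch
import torchvision.models as models

model = models.alexnet(pretrained=True).cuda()
criterion = torch.nn.CrossEntropyLoss().cuda()
optimizer = torch.optim.SGD(model.parameters(),
                            lr=0.01,            # --lr 0.01
                            momentum=0.9,       # script default, I think
                            weight_decay=1e-4)  # script default, I think
```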

Performance starts off strong, but within a few iterations it already drops from ~80% (Prec@5) to around 63 and stays there. First few iterations of epoch 0:
```
Epoch: [0][0/5005] Time 8.089 (8.089) Data 3.601 (3.601) Loss 2.0473 (2.0473) Prec@1 54.297 (54.297) Prec@5 78.906 (78.906)
Epoch: [0][10/5005] Time 0.856 (1.183) Data 0.787 (0.611) Loss 2.4228 (2.2277) Prec@1 49.219 (50.639) Prec@5 71.484 (73.864)
Epoch: [0][20/5005] Time 1.752 (1.089) Data 1.675 (0.747) Loss 2.3623 (2.3440) Prec@1 47.266 (48.400) Prec@5 71.484 (71.540)
Epoch: [0][30/5005] Time 0.175 (1.029) Data 0.000 (0.773) Loss 2.9376 (2.4701) Prec@1 36.328 (45.867) Prec@5 62.891 (69.619)
Epoch: [0][40/5005] Time 1.708 (1.010) Data 1.632 (0.797) Loss 2.8086 (2.5262) Prec@1 39.453 (44.741) Prec@5 66.016 (68.807)
Epoch: [0][50/5005] Time 1.624 (0.993) Data 1.543 (0.805) Loss 2.6798 (2.5577) Prec@1 42.969 (44.118) Prec@5 62.109 (68.252)
Epoch: [0][60/5005] Time 0.312 (0.984) Data 0.249 (0.811) Loss 2.4672 (2.5890) Prec@1 47.266 (43.654) Prec@5 70.312 (67.802)
Epoch: [0][70/5005] Time 1.153 (0.974) Data 1.082 (0.812) Loss 2.5792 (2.6097) Prec@1 44.141 (43.277) Prec@5 70.703 (67.463)
Epoch: [0][80/5005] Time 0.200 (0.979) Data 0.018 (0.823) Loss 2.5768 (2.6335) Prec@1 44.141 (42.670) Prec@5 68.359 (67.086)
```
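
(For reference, the Prec@1 54.3 / Prec@5 78.9 at iteration 0 looks consistent with what the pretrained AlexNet weights score on their own. An eval-only check along these lines is what I mean by the starting accuracy; this is a sketch against a recent torchvision API, on 0.2.0 you'd use the Variable/volatile idiom instead of torch.no_grad(), and the val path is a placeholder.)

```python
# Sketch: score the untouched pretrained weights on the val set, in eval
# mode (dropout off, no weight updates). Recent API; placeholder data path.
import torch
import torchvision.datasets as datasets
import torchvision.models as models
import torchvision.transforms as transforms

normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                 std=[0.229, 0.224, 0.225])
val_loader = torch.utils.data.DataLoader(
    datasets.ImageFolder("/path/to/imagenet/val", transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        normalize,
    ])),
    batch_size=256, shuffle=False, num_workers=4)

model = models.alexnet(pretrained=True).cuda()
model.eval()

top1 = top5 = total = 0
with torch.no_grad():
    for images, target in val_loader:
        output = model(images.cuda())
        _, pred = output.topk(5, dim=1)               # top-5 predictions
        correct = pred.eq(target.cuda().view(-1, 1))  # [N, 5] hit mask
        top1 += correct[:, 0].sum().item()
        top5 += correct.any(dim=1).sum().item()
        total += images.size(0)
print("Prec@1 {:.3f}  Prec@5 {:.3f}".format(100.0 * top1 / total,
                                            100.0 * top5 / total))
```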

Things I tried:

  • Made sure the dataset is correct; it matches someone else’s instance on comparison.
  • Retrained alexnet from scratch; it’s currently at epoch 14 and Prec@5 is around 58. That’s lower than the ~63 from the retrained network, but I guess it should keep going up over the next 90 epochs (which takes a while, of course).
  • Changed learning rates. Lower learning rates give more stable learning, of course, but unless I drop the learning rate all the way to 1e-5, performance still falls a lot (see the sketch after this list for the kind of setup I mean).
  • Changed the optimization method; Adam seems to degrade results even more dramatically in the first few iterations.
  • I also tried this for inception v3 (my initial goal), which has the same problems.
  • Interestingly, the test scores end up higher than the train scores: at epoch 12 of the retraining run, the net scores around Prec@1 47.254 / Prec@5 72.312 on the test data, as opposed to Prec@1 37.891 (38.853) / Prec@5 64.062 (63.824) on the training data. That seems weird to me.
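
What I mean by changing learning rates for the fine-tuning is roughly the following (a sketch; the feature/classifier split into parameter groups is just one option, not something the example script does, and the values are illustrative):

```python
# Sketch of the kind of learning-rate changes I mean (illustrative values):
# either a single lower LR, or a smaller LR on the pretrained conv layers
# than on the classifier head via parameter groups.
import torch
import torchvision.models as models

model = models.alexnet(pretrained=True).cuda()

optimizer = torch.optim.SGD(
    [
        {"params": model.features.parameters(), "lr": 1e-4},    # pretrained convs
        {"params": model.classifier.parameters(), "lr": 1e-3},  # FC head
    ],
    momentum=0.9, weight_decay=1e-4)

# The Adam attempt was just a drop-in swap, e.g.:
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
```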

Anyone have an idea of what I’m missing here? I figured retraining/fine-tuning would just work out of the box. Perhaps there’s some small difference in things like the data preprocessing in this code, but then I wouldn’t expect performance to drop this hard in the first few iterations…
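
For completeness, this is the preprocessing I believe the example applies, so it’s clear what I mean by data preprocessing (transform names from current torchvision; older releases called these RandomSizedCrop/Scale):

```python
# ImageNet preprocessing as I understand the example does it (train side;
# validation uses Resize(256) + CenterCrop(224) + the same normalization).
import torchvision.transforms as transforms

normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                 std=[0.229, 0.224, 0.225])
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    normalize,
])
```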

Specs: Python 3.5.2, PyTorch 0.2.0_3.