When I run the ImageNet example code, however, the results are abysmal: barely 10% Acc@1 with the default settings. I'm sure using the exact parameters/optimizer from the paper would improve things, but something must be wrong for them to be this bad. Has anyone done this before? I suspect I may be missing a significant pre/post-processing step. It works as expected with resnet18, so I'm confident nothing is wrong with the data and it's something about replacing the resnet18 model with the MobileNet one.
Thanks for pointing me to that! I got it up and running, but it also doesn't seem to be working. Is the "torchrun" command necessary? My understanding is that it's just for running on multiple GPUs.
I’m running:
@ptrblck One step forward: the original code does learn with MobileNet when I keep the default SGD settings instead of RMSProp. Is it possible the RMSProp parameters were not added correctly?
I don’t know but @pmeier might know more about this model and how it was trained.
Based on this PR Vasilis added the model, but I cannot find his user name here in case he has an account.
I don’t know if this is useful or not but the original paper states:
“We train our models using synchronous training setup on 4x4 TPU Pod [24] using standard tensorflow RMSPropOptimizer with 0.9 momentum. We use the initial learning rate of 0.1, with batch size 4096 (128 images per chip), and learning rate decay rate of 0.01 every 3 epochs. We use dropout of 0.8, and l2 weight decay 1e-5 and the same image preprocessing as Inception [42]. Finally we use exponential moving average with decay 0.9999. All our convolutional layers use batch-normalization layers with average decay of 0.99.”
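For what it's worth, here is a rough PyTorch sketch of those quoted settings. Several caveats: TensorFlow's RMSPropOptimizer and torch.optim.RMSprop are not identical, the alpha value below is my assumption (the quote doesn't state TF's decay), and the LR of 0.1 was for batch size 4096, so it would need scaling for a smaller setup.

```python
from torch import nn, optim

# Stand-in model; in the real script this would be the MobileNet instance.
model = nn.Linear(10, 2)

# Rough translation of the quoted settings. TF's "decay" corresponds to
# PyTorch's alpha, and epsilon is applied differently in the two
# implementations, so this is only an approximation.
optimizer = optim.RMSprop(
    model.parameters(),
    lr=0.1,             # paper: initial LR 0.1, but at batch size 4096
    momentum=0.9,       # paper: 0.9 momentum
    weight_decay=1e-5,  # paper: l2 weight decay 1e-5
    alpha=0.9,          # assumed smoothing constant, not stated in the quote
)

# "learning rate decay rate of 0.01 every 3 epochs" read literally as a
# multiplicative StepLR; whether 0.01 is the factor or a 1% decay is
# ambiguous in the quote.
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=3, gamma=0.01)
```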
I tried with the following settings, which also end up not learning anything after 1 epoch.
@pytorcher How many GPUs are you running with? From
CUDA_VISIBLE_DEVICES=0
I guess 1? If so, your learning rate is way too high. We trained the model with 8 GPUs, so you should roughly divide the learning rate by 8. Or you could multiply the batch size by 8, but I guess your setup (and ours as well) cannot handle that.
This should get you at least into the right ballpark. However, all the other hyperparameters we used are tuned to our setup, so you will probably need to adjust them as well if you want to achieve the same performance.
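To make the scaling concrete, here's a tiny sketch of the linear LR scaling rule implied above. The base LR of 0.5 is just an illustrative number, not the actual recipe value:

```python
def scale_lr(base_lr: float, base_num_gpus: int, num_gpus: int) -> float:
    """Linearly scale a reference learning rate to a different GPU count,
    assuming the per-GPU batch size stays fixed, so the effective (summed)
    batch size changes proportionally with the number of processes."""
    return base_lr * num_gpus / base_num_gpus

# Reference recipe on 8 GPUs, local run on 1 GPU:
print(scale_lr(0.5, 8, 1))  # the reference LR divided by 8
```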
Yes! This was it! Thank you so much. I forgot that the per-process batches are summed and not averaged, so the "torchrun" I removed had been multiplying the effective batch size by 8, which makes a huge difference. Dividing the learning rate by 8 has put me on the right track. 30 epochs down and now at Test: Acc@1 29.944 Acc@5 55.040. Optimistic that'll get me to the right place; if not, at least learning is happening and I can play with the parameters to try to improve.