When I run the ImageNet example code, however, the results are abysmal: barely 10% Acc@1 with the default settings. I'm sure using the exact parameters/optimizer from the paper would improve things, but something must be wrong for them to be this bad. Has anyone done this before? I suspect I may be missing a significant pre/post-processing step. It works as expected with resnet18, so I'm confident nothing is wrong with the data and it's something about replacing the resnet18 model with the MobileNet one.
Thanks for pointing me to that! I got it up and running, but it also doesn't seem to be working. Is the "torchrun" command necessary? My understanding is that it's just for running on multiple GPUs.
I’m running:
@ptrblck One step forward: the original code does learn with MobileNet when I keep the default SGD settings instead of RMSProp. Is it possible the RMSProp parameters were not added correctly?
I don’t know but @pmeier might know more about this model and how it was trained.
Based on this PR Vasilis added the model, but I cannot find his user name here in case he has an account.
I don’t know if this is useful or not but the original paper states:
“We train our models using synchronous training setup on 4x4 TPU Pod [24] using standard tensorflow RMSPropOptimizer with 0.9 momentum. We use the initial learning rate of 0.1, with batch size 4096 (128 images per chip), and learning rate decay rate of 0.01 every 3 epochs. We use dropout of 0.8, and l2 weight decay 1e-5 and the same image preprocessing as Inception [42]. Finally we use exponential moving average with decay 0.9999. All our convolutional layers use batch-normalization layers with average decay of 0.99.”
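For what it's worth, here is a rough PyTorch sketch of those quoted settings. Several caveats: TensorFlow's RMSPropOptimizer and torch.optim.RMSprop are not identical, the alpha value below is my assumption (the quote doesn't state TF's decay), and the LR of 0.1 was for batch size 4096, so it would need scaling for a smaller setup.

```python
from torch import nn, optim

# Stand-in model; in the real script this would be the MobileNet instance.
model = nn.Linear(10, 2)

# Rough translation of the quoted settings. TF's "decay" corresponds to
# PyTorch's alpha, and epsilon is applied differently in the two
# implementations, so this is only an approximation.
optimizer = optim.RMSprop(
    model.parameters(),
    lr=0.1,             # paper: initial LR 0.1, but at batch size 4096
    momentum=0.9,       # paper: 0.9 momentum
    weight_decay=1e-5,  # paper: l2 weight decay 1e-5
    alpha=0.9,          # assumed smoothing constant, not stated in the quote
)

# "learning rate decay rate of 0.01 every 3 epochs" read literally as a
# multiplicative StepLR; whether 0.01 is the factor or a 1% decay is
# ambiguous in the quote.
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=3, gamma=0.01)
```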
I tried with the following settings, which also end up not learning anything after 1 epoch.
@pytorcher How many GPUs are you running with? From
CUDA_VISIBLE_DEVICES=0
I guess 1? If so, your learning rate is way too high. We trained the model with 8 GPUs, so you should roughly divide the learning rate by 8. Or you could multiply the batch size by 8, but I guess your setup (and ours as well) cannot handle that.
This should get you at least into the right ballpark. However, all the other hyperparameters we used are tuned to our setup, so you will probably need to adjust them as well if you want to achieve the same performance.
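To make the scaling concrete, here's a tiny sketch of the linear LR scaling rule implied above. The base LR of 0.5 is just an illustrative number, not the actual recipe value:

```python
def scale_lr(base_lr: float, base_num_gpus: int, num_gpus: int) -> float:
    """Linearly scale a reference learning rate to a different GPU count,
    assuming the per-GPU batch size stays fixed, so the effective (summed)
    batch size changes proportionally with the number of processes."""
    return base_lr * num_gpus / base_num_gpus

# Reference recipe on 8 GPUs, local run on 1 GPU:
print(scale_lr(0.5, 8, 1))  # the reference LR divided by 8
```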
Yes! This was it! Thank you so much. I forgot that the per-process batches are summed and not averaged, so the "torchrun" I removed had been multiplying the effective batch size by 8, which makes a huge difference. Dividing the learning rate by 8 has put me on the right track. 30 epochs down and now at Test: Acc@1 29.944 Acc@5 55.040. Optimistic that'll get me to the right place; if not, at least learning is happening and I can play with the parameters to try to improve.