References/classification/train.py not learning (version v0.12.0)

jpcbertoldo · September 26, 2022, 10:04am

I am trying to train a resnet18 on ImageNet using the recipe from pytorch/vision, but it is not learning.

I am launching the script with torchrun references/classification/train.py (version v0.12.0) with the recommended hyperparameters from the readme, except for:

the number of GPUs (they recommend 8 but I only have 4);
the batch size (which I increased from 32 to 64 to compensate for having fewer GPUs);
I added --use-deterministic-algorithms.
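
To make the batch-size reasoning explicit, here is a minimal sketch of the arithmetic. It assumes --batch-size is counted per GPU/process, so the effective batch per optimizer step is the product of the two numbers. The helper name global_batch_size is only illustrative and is not part of the training script.

```python
def global_batch_size(num_gpus: int, per_gpu_batch: int) -> int:
    """Images seen per optimizer step, assuming one process per GPU
    and a per-GPU batch size (an assumption about the script)."""
    return num_gpus * per_gpu_batch


# Setup recommended in the readme: 8 GPUs, batch size 32 each.
readme_setup = global_batch_size(num_gpus=8, per_gpu_batch=32)

# My setup: 4 GPUs, batch size 64 each.
my_setup = global_batch_size(num_gpus=4, per_gpu_batch=64)

print(readme_setup, my_setup)
```

If that assumption holds, both setups process the same number of images per step, so keeping --lr 0.1 unchanged should be consistent with the recipe.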

Resulting command:

torchrun \
    --nproc_per_node=4 \
    train.py \
    --workers 4 \
    --batch-size 64 \
    --lr 0.1 \
    --momentum 0.9 \
    --weight-decay 1e-4 \
    --lr-step-size 30 \
    --lr-gamma 0.1 \
    --epochs 90 \
    --model resnet18 \
    --cache-dataset \
    --use-deterministic-algorithms \
    --data-path /.../imagenet

In fact, I also made minor modifications to the script to fix a manual seed and send logs to my wandb. Here is the diff.

I believe these changes shouldn't have had such a large impact, right?

However, the model doesn't seem to be learning, just overfitting. Here are the training curves for accu1 and accu5 on the train and val sets.