Hi,
I am trying to adapt https://github.com/pytorch/examples/tree/main/imagenet/main.py to ImageNet-100. Since my machine is not particularly well spec'd (2x RTX 2080) and full ImageNet takes a long time to train, I decided to use ImageNet-100 instead, which meant making a slight change to resnet.py (https://github.com/pytorch/vision/blob/main/torchvision/models/resnet.py). Please see my repo, https://github.com/rnangia/Imagenet100, which alters the ImageNet code from the PyTorch examples to work with ImageNet-100.
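For context, the change boils down to resizing the final fully connected layer to 100 classes. A minimal sketch of the equivalent (the torchvision constructors already expose num_classes, so this may even make my resnet.py edit unnecessary; resnet50 here is just an example arch):

import torchvision.models as models

# Build a ResNet whose final FC layer has 100 outputs instead of 1000.
model = models.resnet50(num_classes=100)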
I had NaN issues. I thought it might have something to do with an old version of PyTorch/CUDA/cuDNN, so I updated them. As it now stands, I am on CUDA 11.8, cuDNN 8.9, and torch 2.0.1+cu118.
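For reference, those versions are what the usual runtime checks report:

import torch

print(torch.__version__)               # 2.0.1+cu118
print(torch.version.cuda)              # 11.8
print(torch.backends.cudnn.version())  # an 89xx number, i.e. cuDNN 8.9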
So then I thought maybe it was a normalization issue. I checked https://github.com/danielchyeh/ImageNet-100-Pytorch (training ResNets on ImageNet-100 data), and it uses the same ImageNet normalization statistics for 100 classes as for the full 1000 classes.
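That is, the standard ImageNet per-channel mean/std, which is also what my code uses:

from torchvision import transforms

# Standard ImageNet statistics, reused unchanged for ImageNet-100.
normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                 std=[0.229, 0.224, 0.225])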
I am not sure where to go from here. The relevant part of the logs is attached below; the loss is stable around 3.4 until roughly iteration 450 of epoch 2 and then goes to NaN:
Epoch: [2][401/986] Time 0.391 ( 0.400) Data 0.000 ( 0.038) Loss 3.1367e+00 (3.4135e+00) Acc@1 25.00 ( 17.94) Acc@5 48.44 ( 44.55)
Epoch: [2][401/986] Time 0.391 ( 0.396) Data 0.000 ( 0.032) Loss 3.1370e+00 (3.4206e+00) Acc@1 25.00 ( 17.39) Acc@5 43.75 ( 44.02)
Epoch: [2][411/986] Time 0.405 ( 0.400) Data 0.000 ( 0.037) Loss 3.7681e+00 (3.4103e+00) Acc@1 15.62 ( 18.01) Acc@5 37.50 ( 44.63)
Epoch: [2][411/986] Time 0.404 ( 0.396) Data 0.000 ( 0.032) Loss 3.2360e+00 (3.4179e+00) Acc@1 23.44 ( 17.43) Acc@5 50.00 ( 44.07)
Epoch: [2][421/986] Time 0.404 ( 0.400) Data 0.000 ( 0.037) Loss 3.3991e+00 (3.4085e+00) Acc@1 14.06 ( 17.99) Acc@5 46.88 ( 44.65)
Epoch: [2][421/986] Time 0.402 ( 0.396) Data 0.000 ( 0.032) Loss 3.2751e+00 (3.4179e+00) Acc@1 20.31 ( 17.48) Acc@5 46.88 ( 44.10)
Epoch: [2][431/986] Time 0.389 ( 0.400) Data 0.000 ( 0.037) Loss 3.5405e+00 (3.4069e+00) Acc@1 17.19 ( 18.02) Acc@5 42.19 ( 44.69)
Epoch: [2][431/986] Time 0.391 ( 0.396) Data 0.000 ( 0.032) Loss 3.2519e+00 (3.4150e+00) Acc@1 28.12 ( 17.56) Acc@5 46.88 ( 44.15)
Epoch: [2][441/986] Time 0.383 ( 0.400) Data 0.000 ( 0.037) Loss 3.2058e+00 (3.4029e+00) Acc@1 17.19 ( 18.09) Acc@5 56.25 ( 44.83)
Epoch: [2][441/986] Time 0.384 ( 0.396) Data 0.000 ( 0.032) Loss 3.5359e+00 (3.4134e+00) Acc@1 18.75 ( 17.59) Acc@5 39.06 ( 44.25)
Epoch: [2][451/986] Time 0.372 ( 0.399) Data 0.000 ( 0.037) Loss nan (nan) Acc@1 1.56 ( 17.87) Acc@5 4.69 ( 44.34)
Epoch: [2][451/986] Time 0.370 ( 0.396) Data 0.000 ( 0.031) Loss nan (nan) Acc@1 0.00 ( 17.36) Acc@5 1.56 ( 43.77)
Epoch: [2][461/986] Time 0.367 ( 0.399) Data 0.000 ( 0.036) Loss nan (nan) Acc@1 1.56 ( 17.50) Acc@5 3.12 ( 43.51)
Epoch: [2][461/986] Time 0.366 ( 0.395) Data 0.000 ( 0.031) Loss nan (nan) Acc@1 6.25 ( 17.02) Acc@5 10.94 ( 42.94)
Epoch: [2][471/986] Time 0.367 ( 0.398) Data 0.000 ( 0.036) Loss nan (nan) Acc@1 1.56 ( 17.14) Acc@5 1.56 ( 42.69)
Epoch: [2][471/986] Time 0.366 ( 0.394) Data 0.000 ( 0.031) Loss nan (nan) Acc@1 0.00 ( 16.68) Acc@5 3.12 ( 42.14)
Epoch: [2][481/986] Time 0.367 ( 0.397) Data 0.000 ( 0.036) Loss nan (nan) Acc@1 0.00 ( 16.81) Acc@5 7.81 ( 41.89)
Epoch: [2][481/986] Time 0.366 ( 0.394) Data 0.000 ( 0.031) Loss nan (nan) Acc@1 3.12 ( 16.36) Acc@5 15.62 ( 41.40)
Epoch: [2][491/986] Time 0.368 ( 0.397) Data 0.000 ( 0.036) Loss nan (nan) Acc@1 0.00 ( 16.48) Acc@5 3.12 ( 41.12)
Epoch: [2][491/986] Time 0.366 ( 0.393) Data 0.000 ( 0.031) Loss nan (nan) Acc@1 1.56 ( 16.04) Acc@5 4.69 ( 40.66)
Epoch: [2][501/986] Time 0.367 ( 0.396) Data 0.000 ( 0.035) Loss nan (nan) Acc@1 0.00 ( 16.16) Acc@5 0.00 ( 40.38)
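In case it helps to narrow this down, the next diagnostic I am planning to add to the training loop is something along these lines (a sketch, reusing the variable names from the example's train() loop; the clipping value is a guess on my part):

import torch

# Flag the op that first produces a NaN/Inf during backward (slow, debug only).
torch.autograd.set_detect_anomaly(True)

loss = criterion(output, target)
if not torch.isfinite(loss):
    # Stop at the first non-finite loss so the offending batch can be inspected.
    raise RuntimeError(f"non-finite loss at iteration {i}: {loss.item()}")

optimizer.zero_grad()
loss.backward()
# A common mitigation while debugging: clip gradients before the update step.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
optimizer.step()

Any other suggestions on what to check would be appreciated.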