ImageNet PyTorch example NaN issues

Hi,

I am trying to adapt https://github.com/pytorch/examples/tree/main/imagenet/main.py to ImageNet100. As my machine is not that highly spec'd (2x RTX 2080) and training on the full dataset takes a long time, I decided to use ImageNet100, which meant I needed to make a slight change to resnet.py (https://github.com/pytorch/vision/blob/main/torchvision/models/resnet.py). Please see GitHub - rnangia/Imagenet100: Altering Imagenet code from Pytorch examples to work with Imagenet100.
I ran into NaN issues. I thought it had something to do with an old version of PyTorch/CUDA/cuDNN, so I updated them; as it now stands, it is CUDA 11.8, cuDNN 8.9, torch 2.0.1+cu118.
I then thought it might be a normalization issue. I checked GitHub - danielchyeh/ImageNet-100-Pytorch: (Pytorch) Training ResNets on ImageNet-100 data, and he uses the same ImageNet normalization figures for 100 categories as for 1000.
I am lost as to where to go from here. Please see the attached logs.

Epoch: [2][401/986]	Time  0.391 ( 0.400)	Data  0.000 ( 0.038)	Loss 3.1367e+00 (3.4135e+00)	Acc@1  25.00 ( 17.94)	Acc@5  48.44 ( 44.55)
Epoch: [2][401/986]	Time  0.391 ( 0.396)	Data  0.000 ( 0.032)	Loss 3.1370e+00 (3.4206e+00)	Acc@1  25.00 ( 17.39)	Acc@5  43.75 ( 44.02)
Epoch: [2][411/986]	Time  0.405 ( 0.400)	Data  0.000 ( 0.037)	Loss 3.7681e+00 (3.4103e+00)	Acc@1  15.62 ( 18.01)	Acc@5  37.50 ( 44.63)
Epoch: [2][411/986]	Time  0.404 ( 0.396)	Data  0.000 ( 0.032)	Loss 3.2360e+00 (3.4179e+00)	Acc@1  23.44 ( 17.43)	Acc@5  50.00 ( 44.07)
Epoch: [2][421/986]	Time  0.404 ( 0.400)	Data  0.000 ( 0.037)	Loss 3.3991e+00 (3.4085e+00)	Acc@1  14.06 ( 17.99)	Acc@5  46.88 ( 44.65)
Epoch: [2][421/986]	Time  0.402 ( 0.396)	Data  0.000 ( 0.032)	Loss 3.2751e+00 (3.4179e+00)	Acc@1  20.31 ( 17.48)	Acc@5  46.88 ( 44.10)
Epoch: [2][431/986]	Time  0.389 ( 0.400)	Data  0.000 ( 0.037)	Loss 3.5405e+00 (3.4069e+00)	Acc@1  17.19 ( 18.02)	Acc@5  42.19 ( 44.69)
Epoch: [2][431/986]	Time  0.391 ( 0.396)	Data  0.000 ( 0.032)	Loss 3.2519e+00 (3.4150e+00)	Acc@1  28.12 ( 17.56)	Acc@5  46.88 ( 44.15)
Epoch: [2][441/986]	Time  0.383 ( 0.400)	Data  0.000 ( 0.037)	Loss 3.2058e+00 (3.4029e+00)	Acc@1  17.19 ( 18.09)	Acc@5  56.25 ( 44.83)
Epoch: [2][441/986]	Time  0.384 ( 0.396)	Data  0.000 ( 0.032)	Loss 3.5359e+00 (3.4134e+00)	Acc@1  18.75 ( 17.59)	Acc@5  39.06 ( 44.25)
Epoch: [2][451/986]	Time  0.372 ( 0.399)	Data  0.000 ( 0.037)	Loss nan (nan)	Acc@1   1.56 ( 17.87)	Acc@5   4.69 ( 44.34)
Epoch: [2][451/986]	Time  0.370 ( 0.396)	Data  0.000 ( 0.031)	Loss nan (nan)	Acc@1   0.00 ( 17.36)	Acc@5   1.56 ( 43.77)
Epoch: [2][461/986]	Time  0.367 ( 0.399)	Data  0.000 ( 0.036)	Loss nan (nan)	Acc@1   1.56 ( 17.50)	Acc@5   3.12 ( 43.51)
Epoch: [2][461/986]	Time  0.366 ( 0.395)	Data  0.000 ( 0.031)	Loss nan (nan)	Acc@1   6.25 ( 17.02)	Acc@5  10.94 ( 42.94)
Epoch: [2][471/986]	Time  0.367 ( 0.398)	Data  0.000 ( 0.036)	Loss nan (nan)	Acc@1   1.56 ( 17.14)	Acc@5   1.56 ( 42.69)
Epoch: [2][471/986]	Time  0.366 ( 0.394)	Data  0.000 ( 0.031)	Loss nan (nan)	Acc@1   0.00 ( 16.68)	Acc@5   3.12 ( 42.14)
Epoch: [2][481/986]	Time  0.367 ( 0.397)	Data  0.000 ( 0.036)	Loss nan (nan)	Acc@1   0.00 ( 16.81)	Acc@5   7.81 ( 41.89)
Epoch: [2][481/986]	Time  0.366 ( 0.394)	Data  0.000 ( 0.031)	Loss nan (nan)	Acc@1   3.12 ( 16.36)	Acc@5  15.62 ( 41.40)
Epoch: [2][491/986]	Time  0.368 ( 0.397)	Data  0.000 ( 0.036)	Loss nan (nan)	Acc@1   0.00 ( 16.48)	Acc@5   3.12 ( 41.12)
Epoch: [2][491/986]	Time  0.366 ( 0.393)	Data  0.000 ( 0.031)	Loss nan (nan)	Acc@1   1.56 ( 16.04)	Acc@5   4.69 ( 40.66)
Epoch: [2][501/986]	Time  0.367 ( 0.396)	Data  0.000 ( 0.035)	Loss nan (nan)	Acc@1   0.00 ( 16.16)	Acc@5   0.00 ( 40.38)

Check where the NaN value is created, e.g. whether the forward pass of the model overflows or whether it happens in the loss calculation, by printing the values of intermediate tensors. Once you have narrowed down the operation creating the NaN (or Inf) value, you might get a better idea of what is happening.
Sometimes the input values already contain invalid values, so you might start with those. torch.isfinite(tensor).all() might be useful for your tests.
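
For example, a minimal sketch of such checks inside the training loop (model, criterion, and train_loader follow the naming of the ImageNet example; the print messages are illustrative):

    for i, (images, target) in enumerate(train_loader):
        images = images.cuda(non_blocking=True)
        target = target.cuda(non_blocking=True)

        # verify the input batch is valid before running the model
        if not torch.isfinite(images).all():
            print(f"non-finite input at iteration {i}")

        output = model(images)
        # check the raw model output before the loss is computed
        if not torch.isfinite(output).all():
            print(f"non-finite output at iteration {i}")

        loss = criterion(output, target)
        if not torch.isfinite(loss).all():
            print(f"non-finite loss at iteration {i}")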

I used the following block to test where the output goes to NaN:

    # helper added to torchvision's ResNet class; `import os` is needed at the top of resnet.py
    def checking_iffinite(self, x, pp):
        # log the layer tag whenever the tensor contains NaN/Inf values
        if not torch.isfinite(x).all():
            print("******Not finite*******", pp)
            with open(os.path.join('outputs/jigpz_pretrain/run1', 'log.txt'), 'a') as fl:
                fl.write("******Not finite*******" + pp + '\n')

    def _forward_impl(self, x: Tensor) -> Tensor:
        # See note [TorchScript super()]
        x = self.conv1(x)
        self.checking_iffinite(x, "_forward_impl conv1")
        x = self.bn1(x)
        self.checking_iffinite(x, "_forward_impl bn1")
        # ... same check after each remaining layer
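
The forward of the Bottleneck block was instrumented in the same way. A sketch of the checkpoints, reconstructed from the log tags below (the exact placement of each call is an assumption):

    def forward(self, x: Tensor) -> Tensor:
        identity = x
        self.checking_iffinite(identity, "Bottleneck identity")

        out = self.conv1(x)
        self.checking_iffinite(out, "Bottleneck conv1")
        out = self.bn1(out)
        self.checking_iffinite(out, "Bottleneck bn1")
        out = self.relu(out)
        self.checking_iffinite(out, "Bottleneck relu")

        out = self.conv2(out)
        self.checking_iffinite(out, "Bottleneck conv2")
        out = self.bn2(out)
        self.checking_iffinite(out, "Bottleneck bn2")
        out = self.relu(out)
        self.checking_iffinite(out, "Bottleneck relu")

        out = self.conv3(out)
        self.checking_iffinite(out, "Bottleneck conv3")
        out = self.bn3(out)
        self.checking_iffinite(out, "Bottleneck bn3")

        if self.downsample is not None:
            identity = self.downsample(x)

        out += identity
        self.checking_iffinite(out, "Bottleneck out")
        out = self.relu(out)
        self.checking_iffinite(out, "Bottleneck out relu")

        return out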

With this in place, I found where the output goes to NaN:

******Not finite*******Bottleneck identity
******Not finite*******Bottleneck identity
******Not finite*******Bottleneck conv1
******Not finite*******Bottleneck conv1
******Not finite*******Bottleneck bn1
******Not finite*******Bottleneck relu
******Not finite*******Bottleneck bn1
******Not finite*******Bottleneck relu
******Not finite*******Bottleneck conv2
******Not finite*******Bottleneck bn2
******Not finite*******Bottleneck conv2
******Not finite*******Bottleneck relu
******Not finite*******Bottleneck bn2
******Not finite*******Bottleneck relu
******Not finite*******Bottleneck conv3
******Not finite*******Bottleneck conv3
******Not finite*******Bottleneck bn3
******Not finite*******Bottleneck out
******Not finite*******Bottleneck bn3
******Not finite*******Bottleneck out relu
******Not finite*******Bottleneck out
******Not finite*******_forward_impl layer2

It goes to NaN in the forward implementation at layer2. What should I do next?

Did you check the actual input as well? Try to narrow it down to a single operation that receives valid inputs and creates invalid ones.
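
One way to narrow it down without editing resnet.py is to register a forward hook on every submodule and flag the first operation that turns finite inputs into a non-finite output; a minimal sketch (the helper name is illustrative):

    import torch

    def add_nan_hooks(model):
        # flag the first module that receives finite inputs
        # but produces a non-finite output
        def make_hook(name):
            def hook(module, inputs, output):
                if not isinstance(output, torch.Tensor):
                    return
                finite_in = all(torch.isfinite(t).all()
                                for t in inputs if isinstance(t, torch.Tensor))
                if finite_in and not torch.isfinite(output).all():
                    print(f"non-finite output created in {name} "
                          f"({module.__class__.__name__})")
            return hook

        for name, module in model.named_modules():
            module.register_forward_hook(make_hook(name))

    # usage: call once before the training loop starts
    # add_nan_hooks(model)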

Hello again,
I had a slight typo in the error-tracing code. After correcting it, I found the place where the error is occurring: it seems the batch normalization, i.e. nn.BatchNorm2d, in the forward of the Bottleneck block of ResNet is the culprit. Where to from here?

Epoch: [1][301/986]	Time  0.532 ( 0.539)	Data  0.000 ( 0.033)	Loss 3.7381e+00 (3.9519e+00)	Acc@1   9.38 (  9.18)	Acc@5  39.06 ( 28.29)
Epoch: [1][311/986]	Time  0.538 ( 0.544)	Data  0.000 ( 0.039)	Loss 3.7849e+00 (3.9466e+00)	Acc@1  14.06 (  9.09)	Acc@5  25.00 ( 29.02)
Epoch: [1][311/986]	Time  0.536 ( 0.539)	Data  0.000 ( 0.032)	Loss 3.8315e+00 (3.9480e+00)	Acc@1   9.38 (  9.20)	Acc@5  31.25 ( 28.37)
******Not finite*******Bottleneck bn1
******Not finite*******Bottleneck relu
******Not finite*******Bottleneck bn2
******Not finite*******Bottleneck relu
******Not finite*******Bottleneck conv3
******Not finite*******Bottleneck bn3
******Not finite*******Bottleneck out
******Not finite*******Bottleneck out relu
******Not finite*******Bottleneck identity
******Not finite*******Bottleneck conv1
******Not finite*******Bottleneck bn1
******Not finite*******Bottleneck relu
******Not finite*******Bottleneck bn2

******Not finite*******_forward_impl layer3
******Not finite*******_forward_impl layer4
******Not finite*******_forward_impl avgpool
******Not finite*******_forward_impl flatten
******Not finite*******_forward_impl fc
Output in Train is not finite ::: 315iteration