Extremely high output values (1e34) from an untrained network, despite input normalization

Hello, after experimenting with multiple off-the-shelf and written-from-scratch networks, I am starting to feel there is something wrong with my setup, without being able to understand what.
My network:
My network

class MySubPixelCNN(nn.Module):
    def __init__(self, upscale_factor, num_features):
        super(MySubPixelCNN, self).__init__()

        self.relu = nn.ReLU()
        self.conv1 = nn.Conv2d(num_features, 64, kernel_size=5, stride=1, padding=2)
        self.bb1 = nn.BatchNorm2d(64)
        self.conv2 = nn.Conv2d(64, 64, kernel_size=3, stride=1, padding=1)
        self.bb2 = nn.BatchNorm2d(64)
        self.conv3 = nn.Conv2d(64, 32, kernel_size=3, stride=1, padding=1)
        self.bb3 = nn.BatchNorm2d(32)
        # PixelShuffle(r) consumes num_features * r**2 channels
        self.conv4 = nn.Conv2d(32, num_features * upscale_factor ** 2, kernel_size=3, stride=1, padding=1)
        self.pixel_shuffle = nn.PixelShuffle(upscale_factor)

    def forward(self, x):
        x = self.conv1(x)
        x = self.bb1(self.relu(x))
        x = self.conv2(x)
        x = self.bb2(self.relu(x))
        x = self.conv3(x)
        x = self.bb3(self.relu(x))
        x = self.conv4(x)
        x = self.pixel_shuffle(x)
        return x

My input images are textures (i.e. not ordinary images), with a simple preprocessing step to bring them into [0, 1].
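For reference, the preprocessing is just a per-texture min-max rescaling, roughly like the following sketch (`to_unit_range` is an illustrative name, not my actual function):

```python
import numpy as np

def to_unit_range(texture: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Linearly rescale an array into [0, 1] via per-texture min-max."""
    t_min, t_max = texture.min(), texture.max()
    # eps guards against division by zero for constant textures
    return (texture - t_min) / (t_max - t_min + eps)

# Example: a texture with an arbitrary value range
tex = np.random.uniform(-50.0, 200.0, size=(64, 64)).astype(np.float32)
scaled = to_unit_range(tex)
```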
Yet, sometimes the output explodes (which makes the network immediately diverge):
model(input).max() > 1e30
As mentioned, I always add batch norm, and since the network is not that deep, I simply cannot understand what could have gone wrong.

Shouldn’t it be num_features * 2 ** upscale_factor?

No, why? Could that be the reason the results vary from 0 to 1e34?
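For context: `nn.PixelShuffle(r)` rearranges a `(N, C * r**2, H, W)` tensor into `(N, C, H*r, W*r)`, so the last conv must produce `num_features * upscale_factor ** 2` channels, not `num_features * 2 ** upscale_factor`. A pure-NumPy sketch of the rearrangement (so it can be run without PyTorch):

```python
import numpy as np

def pixel_shuffle(x: np.ndarray, r: int) -> np.ndarray:
    """NumPy re-implementation of nn.PixelShuffle(r):
    (N, C*r*r, H, W) -> (N, C, H*r, W*r)."""
    n, c_r2, h, w = x.shape
    assert c_r2 % (r * r) == 0, "channel count must be divisible by r**2"
    c = c_r2 // (r * r)
    x = x.reshape(n, c, r, r, h, w)
    # interleave the r x r sub-pixel grid into the spatial dimensions
    x = x.transpose(0, 1, 4, 2, 5, 3)
    return x.reshape(n, c, h * r, w * r)

# num_features = 3, upscale_factor = 2 -> conv4 outputs 3 * 2**2 = 12 channels
x = np.random.rand(1, 12, 8, 8)
y = pixel_shuffle(x, 2)  # shape (1, 3, 16, 16)
```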

I am now 100% certain that it is a problem with CUDA 11: when using another GPU I have with CUDA 10.1, it works fine. I cannot prove it, unfortunately, but I now realize that it started when I installed CUDA 11 and ran `conda install pytorch=10.2`, as previous posts here suggested that is fine. I assume this counts as a bug; how can I pin down the problem and report it?

Could you post some information about this setup, i.e.:

  • which GPU are you using?
  • how did you install CUDA11, and which version exactly?
  • which cudnn version are you using?
  • which PyTorch commit are you using?
  • are you seeing the extremely high loss after a certain number of iterations for random inputs in [0, 1]?
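To gather most of that in one go, something like the following works (a sketch that degrades gracefully if PyTorch or CUDA is unavailable):

```python
import importlib.util

def collect_versions() -> dict:
    """Collect the version info requested above; returns whatever is available."""
    info = {}
    if importlib.util.find_spec("torch") is None:
        info["torch"] = "not installed"
        return info
    import torch
    info["torch"] = torch.__version__              # PyTorch build
    info["cuda"] = torch.version.cuda              # CUDA version PyTorch was built with
    info["cudnn"] = torch.backends.cudnn.version() # cudnn version
    if torch.cuda.is_available():
        info["gpu"] = torch.cuda.get_device_name(0)
    return info

print(collect_versions())
```

Alternatively, `python -m torch.utils.collect_env` prints a full environment report suitable for pasting into a bug report.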