Different Losses on 2 different machines

halahup · February 5, 2019, 4:24pm

Hi, I came across a problem:

I am running the same model on 2 different machines: one has a single 1080Ti GPU, and the other has 2x RTX2080Ti GPUs, although the model only runs on a single 2080Ti.

The problem is that the model runs just fine on my single 1080Ti GPU, however, when the model is run on the other machine I get nan for losses:

If anyone has experienced this weird behavior, please advise on what the problem may be.

PS. Both machines are running PyTorch 1.0.0, CUDA 10 and cuDNN 7.4.

ptrblck · February 5, 2019, 5:38pm

I assume you are using exactly the same dataset? Are you using FP16 precision in your models? Are both machines running the same OS?

halahup · February 5, 2019, 6:08pm

Hi, yes, exactly the same datasets. The tensors are all float32, so single precision on both machines, and the OS is the same: Windows 10 Pro.

ptrblck · February 5, 2019, 10:34pm

Thanks for the info!
Are the other values (LAE and ELoss) for both runs comparable before they get nan using your RTX2080Ti?
Since DLoss is immediately nan for the first batch, could you check for invalid input values using torch.isnan and torch.isinf?
If you can’t spot anything, would it be possible to post a (small) executable code snippet?
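A minimal sketch of such an input check, assuming the batch is an ordinary float tensor. The helper name report_invalid and the toy batch are hypothetical and only stand in for a real batch from the data loader:

```python
import torch

def report_invalid(name, tensor):
    """Return True if the tensor contains nan or inf values, printing a short summary."""
    n_nan = torch.isnan(tensor).sum().item()
    n_inf = torch.isinf(tensor).sum().item()
    if n_nan or n_inf:
        print(f"{name}: {n_nan} nan, {n_inf} inf values")
        return True
    return False

# Hypothetical batch with two deliberately corrupted entries
batch = torch.randn(4, 3, 8, 8)
batch[0, 0, 0, 0] = float("nan")
batch[1, 0, 0, 0] = float("inf")
has_invalid = report_invalid("input batch", batch)
```

Calling the same helper on the targets and on the loss inputs before the first forward pass would show whether the nan is already present in the data or only appears inside the model.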

anima · February 6, 2019, 12:17am

Maybe one of your models hits an exploding gradient, so some of your weights overflow, which eventually turns your values into nan. Could you please

- try lowering the weight initialization scale and the learning rate, or
- normalize your data values?

https://stackoverflow.com/questions/51033066/pytorch-loss-inf-nan

halahup · February 6, 2019, 12:59am

Yes, the LAE and the ELoss are pretty similar in the first epoch on both machines. I noticed that activations and gradients blow up in the very first batch for no apparent reason. I will post a screenshot tomorrow, since the more powerful machine is my work machine.

I will try to code up a little snippet tomorrow for the encoder part of the model so you could try to reproduce the behavior. Thank you for your help.

halahup · February 6, 2019, 1:00am

The point is that the model is exactly the same on both machines. One works just fine, while the other one consistently blows up. I will post more screenshots tomorrow.

halahup · February 6, 2019, 4:59pm

So we are using a residual block for our model, the block looks like this:

import torch.nn as nn
from torch.nn.init import kaiming_normal_

class ResidualBlock(nn.Module):
    def __init__(self, in_channels, out_channels, triplet=True):
        super(ResidualBlock, self).__init__()
        
        # conv layers
        self.conv_res = nn.Conv2d(in_channels, out_channels, kernel_size=1, padding=0)
        self.conv_1 = nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1)
        self.conv_2 = nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1)
        
        # initialization
        kaiming_normal_(self.conv_res.weight, nonlinearity='relu')
        kaiming_normal_(self.conv_1.weight, nonlinearity='relu')
        kaiming_normal_(self.conv_2.weight, nonlinearity='relu')
        
        # batch norm layers
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.bn2 = nn.BatchNorm2d(out_channels)
        self.bn3 = nn.BatchNorm2d(out_channels)
        
        # activations 
        self.relu = nn.ReLU()

        # flags
        self.triplet = triplet

    def forward(self, x):
        
        # out placeholder
        out = None
        
        # if the block is a triplet
        if self.triplet:
            
            # first stage
            out_1 = self.conv_1(x)
            out_1 = self.bn1(out_1)
            out_1 = self.relu(out_1)
            
            # second stage
            out_2 = self.conv_2(out_1)
            out_2 = self.bn2(out_2)
            
            # third stage - residual connection
            out_res = self.conv_res(x)

        # if the block is a twin
        else:
            
            # first stage
            out_1 = self.conv_1(x)
            out_1 = self.bn1(out_1)
            out_1 = self.relu(out_1)
            
            # second stage
            out_2 = self.conv_2(out_1)
            out_2 = self.bn2(out_2)
            
        # add the activations
        if self.triplet:
            out = out_2 + out_res
        else:
            out = out_2 + x

        # final activation function
        out = self.relu(out)
            
        return out

The model is basically a stack of these residual blocks. Here is the encoder part of it:

class Encoder(nn.Module):
    def __init__(self):
        super(Encoder, self).__init__()
        
        # conv layer
        # using 3 by 3 receptive field
        self.conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, stride=1, padding=1)
        
        # initialize the conv layer
        kaiming_normal_(self.conv.weight.data, nonlinearity='relu')  # init with kaiming he

        # residual blocks
        self.res1 = ResidualBlock(in_channels=16, out_channels=32)
        self.res2 = ResidualBlock(in_channels=32, out_channels=64)
        self.res3 = ResidualBlock(in_channels=64, out_channels=128)
        self.res4 = ResidualBlock(in_channels=128, out_channels=256)
        self.res5 = ResidualBlock(in_channels=256, out_channels=512)

        # fully connected layer
        self.fc1024 = nn.Linear(in_features=8192, out_features=1024)
        
        # initialize the fc
        kaiming_normal_(self.fc1024.weight.data, nonlinearity='relu')
        
        # pool layer
        self.pool = nn.AvgPool2d(kernel_size=2)
        
    def forward(self, x):
        
        # x is 3x256x256
        
        # input block
        out = self.conv(x)    # 16x256x256
        out = self.pool(out)  # 16x128x128
        
        print("CONV MAX ACT: ", out.max())
        
        # residual block 1
        out = self.res1(out)  # 32x128x128
        out = self.pool(out)  # 32x64x64
        
        print("RES1 MAX ACT: ", out.max())
        
        # residual block 2
        out = self.res2(out)  # 64x64x64
        out = self.pool(out)  # 64x32x32
        
        print("RES2 MAX ACT: ", out.max())
        
        # residual block 3
        out = self.res3(out)  # 128x32x32
        out = self.pool(out)  # 128x16x16
        
        print("RES3 MAX ACT: ", out.max())
        
        # residual block 4
        out = self.res4(out)  # 256x16x16
        out = self.pool(out)  # 256x8x8
        
        print("RES4 MAX ACT: ", out.max())
        
        # residual block 5
        out = self.res5(out)  # 512x8x8
        out = self.pool(out)  # 512x4x4
        
        print("RES5 MAX ACT: ", out.max())
        
        print("CONV MAX WEIGHT: ", self.conv.weight.max(), self.conv.bias.max())
        print("RES1 MAX WEIGHTS: ", self.res1.conv_1.weight.max(), self.res1.conv_2.weight.max(), self.res1.conv_res.weight.max())
        print("RES2 MAX WEIGHTS: ", self.res2.conv_1.weight.max(), self.res2.conv_2.weight.max(), self.res2.conv_res.weight.max())
        print("RES3 MAX WEIGHTS: ", self.res3.conv_1.weight.max(), self.res3.conv_2.weight.max(), self.res3.conv_res.weight.max())
        print("RES4 MAX WEIGHTS: ", self.res4.conv_1.weight.max(), self.res4.conv_2.weight.max(), self.res4.conv_res.weight.max())
        print("RES5 MAX WEIGHTS: ", self.res5.conv_1.weight.max(), self.res5.conv_2.weight.max(), self.res5.conv_res.weight.max())
        
        
        print("OUT_PRE MAX: ", out.max())
        
        # reshape
        out = out.view((out.shape[0], -1))

        # fully connected
        out = self.fc1024(out)
        
        print("OUT_SHAPE: ", out.shape, "OUT: ", out, "OUT MAX: ", out.max())
        
        # get the mean and log-variance
        mu = out[:, :512]
        logvar = out[:, 512:]
        
        # reparametrization trick
        z = self.reparameterize(mu, logvar)
        
        # return the latent vector, mu and logvar
        return z, mu, logvar

The print statements produce the following stats:

This is on the first batch of the first epoch.
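As a rough alternative to the manual print statements, forward hooks can record which module first produces a non-finite output. This is only a sketch: attach_finite_checks is a hypothetical helper, and the small nn.Sequential is a toy stand-in for the encoder, fed a deliberately corrupted input.

```python
import torch
import torch.nn as nn

def attach_finite_checks(model):
    """Register forward hooks that record modules whose output contains nan/inf."""
    bad_modules = []

    def make_hook(name):
        def hook(module, inputs, output):
            if isinstance(output, torch.Tensor) and not torch.isfinite(output).all():
                bad_modules.append(name)
        return hook

    # Skip the root module (empty name) and hook every submodule
    handles = [m.register_forward_hook(make_hook(n))
               for n, m in model.named_modules() if n]
    return bad_modules, handles

# Toy stand-in for the encoder
toy = nn.Sequential(nn.Conv2d(3, 4, kernel_size=3, padding=1), nn.ReLU())
bad, handles = attach_finite_checks(toy)

x = torch.randn(2, 3, 8, 8)
x[0, 0, 0, 0] = float("inf")  # corrupt one input value on purpose
toy(x)
print("first module with non-finite output:", bad[0] if bad else None)

# Remove the hooks once debugging is done
for h in handles:
    h.remove()
```

The recorded names follow execution order, so the first entry points at the earliest layer where the values stop being finite.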

ptrblck · February 6, 2019, 5:12pm

Thanks for this part of the code!
Could you try to save the state_dict of your model running on the 1080Ti machine and load it on the 2080Ti one?
If you also get nan values, we would have to locate the exact code where the nan is generated.
torch.autograd.detect_anomaly might be helpful for this.
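A minimal sketch of both steps, under stated assumptions: nn.Linear is a toy stand-in for the real model, and an in-memory buffer replaces the file that would actually be copied between the machines. The sqrt expression is a contrived example chosen only because its backward pass produces nan.

```python
import io
import torch
import torch.nn as nn

# Toy stand-in for the real model
model = nn.Linear(4, 2)

# Save the weights; across machines this would be a file path instead of a buffer
buffer = io.BytesIO()
torch.save(model.state_dict(), buffer)
buffer.seek(0)

# On the other machine: rebuild the same architecture and load the weights
clone = nn.Linear(4, 2)
clone.load_state_dict(torch.load(buffer))

# Anomaly detection: backward raises at the op whose gradient became nan
x = torch.zeros(1, requires_grad=True)
caught = False
with torch.autograd.detect_anomaly():
    # The backward of sqrt divides the incoming gradient (0) by 2*sqrt(x) (0): nan
    y = torch.sqrt(x) * 0
    try:
        y.sum().backward()
    except RuntimeError as err:
        caught = True
        print("anomaly detected:", err)
```

Anomaly detection slows training down noticeably, so it is usually enabled only while hunting for the failing operation.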

halahup · February 6, 2019, 6:59pm

Thank you for the suggestions. Loading the model with the state dict from the working model yields the same issues.

Here is the exception raised by detect_anomaly:

rasbt · February 6, 2019, 7:09pm

Before looking into it further, to rule out randomness, have you

- used a fixed random seed for weight initialization and minibatch shuffling?
- used the same algorithms (e.g., the same convolution operation), i.e., torch.backends.cudnn.deterministic = True?

EDIT: Another thing to check

Yes, the LAE and the ELoss are pretty similar in the first epoch on both machines. I noticed that activations and gradients blow up in the very first batch for no apparent reason,

I’ve seen that before on old PyTorch versions with log_softmax (I think there was a bug fix at some point). Are both machines running PyTorch 1.0?

halahup · February 6, 2019, 7:12pm

Hi, I have not, however, the differences are astronomical and I don’t think it is attributed to random seed. Plus, as @ptrblck suggested, I loaded the state from my working model, and it still produced the same issue.

rasbt · February 6, 2019, 7:14pm

Hi, I have not, however, the differences are astronomical and I don’t think it is attributed to random seed.

I’ve seen that happen before. The same code sometimes runs fine and sometimes doesn't, depending on the shuffling order. It really depends on the architecture, but exploding or vanishing gradient problems can easily occur if you have long architectures and just one small or big multiplication at some point. Just for comparison purposes, I would at least set cuDNN to deterministic for now to make sure both cards are using the same algorithms, to help with the further debugging.

ptrblck · February 6, 2019, 7:55pm

Yeah, that’s a good point. I would also try to set all seeds and use deterministic methods first.
Then I would suggest using the data of the first batch in both scripts to narrow down the difference between both machines.
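One way to set the seeds and the deterministic flags in a single place is sketched below. The helper name make_deterministic and the seed value are arbitrary choices for illustration, not something prescribed in this thread.

```python
import random
import torch

def make_deterministic(seed=42):
    """Seed the Python and PyTorch RNGs and ask cuDNN for deterministic algorithms."""
    random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)  # silently ignored when no GPU is available
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False  # avoid per-machine autotuned algorithm choices

# Re-seeding should reproduce the same random draw
make_deterministic(42)
a = torch.rand(3)
make_deterministic(42)
b = torch.rand(3)
print(torch.equal(a, b))
```

Even with identical seeds, results are not guaranteed to match bit for bit across different GPU models or platforms, so small differences between two healthy machines are expected.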

halahup · February 6, 2019, 8:43pm

@rasbt @ptrblck I have set:

torch.manual_seed(42)
torch.backends.cudnn.deterministic = True

for both networks, on my machine and on the work machine. I also saved the network state from the working model and loaded it into the non-working model, and this time I used only the same 16 images in both models, without shuffling.

Here is the output from the working model:

This is the output from the other model:

Here is the output of the detect_anomaly, however, the exception is different sometimes:

ptrblck · February 6, 2019, 8:52pm

Could you disable cuDNN and try it again?

torch.backends.cudnn.enabled = False

halahup · February 6, 2019, 8:58pm

Without cuDNN enabled, the exception is the same every run:

halahup · February 6, 2019, 9:06pm

@ptrblck So, I unplugged one of the RTX2080Ti, even though I wasn’t using it, and left the other one in. I restarted the machine and ran the same script with no issues occurring (with both cuDNN on and off):

However, I am not sure why the losses are different from the other machine, since we fixed the seed.

ptrblck · February 6, 2019, 9:16pm

That’s interesting. It sounds like a hardware issue then.
Did you install PyTorch via pip/conda or build from source?
Could you compare the versions via print(torch.__version__)?

halahup · February 6, 2019, 9:54pm

Both were 1.0.1, installed through conda. So, one of the cards always causes issues when running the model, while the other one runs just fine (I tried each of them separately in different PCIe slots). I am not sure how CUDA processes the information on the device, but it seems that something is wrong with the GPU, so I am going to try to exchange it (not sure whether CUDA not working correctly is a legitimate reason to return a GPU, though). Thank you @ptrblck and @rasbt for your help.
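For anyone who wants to test each card without unplugging it, a rough per-device sanity check might look like the sketch below. The function check_gpus, the matrix size, and the tolerance are all hypothetical choices; a passing check does not prove that a card is healthy, but a failing one points at the suspect device.

```python
import torch

def check_gpus(size=256, tol=1e-3):
    """Compare a float32 matmul on every visible GPU against a CPU reference."""
    torch.manual_seed(0)
    a = torch.randn(size, size)
    b = torch.randn(size, size)
    reference = a @ b
    results = {}
    for idx in range(torch.cuda.device_count()):
        device = torch.device("cuda", idx)
        out = (a.to(device) @ b.to(device)).cpu()
        max_err = (out - reference).abs().max().item()
        # A nan error fails the comparison, so non-finite outputs are flagged too
        ok = bool(torch.isfinite(out).all()) and max_err < tol
        results[idx] = (torch.cuda.get_device_name(idx), max_err, ok)
    return results

# On a machine without CUDA devices the result is simply empty
results = check_gpus()
for idx, (name, max_err, ok) in results.items():
    print(f"cuda:{idx} {name}: max error {max_err:.2e}, ok={ok}")
```

Setting the CUDA_VISIBLE_DEVICES environment variable to a single index before launching the training script restricts it to one card, which makes it possible to rerun the full model on each GPU in turn.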