Getting NaN values only on CPU

Hi,
I have a saved model that I’m trying to load.
The training was originally done on the GPU.
When I load it onto the GPU, it works fine. However, when I load it onto the CPU, sometimes it works and sometimes it doesn't (I get some NaN values).

I’m using the following functions to save and load the model,

import importlib
import os
import sys

import torch


def save_checkpoint(model, epoch, opt, save_path):
    model_out_path = os.path.join(save_path, "model_epoch_{}.pth".format(epoch))
    if opt.cuda:
        # opt.cuda implies the model is wrapped (e.g. in nn.DataParallel), so unwrap via .module
        model_file = model.module.__class__.__module__
        model_class = model.module.__class__.__name__
        model_state = model.module.state_dict()
    else:
        model_file = model.__class__.__module__
        model_class = model.__class__.__name__
        model_state = model.state_dict()

    state = {"epoch": epoch,
             "model_state": model_state,
             "opt": opt,
             "args": sys.argv,
             "model_file": model_file,
             "model_class": model_class}

    # check path status
    if not os.path.exists("model/"):
        os.makedirs("model/")

    # save model
    torch.save(state, model_out_path)
    print("Checkpoint saved to {}".format(model_out_path))


def load_checkpoint(path, **extra_params):
    iscuda = extra_params.pop('iscuda', True)
    map_location = 'cuda' if iscuda else 'cpu'
    checkpoint = torch.load(path, map_location=map_location)
    print(checkpoint["model_file"])
    print(checkpoint["model_class"])
    # import the module the model class was defined in and instantiate the class
    md = importlib.import_module(checkpoint["model_file"])
    model_cls = getattr(md, checkpoint["model_class"])
    model = model_cls(**extra_params)  # an empty extra_params behaves like a no-arg call
    model.load_state_dict(checkpoint["model_state"])
    return model, checkpoint["epoch"], checkpoint

While the checkpointing code looks a bit complicated, I cannot see any obvious error.
Could you load two models, one on the CPU and the other one on the GPU, and compare all parameters of these models?

Also, I assume your data pipeline hasn’t changed between the CPU and GPU runs?
Could you nevertheless check the inputs for NaN and Inf values?
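E.g. a quick check like this on each batch would work (a minimal sketch; `inputs` stands in for whatever tensor your data pipeline produces):

import torch

def assert_finite(inputs, name="inputs"):
    # Raise early if the batch already contains NaN or Inf values.
    if torch.isnan(inputs).any() or torch.isinf(inputs).any():
        raise ValueError("{} contains NaN or Inf values".format(name))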

Sorry for the very late reply, I couldn't use the GPU to test the code until today.
I used the following code to compare the two loaded models on the CPU and the GPU:

import argparse
import torch
from torch.autograd import Variable
import scipy.io as sio
import os
import utils

parser = argparse.ArgumentParser(description="Pytorch DRRN Eval")
parser.add_argument("--model", type=str, default="dense_A_t_J_w_regW", help="model path")
parser.add_argument("--ep", type=int, default=24, help="epoch path")

opt = parser.parse_args()

save_path = os.path.join('/home/krm/ext/Dehazing/results/', opt.model)
utils.checkdirctexist(save_path)

model_path = os.path.join('model', opt.model, "model_epoch_{}.pth".format(opt.ep))

model, epoch, dict = utils.load_checkpoint(model_path, iscuda=False)
model_cuda, epoch, dict = utils.load_checkpoint(model_path, iscuda=True)

not_equal = []
for (name, p), (name_cuda, p_cuda) in zip(model.named_parameters(), model_cuda.named_parameters()):
    # a parameter differs if any element differs, so require all() elements to match
    if not (p == p_cuda).all():
        not_equal.append(name)

print("not_equal = ")
print(not_equal)

Apparently they are identical. I also see the problem with some test images, where I'm reading the images directly using OpenCV, so there is no problem with NaNs or Infs in the input.
The problem is also not consistent: sometimes it happens and sometimes it doesn't, even for the exact same input at test time! The output is not the same when I run the code several times on the CPU; on the GPU, however, it is always consistent.
I don't know, but could there somehow be a problem with the OS?

Could you check the numpy version in your environment, please?
Recently we had an issue where an older numpy version was used (1.14, I think), which caused some trouble with NaN values.
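You can print it directly from the environment you run the model in:

import numpy as np
import torch

print(np.__version__)
print(torch.__version__)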

Thanks for your reply.
I have two conda environments: one uses numpy 1.18.1 and the other uses 1.16.4.
Unfortunately, I am not sure whether the problem exists in both environments or not.

Could you try to run the code in the environment with numpy==1.18.1 please?

It works fine there.
As I said before, it is not always consistent. Sometimes it happens and sometimes it doesn’t.
I will mark it as solved for now, unless something new happens.
Thanks again :slight_smile:

@ptrblck
Hello, I am getting the same problem.
Whenever I try to load a saved model, both on the CPU and on the GPU, more often than not I am getting NaN values.
This behavior is quite inconsistent on both CPU and GPU.
Basically my model is an ensemble: it consists of 3 different models used together at once, and all three models are saved in one single .pt file.
The absurd thing is that I am able to load model 1 and model 3 from the same file and they aren't returning NaN values, while model 2 is returning NaN values.
I am using numpy 1.19.4.
Is there any solution to this?

Could you check the parameters of the model before saving and after loading?
Based on your description it seems that your model is somehow corrupted during the serialization.
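E.g. something along these lines (a minimal sketch; `model` and `PATH` stand in for your objects):

import torch

# Snapshot the parameters right before saving.
before = {k: v.detach().clone() for k, v in model.state_dict().items()}
torch.save(model.state_dict(), PATH)

# Reload the checkpoint and compare entry by entry.
after = torch.load(PATH, map_location="cpu")
for key in before:
    if not torch.equal(before[key].cpu(), after[key].cpu()):
        print("mismatch in", key)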

So I checked, and the parameters (weights and biases) of the model before saving and after loading turn out to be the same.
I am getting NaN values every time I train and load this ensemble model: the second pipeline only returns NaN values while the first pipeline doesn't, although both are saved in the same .pt file.

It happens every time I save the model.

Try using:

Save:

torch.save(model.state_dict(), PATH)

Load:

model = TheModelClass(*args, **kwargs) 
model.load_state_dict(torch.load(PATH)) 
model.eval()

https://pytorch.org/tutorials/beginner/saving_loading_models.html

I had the same issue with loading the entire model. But just loading the state_dict performed very well on future training/evaluation.

I am already using this, @J_Johnson,
and when I try to load net_1 for evaluation the data seems corrupted.

Save best model:

if epoch == 0:
    save_best_val = val_data_ar[-1]
    np.save(path+"mlh_dummy_input_mean_sh.npy", adcc)
elif save_best_val > val_data_ar[-1]:
    torch.save(
        {'model_dict_1': net_1.state_dict(),  'model_dict_3':net_3.state_dict(),'model_dict_ens': net.state_dict(),
         'optimizer_dic': optimizer.state_dict(), 'epoch': epoch, 'loss': val_data_ar[-1]},
        path+"mlh_tas_save_best_sh.pt")
    save_best_val = val_data_ar[-1]
    np.save(path+"mlh_bnf_mag_96ms_" + str(epoch) + ".npy", local_dt_sp)

#Model Load Script

device=torch.device("cpu")
model_1=net.Model_1().to(device="cpu")
model_3=net.Model_3().to(device="cpu")
model_4=net.Ensemble(model_1,model_3,1).to(device="cpu")
model_4.batch_size=1

chkp=torch.load(path+"mlh_tas_save_best_sh.pt",map_location=device)
#print(chkp['model_dict_2'])
model_1.load_state_dict(chkp['model_dict_1'])
model_3.load_state_dict(chkp['model_dict_3'])
model_4.load_state_dict(chkp['model_dict_ens'])

Could you clarify these two statements, please?
Is the input data corrupted, since you've already checked that all parameters are equal?

Could you run a constant input, e.g. torch.ones, through the model before saving and after loading the model and compare the outputs?
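I.e. something like this (a rough sketch; `model`, `example_inputs`, and the path are placeholders, and for your Net_1 you would pass both input tensors on the same device as the model):

import torch

def compare_before_after(model, example_inputs, path="tmp_checkpoint.pt"):
    # Run a fixed input through the model, save the state_dict, reload it,
    # and run the same input again.
    model.eval()  # disable dropout so both forward passes are comparable
    with torch.no_grad():
        out_before = model(*example_inputs)

    torch.save(model.state_dict(), path)
    model.load_state_dict(torch.load(path, map_location="cpu"))

    model.eval()
    with torch.no_grad():
        out_after = model(*example_inputs)

    # Handle models that return a single tensor or a tuple of tensors.
    before = out_before if isinstance(out_before, tuple) else (out_before,)
    after = out_after if isinstance(out_after, tuple) else (out_after,)
    for b, a in zip(before, after):
        print(torch.allclose(b, a), torch.isnan(a).any().item())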

@ptrblck
First, let me explain what my architecture looks like:

Net_1(inp1, inp2):
    out_1 = series of convolution layers applied to inp1
    out_2 = series of convolution layers applied to inp2
    return out_1, out_2

Net_2(inp1):
    out_3 = series of fully connected layers applied to inp1
    return out_3

Net_Ensemble(inp1, inp2):
    out_1, out_2 = Net_1(inp1, inp2)
    z = some operation on (out_1, out_2)
    out_f = Net_2(z)
    return out_f

Backprop is done through the ensemble net.

After digging through, I found the following.
There are two scenarios when I load the saved model:
a) First case: loading the saved model restores the same parameters (weights and biases) of all the layers present in my architecture during training, but in this case I get NaN values only for out_2 from Net_1; I don't get NaNs for out_1.
b) Second case: loading the saved model returns different (garbage) values for all the parameters of all the layers in my architecture, and in this case I don't get NaNs from either out_1 or out_2.
All this happens while loading the same saved file, so the two cases occur inconsistently.
I ran the architecture for 120 epochs, and the model seems to perform as expected while training: I don't get NaNs for either out_1 or out_2 during training. This issue only arises after I save my model.
For debugging purposes I used torch.ones as input for inp_1 and inp_2.
I don't understand how to solve this issue.
The virtual environment is the same for training and loading, except that I am using the GPU for training and the CPU for loading, although the error persists when I train on the GPU and also load on the GPU.

Thanks for the update.
So based on the description of both cases: sometimes loading the model returns the expected values, other times it doesn’t.
Could you post the model architectures as well as the code to save and load the models?

#Model Architecture

import torch
import torch.nn as nn


class Model_1(torch.nn.Module):
    def __init__(self):
        super().__init__()

        self.batch_size = 128

        # Pipeline 1
        self.conv1d_down_1_depth = nn.Conv1d(769, 769, kernel_size=10, stride=1, groups=769, padding=0)
        self.conv1d_down_1_point = nn.Conv1d(769, 384, kernel_size=1, stride=1, padding=0)
        self.bn_1 = nn.LayerNorm([384, 54])

        self.relu=nn.ReLU()
        
        self.conv1d_down_2_depth = nn.Conv1d(384, 384, kernel_size=10, stride=1, groups=384, dilation=2, padding=0)
        self.conv1d_down_2_point = nn.Conv1d(384, 192, kernel_size=1, stride=1)
        self.bn_2 = nn.LayerNorm([192, 36])

        self.conv1d_down_3_depth = nn.Conv1d(192, 192, kernel_size=2, stride=1, groups=192, dilation=4, padding=0)
        self.conv1d_down_3_point = nn.Conv1d(192, 96, kernel_size=1, stride=1)
        self.bn_3 = nn.LayerNorm([96, 32])

        # Pipeline 2
        self.pip_conv_1d = nn.Conv1d(2307, 2307, kernel_size=10, stride=1, groups=2307, padding=0)
        self.pip_conv_1p = nn.Conv1d(2307, 1152, kernel_size=1, stride=1, padding=0)
        self.bn_pip_1 = nn.LayerNorm([1152, 54])
        
        self.pip_conv_2d = nn.Conv1d(1152, 1152, kernel_size=10, stride=1, groups=1152, dilation=2, padding=0)
        self.pip_conv_2p = nn.Conv1d(1152, 576, kernel_size=1, stride=1, padding=0)
        self.bn_pip_2 = nn.LayerNorm([576, 36])

        self.pip_conv_3d = nn.Conv1d(576, 576, kernel_size=2, stride=1, groups=576, dilation=4, padding=0)
        self.pip_conv_3p = nn.Conv1d(576, 288, kernel_size=1, stride=1, padding=0)
        self.bn_pip_3 = nn.LayerNorm([288, 32])

        self.drp_1 = nn.Dropout(p=0.2)
        self.drp = nn.Dropout(p=0.5)

    def forward(self, x, x2):
        # Pipeline 1, block 1
        x = self.relu(self.conv1d_down_1_depth(x))
        x = self.bn_1(self.relu(self.conv1d_down_1_point(x)))
        x = self.drp_1(x)

        # Pipeline 2, block 1
        x2 = self.relu(self.pip_conv_1d(x2))
        x2 = self.bn_pip_1(self.relu(self.pip_conv_1p(x2)))
        x2 = self.drp_1(x2)

        # Pipeline 1, block 2
        x = self.relu(self.conv1d_down_2_depth(x))
        x = self.bn_2(self.relu(self.conv1d_down_2_point(x)))
        x = self.drp_1(x)

        # Pipeline 2, block 2
        x2 = self.relu(self.pip_conv_2d(x2))
        x2 = self.bn_pip_2(self.relu(self.pip_conv_2p(x2)))
        x2 = self.drp_1(x2)

        # Pipeline 1, block 3
        x = self.relu(self.conv1d_down_3_depth(x))
        x = self.bn_3(self.relu(self.conv1d_down_3_point(x)))

        # Pipeline 2, block 3
        x2 = self.relu(self.pip_conv_3d(x2))
        x2 = self.bn_pip_3(self.relu(self.pip_conv_3p(x2)))

        # x: output of pipeline 1, x2: output of pipeline 2
        return x, x2


class Model_3(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.batch_size = 128
        self.drp = nn.Dropout(p=0.5)
        self.fc_1 = nn.Linear(384, 96)
        self.fc_2 = nn.Linear(96, 48)
        self.fc_3 = nn.Linear(48, 28)
        self.softplus = nn.Softplus()

    def forward(self, x):
        
        x = self.fc_1(x)
        
        x = self.drp(x)
        x = self.fc_3(self.fc_2(x))
        
        mean, variance = x[:, :14], x[:, 14:]
        
        variance = self.softplus(variance)

        return mean, (variance + 10e-7)


class Ensemble(torch.nn.Module):
    def __init__(self, model1, model2,bs):
        super().__init__()
        self.batch_size = bs
        self.model_a = model1
        self.model_c = model2
        self.avgpool = nn.AvgPool1d(32, stride=1)

    def forward(self, ch1, ch2):
        x,x2 = self.model_a(ch1,ch2)

        x = torch.cat((x, x2), axis=1)
        
        x = self.avgpool(x)
   
        x = x.reshape(self.batch_size, -1)
        
        mean, variance = self.model_c(x)

        return mean, variance




train_dl = DataLoader(train_data, batch_size=128, shuffle=True, num_workers=0, drop_last=True)
val_dl = DataLoader(val_data, batch_size=128, shuffle=True, num_workers=0, drop_last=True)

net_1 = Model_1().to(torch.device("cuda"))

net_3 = Model_3().to(torch.device("cuda"))

net = Ensemble(net_1, net_3,128).to(torch.device("cuda"))



optimizer = optim.Adam(net.parameters(), lr=0.0001)



ar_loss = []
batch_loss_ar = []
total_batch_idx = 0
val_data_ar = []
acc_data_ar = []
save_best_val = 0
adcc = np.zeros((1, 14))
track_var = np.zeros((1, 14))
local_dt_sp = np.zeros((1, 44))



for epoch in range(100):
    ar_loss, batch_loss_ar, adcc, track_var = train(net, train_dl, optimizer, epoch, ar_loss, batch_loss_ar)
    val_data_ar, acc_data_ar, local_dt_sp = val(net, val_dl, optimizer, epoch, val_data_ar, acc_data_ar)

    np.save(path+"mlh_ar_loss.npy", ar_loss)
    np.save(path+"mlh_batch_loss_ar.npy", batch_loss_ar)
    np.save(path+"mlh_val_data_ar.npy", val_data_ar)
    np.save(path+"mlh_acc_data_ar.npy", acc_data_ar)
    np.save(path+"mlh_bnf_track_" + str(epoch) + "_var_.npy", track_var)

    # save best model
    if epoch == 0:
        save_best_val = val_data_ar[-1]
        np.save(path+"mlh_dummy_input_mean_sh.npy", adcc)
    elif save_best_val > val_data_ar[-1]:
        debug_1, debug_2 = net_1(torch.ones(m1, m2, m3), torch.ones(m1, m4, m5))
        print(debug_1)  # does not contain NaNs
        print(debug_2)  # contains NaN values; while training this is not the case, it trains without NaNs

        torch.save(
            {'model_dict_1': net_1.state_dict(),  'model_dict_3':net_3.state_dict(),'model_dict_ens': net.state_dict(),
             'optimizer_dic': optimizer.state_dict(), 'epoch': epoch, 'loss': val_data_ar[-1]},
            path+"mlh_tas_save_best_sh.pt")
        save_best_val = val_data_ar[-1]
        np.save(path+"mlh_bnf_mag_96ms_" + str(epoch) + ".npy", local_dt_sp)

So if I do net_1(torch.ones(m1, m2, m3), torch.ones(m1, m4, m5)) I get NaN for the x2 value, while I don't get NaN for the x value.
The point to note is that while training the same model I don't get NaNs on x or on x2.
I also checked the model while running just the second pipeline, and found that the problem persists only with the second pipeline.
The loss function here is the negative log-likelihood loss.
Debugging further, I found that I start getting NaN values from the output of the self.pip_conv_1p layer in pipeline 2.
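As a side note, one way to pin down which layer first produces NaNs is a forward hook on every submodule (a minimal sketch, assuming net_1 and its inputs from the code above):

import torch

def add_nan_hooks(model):
    # Register a forward hook on every submodule that reports any module
    # whose output contains NaNs.
    def make_hook(name):
        def hook(module, inputs, output):
            outs = output if isinstance(output, tuple) else (output,)
            for out in outs:
                if torch.is_tensor(out) and torch.isnan(out).any():
                    print("NaN in output of", name)
        return hook

    for name, module in model.named_modules():
        module.register_forward_hook(make_hook(name))

# e.g. call add_nan_hooks(net_1) before running the torch.ones debug input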

#Model Loading Script

device=torch.device("cpu")
model_1=net.Model_1().to(device="cpu")
model_3=net.Model_3().to(device="cpu")
model_4=net.Ensemble(model_1,model_3,1).to(device="cpu")
model_4.batch_size=1

chkp=torch.load(path+"mlh_tas_save_best_sh.pt",map_location=device)
model_1.load_state_dict(chkp['model_dict_1'])
model_3.load_state_dict(chkp['model_dict_3'])
model_4.load_state_dict(chkp['model_dict_ens'])

Got the same problem, any update on this thread?

Have you guys tried setting the fill values of the parameters manually in the model init?

For example:

fill_val = 1e-9
for m in self.modules():
    if isinstance(m, (nn.Conv2d, nn.Linear)):
        if m.bias is not None:
            m.bias.data.fill_(fill_val)

You can also try lowering your learning rate.