NaN predictions after loading a model with torch::load()

While training and validating the model, the output was completely normal. After training, I called torch::save() to save the model to a .pt file, and then called torch::load() to load the model from the file to make predictions. At this point, the predicted values become NaN. I have checked the network structure, the network's data type (Double), and the input data and its data type (Double); they are all correct. This problem has been troubling me for a few days, please help.

Network

struct NetImpl : torch::nn::Module {
    NetImpl(int in_feature, int out_feature)
        : fc1(in_feature, 10),
        fc2(10, 100),
        fc3(100, 500),
        fc4(500, 250),
        fc5(250, 125),
        fc6(125, 70),
        fc7(70, out_feature)
    {
        register_module("fc1", fc1);
        register_module("fc2", fc2);
        register_module("fc3", fc3);
        register_module("fc4", fc4);
        register_module("fc5", fc5);
        register_module("fc6", fc6);
        register_module("fc7", fc7);
    }

    torch::Tensor forward(torch::Tensor x) {
        x = torch::leaky_relu(fc1->forward(x));
        x = torch::leaky_relu(fc2->forward(x));
        x = torch::leaky_relu(fc3->forward(x));
        x = torch::leaky_relu(fc4->forward(x));
        x = torch::leaky_relu(fc5->forward(x));
        x = torch::leaky_relu(fc6->forward(x));
        x = fc7->forward(x);
        return x;
    }
    torch::nn::Linear fc1, fc2, fc3, fc4, fc5, fc6, fc7;
};
TORCH_MODULE(Net);

Saving, loading, and then printing the model gives the output below

torch::save(model, model_path);
torch::load(model, model_path);
std::cout << model << std::endl;

NetImpl(
(fc1): torch::nn::Linear(in_features=5, out_features=10, bias=true)
(fc2): torch::nn::Linear(in_features=10, out_features=100, bias=true)
(fc3): torch::nn::Linear(in_features=100, out_features=500, bias=true)
(fc4): torch::nn::Linear(in_features=500, out_features=250, bias=true)
(fc5): torch::nn::Linear(in_features=250, out_features=125, bias=true)
(fc6): torch::nn::Linear(in_features=125, out_features=70, bias=true)
(fc7): torch::nn::Linear(in_features=70, out_features=2, bias=true)
)

Predict Method

template <typename DataLoader>
void predict(Net& model,
    DataLoader& data_loader,
    size_t dataset_size) {
    torch::NoGradGuard no_grad;  // disable autograd during inference
    model->eval();
    model->zero_grad();
    auto loss_val = 0.0;
    for (auto& batch : *data_loader) {
        auto data = batch.data, target = batch.target;
        auto output = model->forward(data);
        std::cout << output << std::endl;
        loss_val += torch::mse_loss(output, target).template item<float>();
    }
    loss_val /= dataset_size;
    std::printf("\nTest set: Average loss: %.8f\n", loss_val);
}

Result

[screenshot of the NaN output omitted]

Model File
If necessary, I will upload it to Google Drive.

Environment
libtorch stable 1.9.1
libtorch preview(nightly)
Microsoft Visual Studio 2019
Windows 10

Any suggestions and help are greatly appreciated!


UPDATE
During the first few forward operations, the output already contains several NaNs.

x = torch::leaky_relu(fc2->forward(x));
std::cout << x << std::endl;

If the tensor continues through the forward pass, all subsequent outputs become NaN. I think this may be the cause of the problem, but how should it be solved?

Could you check the output of fc1 and its values?
The printed values of x = torch::leaky_relu(fc2->forward(x)); have a huge range, so it would be interesting to see what range the input to the model has (the min and max in particular) as well as the parameters of fc1.
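A minimal sketch of such a check, assuming the `Net` module defined earlier in this thread and one input batch `data` (both names taken from the thread, not a definitive implementation):

```cpp
#include <torch/torch.h>
#include <iostream>

// Sketch: print the range of the input and of fc1's parameters.
// `Net` is the module defined earlier in this thread; `data` is one input batch.
void inspect_fc1(Net& model, const torch::Tensor& data) {
    std::cout << "input      min: " << data.min().item<double>()
              << "  max: " << data.max().item<double>() << std::endl;
    std::cout << "fc1 weight min: " << model->fc1->weight.min().item<double>()
              << "  max: " << model->fc1->weight.max().item<double>() << std::endl;
    std::cout << "fc1 bias   min: " << model->fc1->bias.min().item<double>()
              << "  max: " << model->fc1->bias.max().item<double>() << std::endl;
}
```

If the input range looks sane but fc1's parameters are huge, the problem lies in the parameters rather than the data.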

Hi, thanks for the help.

I made a small change in order to observe the change in value. The code and output results are shown below.

std::cout << x << std::endl;  // print normalized input data

x = fc1->forward(x);
std::cout << x << std::endl;

x = torch::leaky_relu(x);
std::cout << x << std::endl;

x = fc2->forward(x);
std::cout << x << std::endl;

I built the same network in PyTorch, and its output was normal, with no extremely large or small values. It is worth mentioning that in LibTorch I initialized the network as Double using the code below; maybe this is the problem?

model->to(device, torch::kDouble);

Looking forward to your reply, Thanks!

Thanks for the update! No, double shouldn’t be a problem.
So it seems that the fc1 layer already creates these large values. Could you check the weight and bias of this layer and post it here, please?

Thanks for your suggestion. I checked the weight and bias of fc1. As shown below, none of them are 0, but their range of values is indeed very large. I am confused by this and don’t know what to do next.

I save and load the model by using the following code.

Net model = Net(in_feature, out_feature);
// some train and test code
torch::save(model, model_path);  // model_path is like "xxx.pt"

// initialize new model
Net new_model = Net(in_feature, out_feature);
torch::load(new_model, model_path);
new_model->to(device, torch::kDouble);
// and predict...

Thank you very much; I look forward to your reply!

So apparently something goes wrong during saving or loading.
Is it possible to load the model and/or parameters in Python? If so, are you seeing different parameters?

Could you also just run this check in a single code snippet:

  • create the model
  • check parameter stats
  • save the model
  • load the model right afterwards
  • check parameters again
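The steps above can be sketched as a single snippet (reusing the `Net` module from earlier in the thread; `check.pt` is a placeholder path):

```cpp
#include <torch/torch.h>
#include <iostream>

// Sketch of the suggested round-trip check, reusing the Net module above.
int main() {
    Net model(5, 2);                      // create the model
    auto w0 = model->fc1->weight.clone(); // parameter stats before saving
    std::cout << "before: min " << w0.min().item<double>()
              << " max " << w0.max().item<double>() << std::endl;

    torch::save(model, "check.pt");       // save the model

    Net loaded(5, 2);                     // load it right afterwards
    torch::load(loaded, "check.pt");

    auto w1 = loaded->fc1->weight;        // check parameters again
    std::cout << "after:  min " << w1.min().item<double>()
              << " max " << w1.max().item<double>() << std::endl;
    std::cout << "identical: " << w0.allclose(w1) << std::endl;
}
```

If the parameters already differ in this isolated round trip, the training code can be ruled out as the cause.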

After trying again and again, I finally found the problem. Let me briefly describe the preconditions and the solution.

Prerequisites

  1. The model parameters are of type Double
  2. The input data is also of type Double
  3. Therefore, the parameters written by torch::save() are also of type Double

Solution
Before a model saved by LibTorch can be loaded, the network must first be initialized. If the saved file contains parameters of type Double, you must first convert the freshly constructed network to Double and only then call torch::load(); this is very, very important. The sample code is shown below.

// Initialize the network
Net model = Net(in_feature, out_feature);
model->to(device, torch::kDouble);
// some code to train and validate and more...
torch::save(model, "test.pt");

// Load the model
Net new_model = Net(in_feature, out_feature);
new_model->to(device, torch::kDouble); // very very important before load
torch::load(new_model, "test.pt");

In summary, whether saving or loading the model, the data type must be kept consistent throughout the process.

Thank you very much for all your suggestions and help!
Have a good day!

Thanks a lot for the detailed explanation!
I would consider this a nasty bug, as it’s causing silently wrong results, so thanks again for the debugging!
I’ll create an issue to track and fix this.