Issues with exported copy of trained model

PresidentDoggo · June 18, 2020, 12:55pm

Hi there,

I’m trying to export a trained copy of my model for use in another Python script. However, when a sample of labelled training data is passed through the exported network, it predicts the wrong class. This should not happen, as I am getting around 98% accuracy on both training and testing.

Another issue is that after the initial prediction, it predicts the same class for every sample passed through the network. The only way to change the predictions once it gets stuck like that is to save a copy of the trained network again. This could be an issue with Jupyter notebooks, though. I have tried restarting the kernel and clearing all variables after trying each new sample.

All the data is preprocessed and scaled identically to the original training and testing script.

Is it even possible to use saved states in this way?
I am new to PyTorch, so it is also likely that I have made some kind of grave error somewhere else.

After training the network I am saving the model state dictionary:

# Save trained network
PATH = "models/ANNKDD.pt"
torch.save(network.state_dict(), PATH)

From my understanding, state_dict() returns a dictionary of the network's weights, and torch.save writes them out as they are at the moment the save function is called.
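To make that concrete, here is a minimal sketch of what a state dictionary holds and how it round-trips into a fresh instance. The single `nn.Linear` layer is a hypothetical stand-in chosen only to keep the example small; it is not the network from this thread.

```python
import torch
import torch.nn as nn

# Tiny stand-in module (hypothetical), used only to show what state_dict() holds.
layer = nn.Linear(in_features=4, out_features=2)

# The state dict maps parameter names to tensors.
state = layer.state_dict()
for name, tensor in state.items():
    print(name, tuple(tensor.shape))

# Loading the dictionary into a fresh instance copies the values over.
restored = nn.Linear(in_features=4, out_features=2)
restored.load_state_dict(state)
same = all(torch.equal(state[k], restored.state_dict()[k]) for k in state)
print(same)
```

If `same` is true for the real network as well, the saved weights themselves are unlikely to be the source of the discrepancy.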

Then in a new python script I am loading in the trained network using:

PATH = "models/ANNKDD.pt"
network = Network()
network.load_state_dict(torch.load(PATH))
network = network.eval()

with torch.no_grad():
    predictions = network(X)

Tested on both PyTorch 1.4.0 and 1.6.0.

Any insight into this problem would be much appreciated, thanks!

ptrblck · June 19, 2020, 7:55am

The code to save and load the model looks alright.
I would recommend storing a processed tensor as well as its output in the training script, and comparing that to the output for the same tensor in the evaluation script.
If you get the same results, the difference might come from the preprocessing or the input tensors.
However, if the result is different, we would need to look into the model for further debugging.
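As a rough sketch of such a comparison, a small helper could report the largest absolute difference and whether the predicted classes agree. The name `compare_outputs` and the tolerance `atol` are hypothetical choices for illustration, and the two tensors below are made-up values standing in for the stored and restored outputs.

```python
import torch

def compare_outputs(reference, restored, atol=1e-6):
    """Return (max_abs_err, same_prediction, close) for two logit tensors."""
    max_abs_err = (reference - restored).abs().max().item()
    same_prediction = torch.equal(reference.argmax(dim=1), restored.argmax(dim=1))
    close = torch.allclose(reference, restored, atol=atol)
    return max_abs_err, same_prediction, close

# Made-up logits standing in for the output saved by the training script
# and the output recomputed in the evaluation script.
reference = torch.tensor([[0.1, 2.0, -1.0]])
restored = torch.tensor([[0.1, 2.0, -1.0]])
print(compare_outputs(reference, restored))
```

Both tensors should be on the same device before they are compared.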

PresidentDoggo · June 19, 2020, 2:21pm

Hi @ptrblck, thanks for replying.

I have a very simple model:

class Network(nn.Module):
    def __init__(self):
        super(Network, self).__init__()
        self.fc1 = nn.Linear(in_features = 41, out_features = 12) 
        self.fc2 = nn.Linear(in_features = 12, out_features = 18)
        self.out = nn.Linear(in_features = 18, out_features = 23) 
    def forward(self, t):
        t = F.relu(self.fc1(t))
        t = F.relu(self.fc2(t))
        t = self.out(t)
        return t

Using the same data sample:

Output tensor from the training script -

tensor([[-5.8796,  0.1122, -2.8794,  0.9305, -3.4934, -7.0429, -0.1777, -2.3283,
         -1.0330, -3.4359, -3.4113,  1.5217, -1.6634, -1.9250, -5.6562, -3.7723,
         -1.8225, -4.3986, -4.6337, -2.6478, -5.8153, -2.6518, -0.4750]])

Output tensor from the exported model -

tensor([[-3.4793, -2.1187, -4.3994, -1.4864, -4.5343, -3.9016, -3.1163, -2.0864,
         -1.4641, -3.9484, -4.8409,  1.2172, -3.8068, -4.4060, -2.3133, -3.9828,
         -4.2283, -5.3425,  0.6747, -4.3991, -4.5983, -3.2529, -2.3505]])

In this instance, they both correctly classified the data as class [11].
However, when I reran the script, this time with a different data sample, the output from the exported model was exactly the same as above:

tensor([[-3.4793, -2.1187, -4.3994, -1.4864, -4.5343, -3.9016, -3.1163, -2.0864,
         -1.4641, -3.9484, -4.8409,  1.2172, -3.8068, -4.4060, -2.3133, -3.9828,
         -4.2283, -5.3425,  0.6747, -4.3991, -4.5983, -3.2529, -2.3505]])

ptrblck · June 20, 2020, 7:25am

I cannot reproduce this issue using these scripts:

# script1.py
import torch
import torch.nn as nn
import torch.nn.functional as F  # needed for F.relu in forward()

class Network(nn.Module):
    def __init__(self):
        super(Network, self).__init__()
        self.fc1 = nn.Linear(in_features = 41, out_features = 12) 
        self.fc2 = nn.Linear(in_features = 12, out_features = 18)
        self.out = nn.Linear(in_features = 18, out_features = 23) 
    def forward(self, t):
        t = F.relu(self.fc1(t))
        t = F.relu(self.fc2(t))
        t = self.out(t)
        return t

# Setup
device = 'cuda'
model = Network().to(device)
data = torch.randn(32, 41).to(device)
target = torch.randint(0, 23, (32,)).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# Train on random data
nb_epochs = 100
for epoch in range(nb_epochs):
    optimizer.zero_grad()
    output = model(data)
    loss = criterion(output, target)
    loss.backward()
    optimizer.step()
    print('epoch {}, loss {}'.format(epoch, loss.item()))

# Save
model.eval() # not necessary for your model
output = model(data)
checkpoint = {
    'model': model.state_dict(),
    'data': data,
    'target': target,
    'loss': loss,
    'output': output}
torch.save(checkpoint, 'tmp.pth')

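# script2.py
# Needs the same imports and Network class definition as script1.py
device = 'cuda'
checkpoint = torch.load('tmp.pth')
model = Network().to(device)  # the checkpointed data was saved on the GPU
model.load_state_dict(checkpoint['model'])
model.eval() # not necessary for your model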

data = checkpoint['data']
target = checkpoint['target']
output_reference = checkpoint['output']
loss_reference = checkpoint['loss']

criterion = nn.CrossEntropyLoss()
output_restored = model(data)
loss_restored = criterion(output_restored, target)

print('abs err output ', (output_restored - output_reference).abs().max())
# > abs err output  tensor(0., device='cuda:0', grad_fn=<MaxBackward1>)
print('err loss ', (loss_restored - loss_reference).abs().max())
# > err loss  tensor(0.0001, device='cuda:0', grad_fn=<MaxBackward1>)

Could you compare your code to mine and check for differences?
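Since the exported model returned identical outputs for two different samples, another quick check might be whether the tensors reaching the model actually differ, and whether the last hidden layer is all zeros after the ReLU (in which case the output would just be the bias of the final layer). This is only a sketch: `x_a` and `x_b` are hypothetical random stand-ins for two preprocessed samples, and the network here is freshly initialized rather than loaded from the saved file.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Network(nn.Module):  # same architecture as posted above
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(in_features=41, out_features=12)
        self.fc2 = nn.Linear(in_features=12, out_features=18)
        self.out = nn.Linear(in_features=18, out_features=23)

    def forward(self, t):
        t = F.relu(self.fc1(t))
        t = F.relu(self.fc2(t))
        return self.out(t)

torch.manual_seed(0)
network = Network().eval()  # in practice, call load_state_dict(...) first
x_a = torch.randn(1, 41)    # hypothetical stand-ins for two preprocessed samples
x_b = torch.randn(1, 41)

with torch.no_grad():
    # Are the two inputs really different tensors?
    inputs_differ = not torch.equal(x_a, x_b)
    # Is every unit in the last hidden layer zero for the first sample?
    hidden_a = F.relu(network.fc2(F.relu(network.fc1(x_a))))
    dead_hidden = bool((hidden_a == 0).all())
    # Do the final outputs differ between the two samples?
    outputs_differ = not torch.equal(network(x_a), network(x_b))

print(inputs_differ, dead_hidden, outputs_differ)
```

If `inputs_differ` is false, the problem is in how the samples are loaded or preprocessed rather than in the model.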