Saved model performs badly

huch · May 26, 2018, 3:27pm

Hello PyTorchers,
I am having an issue with a trained model. During training the model eventually performed very well, but after I saved it and loaded it again (via state_dicts), it could not make even one valid prediction.

Network structure:

    import torch.nn as nn

    class Net(nn.Module):
        def __init__(self):
            super(Net, self).__init__()
            self.fc1 = nn.Linear(64, 128)
            self.sig = nn.Sigmoid()  # Also tried Tanh and ReLU
            self.fc2 = nn.Linear(128, 64)

Optimizer: RMSprop
Criterion: SmoothL1Loss

P.S. The model file is about 66 KB. Is that okay?
P.S. I tried different hidden layer sizes (32, 64, 128) and different learning rates.

Thank you!
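
On the file-size question, a rough check is possible by counting parameters. This is a sketch that assumes float32 parameters and the two-layer network shown above; the `forward` method is an assumption, since the post does not show it. The count comes to 16,576 values, or 66,304 bytes of raw data, so a file of roughly 66 KB is about what one would expect once a little serialization overhead is added.

```python
import torch.nn as nn


class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(64, 128)
        self.sig = nn.Sigmoid()
        self.fc2 = nn.Linear(128, 64)

    def forward(self, x):
        # Assumed forward pass; not shown in the original post.
        return self.fc2(self.sig(self.fc1(x)))


model = Net()
# Each nn.Linear(in, out) holds in*out weights plus out biases.
num_params = sum(p.numel() for p in model.parameters())
# element_size() is the number of bytes per value (4 for float32).
raw_bytes = sum(p.numel() * p.element_size() for p in model.parameters())
print(num_params, raw_bytes)
```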

ptrblck · May 26, 2018, 4:55pm

Are you using the same data to evaluate your model?
What were the training and validation errors?
Did you make sure to call model.eval() before evaluating the model? Your current model probably doesn’t need it, because it doesn’t contain any Dropout or BatchNorm, but it’s recommended anyway.
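
To illustrate why `model.eval()` matters for some models but probably not this one, here is a small sketch. The Dropout model is hypothetical and only serves as a contrast; the second model uses the same layer types as the network in the question.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical model with Dropout, used only for contrast.
with_dropout = nn.Sequential(nn.Linear(64, 128), nn.Dropout(p=0.5), nn.Linear(128, 64))
# Same layer types as the network in the question (no Dropout or BatchNorm).
plain = nn.Sequential(nn.Linear(64, 128), nn.Sigmoid(), nn.Linear(128, 64))

x = torch.randn(4, 64)

with torch.no_grad():
    with_dropout.train()
    train_a = with_dropout(x)
    train_b = with_dropout(x)   # a new random dropout mask on each call
    with_dropout.eval()
    eval_a = with_dropout(x)
    eval_b = with_dropout(x)    # dropout is disabled in eval mode

    plain.train()
    plain_train = plain(x)
    plain.eval()
    plain_eval = plain(x)

print("dropout, train mode, repeatable:", torch.equal(train_a, train_b))
print("dropout, eval mode, repeatable:", torch.equal(eval_a, eval_b))
print("plain model, train == eval:", torch.equal(plain_train, plain_eval))
```

Wrapping evaluation in `torch.no_grad()` is optional for correctness but avoids building the autograd graph during prediction.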

huch · May 26, 2018, 5:41pm

Thanks for your answer, ptrblck!
I am creating an AI player and training it by having it play the game.
Yes. First I train it on the best possible move. After that I call predict and check whether the result is correct. If not, I repeat; otherwise I make the move and continue with the game. In training, after about 15 generations (50 * 15 * ~7000 moves), I get the first models with 100% accuracy.
Yes, I tried model.eval(); it does not change anything.

SimonW · May 26, 2018, 6:18pm

Could you post your save and load code, please?

huch · May 26, 2018, 6:24pm

Hi SimonW,
There is nothing special about my save and load code.

I save the model at hardcoded times to data/time/modelname.pkl

    # Save the model (needs os, time, and torch imported)
    end = time.time()
    run_dir = 'data/' + time.strftime("%Y%m%d-%H%M%S", time.localtime(self.start))
    for i in range(6):
        # Save each checkpoint once, as soon as its hardcoded time has elapsed
        if end - self.start >= self.savedTimes[i] and not self.saved[i]:
            self.saved[i] = True
            path = run_dir + '/' + self.savedFileNames[i]
            os.makedirs(os.path.dirname(path), exist_ok=True)
            torch.save(self.player.state_dict(), path)

Load

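    def __init__(self, game, color, modelName, size):
        self.game = game
        self.color = color
        # Pick the network class that matches the hidden size used in training
        if size == 32:
            self.model = Net32()
        elif size == 64:
            self.model = Net64()
        elif size == 128:
            self.model = Net128()
        else:
            raise ValueError("unsupported hidden size: %r" % size)

        self.model.load_state_dict(torch.load(modelName))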

P.S. Net32, Net64, and Net128 are identical except for the number of hidden neurons. I wrote that in a hurry.
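
One way to avoid three near-identical classes is a single class that takes the hidden size as an argument. This is a sketch, not the original code: the `hidden` parameter name and the `forward` method are assumptions.

```python
import torch
import torch.nn as nn


class Net(nn.Module):
    def __init__(self, hidden=128):
        super().__init__()
        self.fc1 = nn.Linear(64, hidden)
        self.sig = nn.Sigmoid()
        self.fc2 = nn.Linear(hidden, 64)

    def forward(self, x):
        # Assumed forward pass; the original post does not show it.
        return self.fc2(self.sig(self.fc1(x)))


# One instance per hidden size, replacing Net32, Net64, and Net128.
models = {size: Net(hidden=size) for size in (32, 64, 128)}
out = models[32](torch.randn(1, 64))
print(out.shape)
```

The hidden size used when loading has to match the one used when saving. If it does not, `load_state_dict` raises a size-mismatch error, so that particular mistake would not silently produce bad predictions.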

SimonW · May 26, 2018, 7:21pm

You are right. This part looks fine… Could you try to come up with a minimal self-contained script that reproduces the issue? Thanks!
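
A minimal round-trip script along these lines might look like the sketch below. It assumes the network from the first post plus a guessed `forward` method, uses random tensors in place of real game states, and writes to a temporary file rather than the `data/...` path used in the thread.

```python
import os
import tempfile

import torch
import torch.nn as nn


class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(64, 128)
        self.sig = nn.Sigmoid()
        self.fc2 = nn.Linear(128, 64)

    def forward(self, x):
        # Assumed forward pass; not shown in the thread.
        return self.fc2(self.sig(self.fc1(x)))


trained = Net()          # stands in for the trained model
x = torch.randn(8, 64)   # stands in for real game states

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model.pkl")
    torch.save(trained.state_dict(), path)

    restored = Net()     # fresh instance with different random weights
    restored.load_state_dict(torch.load(path))

trained.eval()
restored.eval()
with torch.no_grad():
    same = torch.equal(trained(x), restored(x))
print("outputs identical after reload:", same)
```

If the outputs match in a script like this but not in the real program, the problem is more likely outside save and load, for example in which object is being saved or in how inputs are built at prediction time.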

roaffix · May 28, 2018, 12:37am

Did you train it on a GPU or a CPU? I ran into some issues when I trained a model on a GPU and tried to load it on a CPU.
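
For the GPU-to-CPU case, `torch.load` accepts a `map_location` argument that remaps stored tensors onto another device while loading. A sketch, where the stand-in `nn.Linear` model and the temporary file are assumptions for illustration:

```python
import os
import tempfile

import torch
import torch.nn as nn

# Use the GPU if one is available; otherwise the whole sketch runs on the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = nn.Linear(64, 64).to(device)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model.pkl")
    torch.save(model.state_dict(), path)
    # map_location="cpu" places every stored tensor on the CPU while loading.
    state = torch.load(path, map_location="cpu")

cpu_model = nn.Linear(64, 64)
cpu_model.load_state_dict(state)
print({name: t.device.type for name, t in state.items()})
```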

hughperkins · May 28, 2018, 7:35am

You’re saving self.player, but restoring into self.model. There’s nothing to say that that is not correct, but it seems … unintuitive/inconsistent. It might be worth making these names consistent, just in case this is masking the actual bug somehow?
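
One way to check whether the two objects really hold the same parameters is to compare their `state_dict()` keys. In this sketch, `Player` is a hypothetical wrapper module (the thread does not show what `self.player` actually is), and `Net` follows the structure from the first post. With the default `strict=True`, `load_state_dict` raises a `RuntimeError` when the keys do not match, so a mismatch of this kind would normally be loud rather than silent.

```python
import torch.nn as nn


class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(64, 128)
        self.sig = nn.Sigmoid()
        self.fc2 = nn.Linear(128, 64)


class Player(nn.Module):
    """Hypothetical wrapper; the real player class is not shown in the thread."""

    def __init__(self):
        super().__init__()
        self.model = Net()


player = Player()
net = Net()

saved_keys = sorted(player.state_dict().keys())   # prefixed with "model."
expected_keys = sorted(net.state_dict().keys())   # no prefix
print(saved_keys)
print(expected_keys)

try:
    net.load_state_dict(player.state_dict())      # strict=True by default
    mismatch_detected = False
except RuntimeError:
    mismatch_detected = True
print("mismatch detected:", mismatch_detected)
```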