Model not training when implementing a custom loss

I have a custom loss that I implemented in TensorFlow and would like to port to PyTorch for technical reasons. However, I can't seem to make it work and I don't know why: the PyTorch loss doesn't seem to train the network.

The aim is to compare the raw linear output of a network (before softmax) with true probabilities.
The desired loss is similar to the one shown in this video at 13:10.

Any help would be appreciated.

Tensorflow code:

import numpy as np
from scipy.special import softmax

import tensorflow as tf
from tensorflow.keras import backend as K
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.layers import Dense, Softmax
from tensorflow.keras import Input, Model

delta = tf.Variable([[1.]], trainable=False)

main_input = Input(shape=(10,))
output = Dense(4, activation='linear')(main_input)

def custom_loss(delta):
    def loss(y_true, y_pred):
        y_pred_softmax = Softmax()(y_pred)
        y_pred_softmax_clipped = K.clip(y_pred_softmax, 1e-8, 1 - 1e-8)
        log_likelihood = y_true * K.log(y_pred_softmax_clipped)
        return K.sum(-log_likelihood * delta)
    return loss

model = Model(inputs=[main_input], outputs=output)
model.compile(optimizer=Adam(lr=0.01), loss=custom_loss(delta))

print(model.predict(np.ones((1,10))))
print(softmax(model.predict(np.ones((1,10)))[0]))
delta.assign([[1.0]])
model.fit(np.ones((1000, 10), dtype='float'), np.asarray(1000 * [[0.7, 0.3, 0.0, 0.0]], dtype='float'))
print(model.predict(np.ones((1,10))))
print(softmax(model.predict(np.ones((1,10)))[0]))

Pytorch code:

import numpy as np

import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F


class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.fc = nn.Linear(10, 4)

    def forward(self, x):
        x = self.fc(x)
        return x

def custom_loss(delta):
    def loss(y_pred, y_true):
        y_pred_softmax = nn.Softmax(dim=1)(y_pred)
        y_pred_softmax_clipped = torch.clamp(y_pred, 1e-8, 1 - 1e-8)
        log_likelihood = y_true * torch.log(y_pred_softmax_clipped)
        return torch.sum(-log_likelihood * delta)
    return loss


delta = 1

network = Net()
loss_function = custom_loss(delta)
optimizer = optim.SGD(network.parameters(), lr=0.01, momentum=0.9)
optimizer.zero_grad()


x = torch.ones((1,10))
print(network(x))
print(nn.Softmax(dim=1)(network(x)), '\n')

n=1000
x = torch.ones((n,10))
target = torch.from_numpy(np.array(n*[[0.8, 0.2, 0.0, 0.0]]))
output = network(x)

loss = loss_function(output, target)
loss.backward()
optimizer.step()
print(loss, '\n')

x = torch.ones((1,10))
print(network(x))
print(nn.Softmax(dim=1)(network(x)))

Examples of outputs: either only one value becomes extremely large:

tensor([[-0.0858,  0.1533, -0.1739,  2.0263]], grad_fn=<AddmmBackward>)
tensor([[0.0873, 0.1109, 0.0800, 0.7218]], grad_fn=<SoftmaxBackward>) 

tensor([[-8.5779e-02,  1.4366e+02, -1.7395e-01,  2.0263e+00]], grad_fn=<AddmmBackward>)
tensor([[0., 1., 0., 0.]], grad_fn=<SoftmaxBackward>)

Or the outputs don't change at all:

tensor([[ 1.2120, -0.1411, -0.5820, -0.6478]], grad_fn=<AddmmBackward>)
tensor([[0.6327, 0.1635, 0.1052, 0.0985]], grad_fn=<SoftmaxBackward>) 

tensor([[ 1.2120, -0.1411, -0.5820, -0.6478]], grad_fn=<AddmmBackward>)
tensor([[0.6327, 0.1635, 0.1052, 0.0985]], grad_fn=<SoftmaxBackward>)

Hi Thomas -

The short answer is that your pytorch code only takes one optimizer step.

With default arguments, tf.keras.Model.fit() performs one epoch of training with a batch size of 32. You pass in 1000 samples, so you will train for about 30 optimizer steps. This should be enough to train your very simple model, at least to some degree.

(Note, I have not looked at the rest of your code in any detail.)

In contrast, torch.optim.SGD.step() takes only a single step of the optimizer. In pytorch, you have to write your own (very simple) optimizer loop.
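
For illustration, here is a minimal sketch of such a loop, assuming the network, loss_function and optimizer you already defined (names taken from your code); this roughly mirrors what keras' fit() does for you with its default batch size of 32:

n_samples, batch_size = 1000, 32
x_all = torch.ones((n_samples, 10))
y_all = torch.tensor(n_samples * [[0.7, 0.3, 0.0, 0.0]])  # soft targets, as in your tf example

for start in range(0, n_samples, batch_size):
    inputs = x_all[start:start + batch_size]
    targets = y_all[start:start + batch_size]
    optimizer.zero_grad()                           # clear gradients from the previous step
    loss = loss_function(network(inputs), targets)  # forward pass + loss
    loss.backward()                                 # compute gradients
    optimizer.step()                                # one optimizer update per mini-batch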

Good luck.

K. Frank


Oh, thank you so much! I didn't understand this aspect of the optimization before your comment; it clarified many things about PyTorch for me.

I tried to change my code according to what you said, to make the two versions as similar as I possibly could, but I don't know why it is still not working.

import numpy as np

import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F


class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.fc = nn.Linear(10, 4)

    def forward(self, x):
        x = self.fc(x)
        return x

    
def custom_loss(delta):
    def loss(y_pred, y_true):
        y_pred_softmax = nn.Softmax(dim=1)(y_pred)
        y_pred_softmax_clipped = torch.clamp(y_pred, 1e-8, 1 - 1e-8)
        log_likelihood = y_true * torch.log(y_pred_softmax_clipped)
        return torch.sum(-log_likelihood * delta)
    return loss


delta = 1
batch_size = 32
n_sample = 1000
network = Net()
loss_function = custom_loss(delta)
optimizer = optim.Adam(network.parameters(), lr=0.01)

x = torch.ones((1,10))
print(network(x))
print(nn.Softmax(dim=1)(network(x)), '\n')

target = [0.7, 0.3, 0.0, 0.0]

for i in range(int(n_sample/batch_size)):
    optimizer.zero_grad()
    
    inputs = torch.ones((batch_size,10))
    targets = torch.FloatTensor(batch_size*[target])
    outputs = network(inputs)
    
    loss = loss_function(outputs, targets)
    loss.backward()
    optimizer.step()

x = torch.ones((1,10), dtype=torch.float)
print(network(x))
print(nn.Softmax(dim=1)(network(x)))

I changed from SGD to Adam, just in case that would affect something (even if I doubt it would), and removed the momentum. I used the example from the PyTorch website to make sure I was not missing a step in the training procedure.
However, it still doesn't work as intended.

Two examples of outputs:

tensor([[-0.5445,  1.1960, -0.1856,  0.3652]], grad_fn=<AddmmBackward>)
tensor([[0.0942, 0.5370, 0.1349, 0.2340]], grad_fn=<SoftmaxBackward>) 

tensor([[-0.5445,  1.1960, -0.1856,  0.3652]], grad_fn=<AddmmBackward>)
tensor([[0.0942, 0.5370, 0.1349, 0.2340]], grad_fn=<SoftmaxBackward>)
tensor([[-0.1185,  0.5827,  0.7683,  0.4973]], grad_fn=<AddmmBackward>)
tensor([[0.1371, 0.2764, 0.3327, 0.2538]], grad_fn=<SoftmaxBackward>) 

tensor([[-0.1185,  1.8670,  0.7683,  0.4973]], grad_fn=<AddmmBackward>)
tensor([[0.0796, 0.5798, 0.1932, 0.1474]], grad_fn=<SoftmaxBackward>)

Output with the TF code:

[[ 0.0284583   0.35027373  0.4858752  -0.61461353]]
[0.2229509  0.30758977 0.35225958 0.11719974]
Train on 1000 samples
1000/1000 [==============================] - 0s 302us/sample - loss: 25.1664
[[ 2.3004684  1.4706173 -1.8008366 -2.9013252]]
[0.68579024 0.29908288 0.01135056 0.00377643]

I also tried increasing the number of samples and modifying the learning rate, but the results were still the same.

Do you see any other mistake in my code that could cause this?

(Again, thank you!)

Edit: I checked the forward and backward graph, and all the operations of the loss function are present except softmax. Is that normal?

from torchviz import make_dot, make_dot_from_trace
make_dot(loss, params=dict(network.named_parameters()))

Hi Thomas!

This is the hint.

You pass y_pred (rather than y_pred_softmax) to torch.clamp(). This looks like a simple typo, and the effect is that the call to nn.Softmax() is bypassed. (Note, you don't have this typo in your original tensorflow code.)
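
For reference, a corrected version of the loss, where the only change is that the softmax output (rather than the raw logits) is passed to torch.clamp(), just like in your tensorflow version:

def custom_loss(delta):
    def loss(y_pred, y_true):
        y_pred_softmax = nn.Softmax(dim=1)(y_pred)
        # clamp the probabilities, not the raw network output
        y_pred_softmax_clipped = torch.clamp(y_pred_softmax, 1e-8, 1 - 1e-8)
        log_likelihood = y_true * torch.log(y_pred_softmax_clipped)
        return torch.sum(-log_likelihood * delta)
    return loss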

Best.

K. Frank


Well, that was embarrassing…
Thank you for your help! It's working perfectly now.

Best,
Thomas