Cost function repeats itself at each epoch... Why?

Hello,

I have defined a DenseNet (please follow the link for details) and a custom loss function, MSE_mod, as shown below:

import torch
from scipy.optimize import minimize

# mean squared error with explicit const and linear terms
def MSE_toOptimize(params,yHat,y):
    y0,y1 = params
    # x is the vector of pixel indices along the second dimension
    x = [i for i in range(yHat.size()[1])]
    x = torch.tensor(x).to('cuda:0')
    size = yHat.size()[0]*yHat.size()[1]
    diff = yHat + y0 + y1*x - y
    res_ = torch.sum(diff**2) / size
    # returned as numpy so that scipy's minimize() accepts it
    return res_.cpu().detach().numpy()

def optimizedParametersAre(yHat,y):
    guess = [0.,0.]
    # find the (y0, y1) that minimize the loss for this batch
    res = minimize(MSE_toOptimize, guess, args=(yHat,y))
    params = res.x
    return torch.tensor(params).to('cuda:0')

def MSE_mod(yHat,y):
    params = optimizedParametersAre(yHat,y)
    loss__ = MSE_toOptimize(params,yHat,y)
    # rewrap the numpy result as a cuda tensor that tracks gradients
    loss__ = torch.tensor(loss__).to('cuda:0').requires_grad_(True)
    return loss__

The function MSE_mod relies on two other functions:

  1. MSE_toOptimize holds the expression for the loss function itself. As seen, the loss is of the form sum((yHat + y0 + y1*x - y)**2), where x is the vector of pixel indices, and y0 and y1 are parameters that are not yet fixed. I want to find the values of these two parameters that minimize this expression for the given yHat and y (yHat is the predicted vector, and y is the ground-truth vector).

  2. optimizedParametersAre is where the optimization takes place: it returns the values of y0 and y1 that minimize the loss for the given input vectors yHat and y.

After training for a while (on a GPU) with this loss function, I am getting the following results (the batch size is 100, and the number of batches per epoch is 150):

The L-labeled axis shows the actual values of this custom loss function for each consecutive batch, whereas the J-labeled axis shows the overall cost function, defined as the normalized sum of all batch losses over one epoch (there are 7 epochs in total).

The question is: why do the loss values within an epoch repeat from one epoch to the next, without showing any evolution across epochs?

Hello Capo!

The short answer is that you have to rewrite your loss function to
properly use autograd, so that loss-function gradients can flow back
through your model and be used to update your model's parameters.

(The most straightforward way to do this is to implement the entirety of
your loss-function computation with pytorch tensor operations, in which
case you get the autograd machinery “for free.”)
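As a tiny, made-up illustration of getting autograd "for free" when
everything stays in pytorch tensor operations:

import torch

w = torch.randn(3, requires_grad=True)    # stand-in for a model parameter
loss = ((2.0 * w - 1.0)**2).mean()        # pure pytorch tensor ops throughout
loss.backward()
print(w.grad)                             # a real gradient, not None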

[quote=“Capo_Mestre, post:1, topic:95790, full:true”]

# mean squared error with explicit const and linear terms
def MSE_toOptimize(params,yHat,y):
    ...
    x = torch.tensor(x).to('cuda:0')
    ...
    return res_.cpu().detach().numpy()

def optimizedParametersAre(yHat,y):
    ...
    return torch.tensor(params).to('cuda:0')

def MSE_mod(yHat,y):
    ...
    loss__ = torch.tensor(loss__).to('cuda:0').requires_grad_(True)
    return loss__

[/quote]

If you are indeed using the above loss function for the loss you
backpropagate by calling loss.backward(), the backpropagation
won’t work.

The various torch.tensor(...) re-wrapping calls "break the computation
graph" (torch.tensor() copies its input into a brand-new tensor that
carries no history), and .detach().numpy() in any event breaks
(that part of) the computation graph.

Lastly, calling .requires_grad_(True) in:

loss__ = torch.tensor(loss__).to('cuda:0').requires_grad_(True)

doesn’t fix the problem. It gives you a nice, new pytorch tensor that
will track its gradient going forward, but the damage has already
been done.
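For example (with a hypothetical stand-in for your model's parameters):

import torch

w = torch.randn(3, requires_grad=True)    # stand-in for a model parameter
out = (w**2).sum()                        # computed from w, graph intact
# rewrapping through numpy, as in MSE_mod, makes a brand-new leaf tensor
loss = torch.tensor(out.detach().numpy()).requires_grad_(True)
loss.backward()
print(w.grad)                             # None -- nothing flows back to w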

As a consequence of the above, non-trivial gradients never flow back
to your model parameters, so your call to optimizer.step() (which
you don’t show, but I’m guessing you have) doesn’t actually update
your model – your model remains unchanged throughout your training
run.

You don’t show how you prepare your batches, but if you don’t shuffle
or otherwise randomize your training data when you create the batches,
then all the epochs will be cycling through the same set of batches with
the same model, hence giving you the same loss values over and over
again.

(But because, within an epoch, one batch is different from the next,
your loss function will vary within an epoch, but repeat from epoch
to epoch.)
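For what it’s worth, the usual way to get that randomization is
shuffle = True on a DataLoader. A minimal sketch, with dummy
stand-ins for your data, sized to match your stated 150 batches of
100 samples:

import torch
from torch.utils.data import DataLoader, TensorDataset

inputs = torch.randn(15000, 64)     # dummy data: 150 batches of 100
targets = torch.randn(15000, 64)
# shuffle = True re-randomizes which samples land in which batch each
# epoch, so per-batch loss values no longer repeat epoch after epoch
loader = DataLoader(TensorDataset(inputs, targets),
                    batch_size=100, shuffle=True)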

Good luck.

K. Frank


Thanks a lot @KFrank for a great reply! Everything seems to make sense.

Yes, I do call optimizer.step() in the training process.

Indeed, I don’t shuffle the training data.

I will try to think about how to rewrite the code so that my custom loss function works properly, keeping in mind the understanding that your reply brought.

The reason why I did detach().numpy() is that the function minimize was not happy with a torch.Tensor as the output of the function MSE_toOptimize.

Looking at the documentation, it seems that I should make use of tensor.clone() to keep the connection with the computation graph. Would you confirm?

Only when the function returns the result transferred to the CPU, detached, and converted to NumPy does the minimize function accept MSE_toOptimize as its first argument. This probably breaks the computation graph, but I am not sure how to make minimize work otherwise. Would you be aware of a simple workaround?

@KFrank

I found some code here that sounds like a workaround solution, but for now it is not quite transparent to me how to use it in my situation.

Hello Capo!

You are correct; this does break the computation graph.

In order to be able to backpropagate through the minimize() part
of your loss-function computation, pytorch has to be able to get the
gradient of minimize()'s output with respect to its input.

You have two choices:

You can write the minimize()* logic entirely with pytorch tensor
operations, and let pytorch’s autograd calculate the gradients for
you.

Or you can wrap the computation that uses minimize() in a
torch.autograd.Function, work out the gradient of minimize()
“by hand,” and implement that gradient computation in the
.backward() method of your torch.autograd.Function.
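As a rough, untested sketch of the second approach (the class name is
mine; the key fact is that, at the minimizing (y0, y1), the partial
derivatives of the loss with respect to y0 and y1 vanish, so, if I have
my calculus right, the gradient with respect to yHat is just
2 * (yHat + y0 + y1*x - y) / size with the optimal parameters held
fixed):

import numpy as np
import torch
from scipy.optimize import minimize

class MSEModFunction(torch.autograd.Function):
    @staticmethod
    def forward(ctx, yHat, y):
        x = torch.arange(yHat.size(1), dtype=yHat.dtype, device=yHat.device)
        size = yHat.numel()

        # run scipy's minimize() on plain floats, outside of autograd
        def objective(params):
            y0, y1 = params
            return float(((yHat + y0 + y1 * x - y)**2).sum()) / size

        y0, y1 = minimize(objective, np.zeros(2)).x
        diff = yHat + float(y0) + float(y1) * x - y
        ctx.save_for_backward(diff)
        return (diff**2).sum() / size

    @staticmethod
    def backward(ctx, grad_output):
        diff, = ctx.saved_tensors
        # gradient with (y0, y1) frozen at their optimal values
        return grad_output * 2.0 * diff / diff.numel(), None

You would then compute your loss as MSEModFunction.apply(yHat, y)
instead of MSE_mod(yHat, y).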

(Both approaches are likely to entail significant work on your part.
If I were doing it, I would be lazy and try to cook up a loss function
that doesn’t use minimize(), but still captures the important part
of what minimize() is doing for you in some approximate way.)
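In your particular case, though, being lazy is easy: the inner
minimization over (y0, y1) is plain linear least squares, so (if I
understand your loss correctly) it has a closed-form solution that can
be written entirely with pytorch tensor operations. No minimize() at
all, and autograd tracks everything. A sketch, assuming yHat and y
have shape (batch, pixels) and that (y0, y1) are shared across the
whole batch, as in your code:

import torch

def MSE_mod(yHat, y):
    B, P = yHat.shape
    x = torch.arange(P, dtype=yHat.dtype, device=yHat.device)
    r = y - yHat                      # residual to be fitted by y0 + y1*x
    n = B * P
    Sx = B * x.sum()                  # sums for the 2x2 normal equations
    Sxx = B * (x * x).sum()
    Sr = r.sum()
    Sxr = (x * r.sum(dim=0)).sum()
    det = n * Sxx - Sx * Sx
    y0 = (Sxx * Sr - Sx * Sxr) / det  # closed-form least-squares solution
    y1 = (n * Sxr - Sx * Sr) / det
    diff = yHat + y0 + y1 * x - y
    return (diff**2).mean()

(The gradients that also flow through y0 and y1 are harmless: at the
minimum the loss is stationary in those parameters, so they contribute
nothing extra.)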

*) With the caveat that minimize() is a relatively complicated
iterative algorithm, which can require care to write so that it builds a
proper computation graph, and which is likely to generate large, expensive
computation graphs.

Good luck.

K. Frank
