# Cost function repeats itself at each epoch... Why?

Hello,

I have defined a DenseNet (please follow the link for details) and a custom Loss-function `MSE_mod` as shown below:

```python
# mean squared error with explicit const and linear terms
from scipy.optimize import minimize
import torch

def MSE_toOptimize(params, yHat, y):
    y0, y1 = params
    x = [i for i in range(yHat.size()[1])]
    x = torch.tensor(x).to('cuda:0')
    size = yHat.size()[0] * yHat.size()[1]
    diff = yHat + y0 + y1 * x - y
    res_ = torch.sum(diff**2) / size
    return res_.cpu().detach().numpy()

def optimizedParametersAre(yHat, y):
    guess = [0., 0.]
    res = minimize(MSE_toOptimize, guess, args=(yHat, y))
    params = res.x
    return params

def MSE_mod(yHat, y):
    params = optimizedParametersAre(yHat, y)
    loss__ = MSE_toOptimize(params, yHat, y)
    return loss__
```

The function `MSE_mod` combines two helper functions:

1. `MSE_toOptimize` is the function bearing the expression for the loss function itself. As seen, the loss function has the form `sum((yHat + y0 + y1*x - y)**2)`, where `x` is the vector of pixel indices, and `y0` and `y1` are parameters that are not yet determined. I want to find the values of these two parameters that minimize this expression for the given `yHat` (the predicted vector) and `y` (the ground-truth vector).

2. `optimizedParametersAre` is the function where the optimization takes place. This function returns the values of `y0` and `y1` that minimize the Loss-function for given input vectors `yHat` and `y`.
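As a quick sanity check of what `optimizedParametersAre` computes, here is a minimal numpy-only sketch (the array shapes and the offset `3.0 + 0.5*x` are made up for illustration): since `y` differs from `yHat` by exactly a linear function of `x`, `minimize` should recover that line's coefficients.

```python
import numpy as np
from scipy.optimize import minimize

# toy stand-ins for yHat and y: y differs from yHat by exactly 3 + 0.5*x,
# so the fitted parameters should come out near y0 = 3.0, y1 = 0.5
rng = np.random.default_rng(0)
x = np.arange(8)
yHat = rng.standard_normal((4, 8))
y = yHat + 3.0 + 0.5 * x

def mse(params, yHat, y):
    y0, y1 = params
    return np.mean((yHat + y0 + y1 * x - y) ** 2)

res = minimize(mse, [0.0, 0.0], args=(yHat, y))
print(res.x)  # close to [3.0, 0.5]
```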

After training for a while (on a GPU) with this loss function, I get the following results (the batch size is 100, and there are 150 batches per epoch):

The L-labeled axis shows the actual values of the custom loss function for each consecutive batch, whereas the J-labeled axis shows the overall cost function, defined as the normalized sum of all loss values over one epoch (there are 7 epochs in total).

The question is: why do the loss values within an epoch repeat from one epoch to the next, without showing any evolution over the epochs?

Hello Capo!

The short answer is that you have to rewrite your loss function so that gradients can backpropagate
through your model and can be used to update your model's parameters.

(The most straightforward way to do this is to implement the entirety of
your loss-function computation with pytorch tensor operations, in which

[quote="Capo_Mestre, post:1, topic:95790, full:true"]

```python
# mean squared error with explicit const and linear terms
def MSE_toOptimize(params, yHat, y):
    ...
    x = torch.tensor(x).to('cuda:0')
    ...
    return res_.cpu().detach().numpy()

def optimizedParametersAre(yHat, y):
    ...

def MSE_mod(yHat, y):
    ...
    return loss__
```

If you are indeed using the above loss function for the loss you
backpropagate by calling `loss.backward()`, the backpropagation
won't work.

The various `.to('cuda:0')` calls "break the computation graph"
(unless the tensors in question are already on `'cuda:0'`, in which
case they're a no-op), and `.detach().numpy()` in any event breaks
(that part of) the computation graph.

Lastly, calling `.requires_grad_(True)` in:

```python
loss__ = torch.tensor(loss__).to('cuda:0').requires_grad_(True)
```

doesn't fix the problem. It gives you a nice, new pytorch tensor that
has `requires_grad = True`, but that is disconnected from the
computation that has already been done.
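A minimal sketch of this failure mode (with a made-up parameter `w` standing in for a model weight):

```python
import torch

w = torch.ones(3, requires_grad=True)        # stands in for a model parameter
loss_np = (w * 2).sum().detach().numpy()     # the computation graph ends here

# re-wrapping the numpy value does NOT reconnect it to w:
loss = torch.tensor(float(loss_np)).requires_grad_(True)
loss.backward()
print(w.grad)  # None -- no gradient ever reaches w
```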

As a consequence of the above, non-trivial gradients never flow back
to your model parameters, so your call to `optimizer.step()` (which
you don't show, but I'm guessing you have) doesn't actually update
your model, even though the rest of your training loop appears to
run.

You don't show how you prepare your batches, but if you don't shuffle
or otherwise randomize your training data when you create the batches,
then every epoch will cycle through the same set of batches with
the same (never-updated) model, hence giving you the same loss values
over and over again.

(But because, within an epoch, one batch is different from the next,
your loss function will vary within an epoch, but repeat from epoch
to epoch.)
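For what it's worth, if the batches come from a `Dataset`, shuffling is a single flag on the `DataLoader` (a minimal sketch; the sample count and feature size below are made up, with the batch size of 100 taken from your post):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# dummy data standing in for the real training set: 300 samples, 64 pixels each
inputs  = torch.randn(300, 64)
targets = torch.randn(300, 64)

# shuffle=True draws a fresh permutation of the samples every epoch,
# so no two epochs iterate over identical batches
loader = DataLoader(TensorDataset(inputs, targets), batch_size=100, shuffle=True)

for epoch in range(2):
    for xb, yb in loader:
        pass  # forward / loss / backward / step would go here
```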

Good luck.

K. Frank


Thanks a lot @KFrank for a great reply! Everything seems to make sense.

Yes, I do use this command in the training process.

Indeed, I don't shuffle the training data.

I will try to rethink how to rewrite the code so that my custom loss function works properly, keeping in mind the understanding that your reply brought.

The reason why I did `detach().numpy()` is that the `minimize` function was not happy with a `torch.tensor` as the return value of `MSE_toOptimize`.

Looking at the documentation, it seems that I should make use of the `tensor.clone()` function to keep the connection with the computation graph. Would you confirm?
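As a quick check (a minimal sketch), `clone()` does seem to stay on the graph while `detach()` leaves it:

```python
import torch

a = torch.ones(3, requires_grad=True)
b = (a * 2).sum()

c = b.clone()   # clone keeps the graph: c has a grad_fn
d = b.detach()  # detach drops it: d has no grad_fn

print(c.grad_fn is not None, d.grad_fn is None)  # True True
```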

Only when the function returns the result transferred to the CPU, detached, and converted to numpy can the `minimize` function accept `MSE_toOptimize` as its first argument. This probably breaks the flow of the computation graph, but I am not sure how to make `minimize` work otherwise. Would you be aware of a simple workaround?

@KFrank

I found some code here that sounds like a workaround, but for now it is not quite clear to me how to use it in my situation.

Hello Capo!

You are correct; this does break the computation graph.

In order to be able to backpropagate through the `minimize()` part
of your loss-function computation, pytorch has to be able to get the
gradient of `minimize()`'s output with respect to its input.

You have two choices:

You can write the `minimize()`* logic entirely with pytorch tensor
operations, in which case autograd will build the computation graph for
you.

Or you can wrap the computation that uses `minimize()` in a
`torch.autograd.Function`, work out the gradient of `minimize()`
"by hand," and implement that gradient computation in the
`.backward()` method of your `torch.autograd.Function`.

(Both approaches are likely to entail significant work on your part.
If I were doing it, I would be lazy and try to cook up a loss function
that doesn't use `minimize()`, but still captures the important part
of what `minimize()` is doing for you in some approximate way.)

*) With the caveat that `minimize()` is a relatively complicated
iterative algorithm, which can require care writing so that it builds a
proper computation graph, and is likely to generate large, expensive
computation graphs.
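In this particular case, though, the inner `minimize()` over `(y0, y1)` is ordinary linear least squares, so it has a closed-form solution that can be written entirely with pytorch tensor operations. A sketch, assuming `yHat` and `y` are 2-d tensors shaped (batch, pixels) as in your code (the function name is mine):

```python
import torch

def MSE_mod_torch(yHat, y):
    # the inner minimize() over (y0, y1) is ordinary least squares of
    # t = y - yHat against the line y0 + y1*x, so it has a closed form
    t = y - yHat
    x = torch.arange(yHat.size(1), dtype=yHat.dtype, device=yHat.device)
    xb = x.expand_as(t)                    # broadcast pixel index over the batch
    xm, tm = xb.mean(), t.mean()
    y1 = ((xb - xm) * (t - tm)).sum() / ((xb - xm) ** 2).sum()
    y0 = tm - y1 * xm
    # same expression as your MSE_toOptimize, but fully differentiable
    return ((yHat + y0 + y1 * x - y) ** 2).mean()
```

Because everything here is a tensor operation, autograd builds the graph and `loss.backward()` propagates gradients back to the model that produced `yHat`.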

Good luck.

K. Frank
