# Longer training epochs when calculating the loss by hand

Hello All,

This is my first post, so I am sorry if I am posting in the wrong place or if this post is too trivial, but I can't really figure out what is going wrong here. I will come straight to the point. I have a tensor that looks like this:

```
X = torch.ones((6, 8))
X[:, 2:6] = 0
X

tensor([[1., 1., 0., 0., 0., 0., 1., 1.],
        [1., 1., 0., 0., 0., 0., 1., 1.],
        [1., 1., 0., 0., 0., 0., 1., 1.],
        [1., 1., 0., 0., 0., 0., 1., 1.],
        [1., 1., 0., 0., 0., 0., 1., 1.],
        [1., 1., 0., 0., 0., 0., 1., 1.]])
```

And I have a kernel that looks like this (a simple vertical edge detection kernel):

`K = torch.tensor([[1, -1]])`

When I apply this kernel to the tensor, I get the desired result:

```
Y
tensor([[ 0.,  1.,  0.,  0.,  0., -1.,  0.],
        [ 0.,  1.,  0.,  0.,  0., -1.,  0.],
        [ 0.,  1.,  0.,  0.,  0., -1.,  0.],
        [ 0.,  1.,  0.,  0.,  0., -1.,  0.],
        [ 0.,  1.,  0.,  0.,  0., -1.,  0.],
        [ 0.,  1.,  0.,  0.,  0., -1.,  0.]])
```
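
For completeness, `Y` can be produced with a plain 2D cross-correlation; a minimal sketch (the helper `corr2d` is just an illustrative name, one possible way to get the output above):

```
import torch

def corr2d(X, K):
    # plain 2D cross-correlation: no padding, stride 1
    h, w = K.shape
    Y = torch.zeros((X.shape[0] - h + 1, X.shape[1] - w + 1))
    for i in range(Y.shape[0]):
        for j in range(Y.shape[1]):
            Y[i, j] = (X[i:i + h, j:j + w] * K).sum()
    return Y

X = torch.ones((6, 8))
X[:, 2:6] = 0
K = torch.tensor([[1, -1]])
Y = corr2d(X, K)  # shape (6, 7), matching the output above
```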

Now I want to learn this kernel. Here is what my code looks like:

```
conv = nn.Conv2d(1, 1, kernel_size=(1, 2), bias=False)

X = X.reshape(1, 1, 6, 8)
Y = Y.reshape(1, 1, 6, 7)

lr = 3e-2  # learning rate for the manual update (value chosen for illustration)

for i in range(10):
    Y_hat = conv(X)
    loss = (Y_hat - Y) ** 2
    conv.zero_grad()
    loss.sum().backward()
    # manual SGD step on the kernel weights
    with torch.no_grad():
        conv.weight -= lr * conv.weight.grad

    if i % 2 == 0:
        print(f"Loss at {i} epoch is {loss.sum()}")
```

When I run this for 10 epochs (as shown in the code above) and then print out the weights of the kernel, I get this:

```
Parameter containing:
```

This is very close to the actual kernel values (`[[1, -1]]`).

However, when I change the loss calculation to `nn.MSELoss`, like so:

```
...
loss = nn.MSELoss()(Y_hat, Y)
...
loss.backward()
```

Training for only 10 epochs does not approximate the kernel at all (I get something like `tensor([[[[-0.0427, 0.2747]]]], requires_grad=True)`).

And when I train it for much longer (say 150 epochs or so), it produces a somewhat better approximation, like `tensor([[[[ 0.7467, -0.7456]]]], requires_grad=True)`.

I am curious to know why there is such a huge difference in the number of epochs needed just because I changed the loss calculation slightly.

Thanks in advance for pointing me to any good resources.

Regards,

Your manual approach does not reduce the loss when it is first computed:

```
loss = (Y_hat - Y) ** 2
```

so `loss` will have the shape `[1, 1, 6, 7]` and later you will `sum` this loss.

However, `nn.MSELoss` uses `reduction='mean'` by default, which yields a lower loss value and thus also gradients with a smaller magnitude. Here the mean divides by the 6 × 7 = 42 elements of `Y`, so the gradients are 42 times smaller and training takes correspondingly longer.
If you use `loss = nn.MSELoss(reduction='sum')(Y_hat, Y)`, you will get a similar result.
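
You can verify the scaling directly; a minimal sketch with random data (not the tensors from the post):

```
import torch
import torch.nn as nn

conv = nn.Conv2d(1, 1, kernel_size=(1, 2), bias=False)
X = torch.randn(1, 1, 6, 8)
Y = torch.randn(1, 1, 6, 7)

nn.MSELoss(reduction='sum')(conv(X), Y).backward()
grad_sum = conv.weight.grad.clone()

conv.weight.grad = None
nn.MSELoss(reduction='mean')(conv(X), Y).backward()
grad_mean = conv.weight.grad

# the 'mean' gradients are Y.numel() = 42 times smaller than the 'sum' gradients
print(torch.allclose(grad_sum, grad_mean * Y.numel()))  # True
```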

Thank you very much. Now that I think about it, it is trivial, but I was not able to see it before. Your answer clears up a doubt. Thanks.

However, here is a curious question: since the `sum` reduction produces larger gradients and thus convergence can be faster, why is `mean` the default choice? We can always reduce the learning rate to keep the steps from becoming so big that we fail to find the minimum. What is your take on this? I would be curious to know.

Have a good day and thanks again.

Best

The `reduction='mean'` default is chosen because it creates gradients that do not depend on the batch size.
E.g. the `DataLoader` could return a smaller last batch if the length of the dataset is not divisible by the batch size. This would scale down the gradients for that batch if you were using `reduction='sum'`. The same applies, of course, if you manually change the batch size in some other way.
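
A minimal sketch of this effect (the model and data here are just placeholders):

```
import torch
import torch.nn as nn

model = nn.Linear(10, 1)
full_batch = torch.randn(64, 10)
last_batch = torch.randn(8, 10)  # e.g. a smaller final batch from a DataLoader

for reduction in ['sum', 'mean']:
    criterion = nn.MSELoss(reduction=reduction)
    norms = []
    for batch in [full_batch, last_batch]:
        model.zero_grad()
        criterion(model(batch), torch.zeros(batch.size(0), 1)).backward()
        norms.append(model.weight.grad.norm().item())
    # 'sum' gradient norms scale with the batch size; 'mean' norms stay comparable
    print(reduction, norms)
```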

Also, if you are using a weighted loss, `reduction='mean'` will use the applied weights to normalize the reduced loss (as described in the docs). This is not the case for `reduction='sum'` (the weighted losses are just summed up), which might increase or decrease the loss depending on which classes (and thus weights) appear in the current batch.
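
A minimal sketch of the weighted case, using `nn.CrossEntropyLoss` as an example of a weighted loss (the logits and targets are made up):

```
import torch
import torch.nn as nn

weight = torch.tensor([1.0, 10.0])   # per-class weights
logits = torch.randn(4, 2)
target = torch.tensor([0, 1, 1, 0])

per_sample = nn.CrossEntropyLoss(weight=weight, reduction='none')(logits, target)
mean_loss = nn.CrossEntropyLoss(weight=weight, reduction='mean')(logits, target)
sum_loss = nn.CrossEntropyLoss(weight=weight, reduction='sum')(logits, target)

# 'mean' normalizes by the sum of the applied weights; 'sum' just adds the weighted terms
print(torch.allclose(mean_loss, per_sample.sum() / weight[target].sum()))  # True
print(torch.allclose(sum_loss, per_sample.sum()))                          # True
```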

Thanks a lot, it is all clear now. Thank you for the clarification.

Best regards