Hi all, my questions might sound a bit naive, but I am really struggling to understand the framework.
I defined my own loss function and it works. By “it works” I mean that it does not crash or raise an error, although the loss shows no steady decrease.

The questions:
The back-propagation algorithm essentially uses the gradient of the loss function, which I do not provide. Why does the above loss function work, although only the scalar value is computed?
Does the PyTorch framework compute a numeric gradient of the loss function internally, or have I missed something and my network is doing a dummy optimisation?

I went through the various topics in the documentation, but I am still puzzled by the problem
of properly defining a loss function. I would greatly appreciate any feedback.

import torch
import torch.nn as nn
loss = nn.MSELoss() # see other loss classes in nn
input = torch.randn(3, 5, requires_grad=True)
target = torch.randn(3, 5)
output = loss(input, target)
output.backward() # <-- backprop kicks in

Here, you have defined a model architecture with a forward pass, but also added the loss function to the forward function. It is usually better to separate these two for code readability. Once you define the forward function, PyTorch (autograd) will take care of the backward pass.

A better way to do this is to make one call that computes the output of your model, and then compute the loss based on that output.
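To illustrate the separation, here is a minimal sketch. The `TinyAutoencoder` class, the `masked_l1_loss` helper, the layer sizes, and feeding the masked image as input are all assumptions made for illustration; they are not taken from the original code.

```python
import torch
import torch.nn as nn

class TinyAutoencoder(nn.Module):
    """Hypothetical stand-in model: forward() only produces the reconstruction."""
    def __init__(self, n_features=16, n_hidden=4):
        super().__init__()
        self.encoder = nn.Linear(n_features, n_hidden)
        self.decoder = nn.Linear(n_hidden, n_features)

    def forward(self, inp):
        return self.decoder(torch.relu(self.encoder(inp)))

def masked_l1_loss(x, y, w):
    # Loss lives outside the model: mean of w * |x - y|.
    return (w * (x - y).abs()).mean()

torch.manual_seed(0)
model = TinyAutoencoder()
y = torch.randn(8, 16)                  # original images (random stand-ins)
w = (torch.rand(8, 16) > 0.3).float()   # assumed binary mask, 1 = observed

x = model(y * w)                        # one call computes the model output
loss = masked_l1_loss(x, y, w)          # the loss is computed from that output
loss.backward()                         # autograd fills .grad for every parameter

has_grads = all(p.grad is not None for p in model.parameters())
print(has_grads)
```

Keeping the loss as a plain function also makes it easy to swap in a different criterion without touching the model.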

Another thing: are you sure that you want loss=w*|x-y|, and not loss=|w*x - y|?
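The two forms behave differently wherever w is not 1. A small sketch with made-up numbers, assuming w is a binary mask whose middle entry marks an unobserved pixel:

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0])   # reconstruction (made-up values)
y = torch.tensor([1.5, 4.0, 3.0])   # target (made-up values)
w = torch.tensor([1.0, 0.0, 1.0])   # assumed mask: middle pixel unobserved

outside = w * (x - y).abs()   # w * |x - y|: masked pixel contributes nothing
inside = (w * x - y).abs()    # |w * x - y|: masked pixel contributes |y|

print(outside)  # tensor([0.5000, 0.0000, 0.0000])
print(inside)   # tensor([0.5000, 4.0000, 0.0000])
```

With the weight outside the absolute value, masked positions drop out of the loss; with the weight inside, they are penalised by the magnitude of the target.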

Thank you for your reply. If I understand it correctly, as long as a loss function can be expressed as a chain of linear algebra operations, automatic differentiation can be applied (by decomposing the loss function into a graph of individual operations). I drew this assumption from the examples provided by @ehsanmok. That, possibly, motivates your question regarding the placement of the weight term. In short, my particular function follows naturally from the problem at hand.

Yes, that’s right. So, in that respect the code is correct.
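A quick way to convince yourself that the gradient is exact rather than numeric is to compare autograd's result against the hand-derived one. This is a sketch under assumptions: the `weighted_l1_loss` helper and the random binary mask `w` are illustrative, not the original code.

```python
import torch

def weighted_l1_loss(x, y, w):
    # Only the scalar value is written here; no gradient code is provided.
    return (w * (x - y).abs()).mean()

torch.manual_seed(0)
x = torch.randn(3, 5, requires_grad=True)
y = torch.randn(3, 5)
w = torch.randint(0, 2, (3, 5)).float()   # assumed binary mask

loss = weighted_l1_loss(x, y, w)
loss.backward()   # autograd replays the recorded graph: sub -> abs -> mul -> mean

# Hand-derived gradient: d/dx mean(w * |x - y|) = w * sign(x - y) / N
expected = w * torch.sign(x.detach() - y) / x.numel()
grads_match = torch.allclose(x.grad, expected)
print(grads_match)
```

Each operation in the chain has a known derivative, so the gradient is assembled analytically by the chain rule; no finite differences are involved.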

So, there are no learnable parameters here; maybe that’s why the loss does not change. What is the model trying to optimize here? Does x come from the output of another network, or is it just the input? I assume y is the target, right?

I am experimenting with autoencoders, where y is the original image and x is the reconstructed one, but some pixels are not observable (masked according to various criteria), and the mask can change from one training example to another. Thank you for the clarification about automatic differentiation.

One suggestion for debugging: can you try using the same batch of data multiple times, and see whether the error is decreasing or not? With the same batch of data, the weights w in the loss should also stay fixed. So, just pass the same batch a few times, compute the loss, call loss.backward(), and update the model via optim.step(). Then you can look at the loss and see whether it is decreasing or not.
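The single-batch check could look roughly like this. It is only a sketch: the small `nn.Sequential` model, the Adam optimizer and learning rate, the step count, and the random data and mask are all placeholders for your own setup.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Placeholder autoencoder; substitute your own model here.
model = nn.Sequential(nn.Linear(16, 4), nn.ReLU(), nn.Linear(4, 16))
optim = torch.optim.Adam(model.parameters(), lr=1e-2)

# One fixed batch: targets y and mask w never change between steps.
y = torch.randn(8, 16)
w = (torch.rand(8, 16) > 0.3).float()

losses = []
for step in range(200):
    optim.zero_grad()                   # clear gradients from the previous step
    x = model(y * w)                    # reconstruct from the masked input (assumption)
    loss = (w * (x - y).abs()).mean()   # w * |x - y|, averaged
    loss.backward()
    optim.step()
    losses.append(loss.item())

print(losses[0], losses[-1])
```

If the loss does not go down even on one repeated batch, the problem is more likely in the model or the optimizer wiring (for example, parameters not passed to the optimizer) than in the loss definition.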