# Which loss function is suitable?

I am using PyTorch and am still quite new to the library.

I have a relation between my input and output given by `y = ax + b`, where `a` and `b` are sampled from some distribution (say Uniform), that is, they are random. I would like to train a network to predict `x` upon seeing `y` and `a`. I am employing a network, named `probability_network`, with `nn.Linear` layers. There are `N` (say 10) classes to choose from for `x`.

```python
import torch
import torch.nn as nn


class ProbabilityNetwork(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(8, 76)
        self.fc2 = nn.Linear(76, 150)
        self.fc3 = nn.Linear(150, 75)
        self.fc4 = nn.Linear(75, 14)
        self.fc5 = nn.Linear(14, 10)
        self.tanh = nn.Tanh()
        self.sigmoid = nn.Sigmoid()
        self.relu = nn.ReLU()
        self.softmax = nn.Softmax(dim=1)

    def forward(self, inputs):
        # ReLU after fc1, fc2 and fc3; fc4 feeds fc5 directly; softmax over the 10 classes
        return self.softmax(self.fc5(self.fc4(self.relu(self.fc3(self.relu(self.fc2(self.relu(self.fc1(inputs)))))))))


probability_network = ProbabilityNetwork()
```

Upon seeing `y`, the loss function should help the network predict an `x` that minimizes `||y-ax||^2`. All the quantities in `y = ax + b` are vectors (each of length 4, in this example). I have already tried the following loss function.

```python
# y and a each have shape (batch_size, 4); concatenate along the feature dimension
prob_values = probability_network(torch.cat([y, a], dim=1))  # shape: (batch_size, 10)
# Map the most probable class to its vector; mapping_tensor has shape (10, 4)
x_hat = mapping_tensor[torch.argmax(prob_values, dim=1)]

mse_loss = nn.MSELoss()
loss = mse_loss(y, a * x_hat)
```

As an example, the `mapping_tensor` could contain the binary representations of the values from `0` (`0000`) to `9` (`1001`). The reason I need the binary representation of the class is that I need a vector `x` for the loss `||y-ax||^2`. In this case, `x` is a vector of length `4`, whereas the output of the neural network is a vector of length `10`.
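For concreteness, here is a minimal sketch of how such a `mapping_tensor` could be built. The helper name `make_mapping_tensor` and the most-significant-bit-first ordering are assumptions for illustration; the original post does not show how the tensor was constructed.

```python
import torch


def make_mapping_tensor(num_classes: int = 10, num_bits: int = 4) -> torch.Tensor:
    """Return a (num_classes, num_bits) float tensor of binary codes, MSB first."""
    values = torch.arange(num_classes)                  # 0, 1, ..., num_classes - 1
    powers = 2 ** torch.arange(num_bits - 1, -1, -1)    # [8, 4, 2, 1] for 4 bits
    # A bit is set when the bitwise AND with its power of two is non-zero.
    bits = (values.unsqueeze(1) & powers) != 0          # bool, shape (num_classes, num_bits)
    return bits.float()


mapping_tensor = make_mapping_tensor()
print(mapping_tensor.shape)   # torch.Size([10, 4])
print(mapping_tensor[9])      # tensor([1., 0., 0., 1.]), i.e. binary 1001
```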

The above setup doesn’t work. Half the values in the predicted class (written out in binary) are always wrong, implying the network is confused while training.
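One point that is not raised in the post but may be worth checking in a setup like this: `torch.argmax` returns integer indices, so a tensor obtained by indexing `mapping_tensor` with them carries no gradient back to the network. The sketch below uses a stand-in single-layer model, a placeholder `mapping_tensor` and random data (all hypothetical) to show that the resulting loss is detached from the model parameters.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

model = nn.Sequential(nn.Linear(8, 10), nn.Softmax(dim=1))  # stand-in for the real network
mapping_tensor = torch.randint(0, 2, (10, 4)).float()       # placeholder class-to-vector table

y = torch.rand(5, 4)
a = torch.rand(5, 4)

prob_values = model(torch.cat([y, a], dim=1))               # (5, 10), tracks gradients
x_hat = mapping_tensor[torch.argmax(prob_values, dim=1)]    # (5, 4), built from integer indices
loss = nn.MSELoss()(a * x_hat, y)

print(prob_values.requires_grad)  # True
print(loss.requires_grad)         # False: backward() cannot reach the model weights
```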

Further, this is not an unsolvable problem. A solution that minimizes the above-mentioned loss function exists (with errors, of course, but the error rate is far lower than 50%), but it is computationally intensive. I am trying to check whether the network can somehow learn to predict at lower complexity. Any help is appreciated. Thanks.

Furthermore, from an optimization standpoint, the loss function is the best possible solution (that I know of). So, changing the loss function would only lead to poorer results.

Another way to look at the problem is as follows. Suppose the network sees `y` and `a`. The network then computes `|y-ax|` for each class `x` (out of 10 possible classes) and chooses the class with the smallest computed value. My question is: what loss function can I use to make the network train in this fashion?
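The exhaustive rule described above (evaluate the distance for every candidate `x` and keep the smallest) can be sketched as follows. The function name `min_distance_decode` and the toy candidate set are assumptions for illustration; the example uses the noiseless case `b = 0` so that the correct class is recovered exactly.

```python
import torch


def min_distance_decode(y, a, candidates):
    """Pick, per sample, the candidate x minimising ||y - a * x||^2.

    y, a:        (batch_size, n)
    candidates:  (num_classes, n)
    returns:     (batch_size,) class indices
    """
    # Broadcast to (batch_size, num_classes, n).
    residual = y.unsqueeze(1) - a.unsqueeze(1) * candidates.unsqueeze(0)
    distances = residual.pow(2).sum(dim=2)      # (batch_size, num_classes)
    return distances.argmin(dim=1)


torch.manual_seed(0)
candidates = torch.tensor([[0., 0., 0., 0.],
                           [0., 0., 1., 1.],
                           [1., 1., 0., 0.],
                           [1., 1., 1., 1.]])   # toy set of 4 classes
true_class = torch.tensor([2, 0, 3])
a = torch.rand(3, 4) + 0.5                      # multiplicative distortion in [0.5, 1.5)
y = a * candidates[true_class]                  # noiseless case (b = 0)
print(min_distance_decode(y, a, candidates))    # tensor([2, 0, 3])
```

The cost grows linearly with the number of classes, which is the complexity the question hopes a trained network can avoid.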

Hi Learner!

I have a couple of comments and questions, in line, below:

Does this mean that you have 10 fixed values of `x` (say, 0.33, 0.52,
0.57, 0.68, etc.) that you know ahead of time? Or is `x` continuous,
but you choose to break it up into 10 categories somehow?

If the latter, I would just predict `x` directly, and not (artificially) impose
10 categories on the problem.

If the former, I would still predict `x` directly – unless you are
purposely using 10 values / categories in order to set up a toy
classification problem with 10 categories.

`Softmax` and `MSELoss` don’t really go together. `Softmax` would be
used for a classification (categorical) problem with a cross-entropy
loss (although you would actually leave out the `Softmax` and use
PyTorch’s `CrossEntropyLoss` that has, in effect, `Softmax` built in).
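A short check of the "built in" remark, using random logits and labels as placeholder data: `CrossEntropyLoss` applied to raw logits gives the same value as taking `log_softmax` first and then the negative log-likelihood loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(6, 10)             # raw outputs of the final Linear layer
targets = torch.randint(0, 10, (6,))    # integer class labels

loss_ce = nn.CrossEntropyLoss()(logits, targets)
loss_manual = F.nll_loss(F.log_softmax(logits, dim=1), targets)

print(torch.allclose(loss_ce, loss_manual))  # True
```

This is why a model trained with `CrossEntropyLoss` should return logits rather than `Softmax` outputs.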

`MSELoss` would be used for a regression-like problem (which is what
yours seems to be), but you would then predict `x` directly, not pass
it through a `Softmax`, and not categorize it.

It looks to me as if your “vectors of length 4” make up batches of
batch size `nBatch = 4`, but that you are building your batch size
into your network (`self.fc1 = nn.Linear(8, 76)`). Pytorch models
work implicitly with batches, so, with two input features (`y` and `a`), you
would want your network to start off with `Linear (2, 76)`. You would
then pass into your network a batch of shape
`[nBatch, nFeatures = 2]`. `nBatch` could be `4`, but it could be other
values, say `nBatch = 100`, without changing the structure of the
network.
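A small illustration of the point about batches: the same `Linear(2, 76)` layer accepts any batch size, because only the last dimension has to match the number of input features. The random inputs are placeholders.

```python
import torch
import torch.nn as nn

layer = nn.Linear(2, 76)            # 2 input features, as in the paragraph above

small_batch = torch.rand(4, 2)      # nBatch = 4
large_batch = torch.rand(100, 2)    # nBatch = 100

print(layer(small_batch).shape)     # torch.Size([4, 76])
print(layer(large_batch).shape)     # torch.Size([100, 76])
```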

In short, I would have your network predict a single value of `x` (or more
precisely, a batch of single values of `x`, but PyTorch does that for you
automatically) and predict it directly (no `Softmax` or other non-linearity
following the final `Linear` layer).

So your network would return something like
`return self.fc5 (self.relu (self.fc4 ( ... ) ) )`, but with
`self.fc5 = nn.Linear (14, 1)`.

(Note that `self.fc5 (self.fc4 (...) )`, without an intervening non-linearity
such as `self.relu()` acts as just a single `Linear` layer, not two.)
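The note about stacked `Linear` layers can be verified numerically. In this sketch the two layers are folded into a single one with weight `W5 @ W4` and bias `W5 @ b4 + b5`; the layer sizes follow the network in the question, and the input is random placeholder data.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
fc4 = nn.Linear(75, 14)
fc5 = nn.Linear(14, 10)

# Fold the two layers into one: W = W5 @ W4, b = W5 @ b4 + b5.
merged = nn.Linear(75, 10)
with torch.no_grad():
    merged.weight.copy_(fc5.weight @ fc4.weight)
    merged.bias.copy_(fc5.weight @ fc4.bias + fc5.bias)

x = torch.randn(8, 75)
print(torch.allclose(fc5(fc4(x)), merged(x), atol=1e-5))  # True
```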

I would then use `MSELoss` as the loss criterion.

Best.

K. Frank

@KFrank, thank you so much for taking the time to write this answer. I really appreciate it. I would like to respond to your questions point by point.

My `x`s are known beforehand. They can be, for example, the binary representations of 0 to 9 (each representation having length 4).

The reason for 10 classes is that the loss function that I use, `||y-ax||^2`, is only accurate when `x` is the representation of one of the 10 classes.

This is exactly why I am on the hunt for another loss function that can help me train the network.

The sizes of `y`, `a`, `x` and `b` are all `(batch_size, 4)`.

To summarize, from an optimization perspective, it can be shown that the loss function I mentioned in my question is optimal under this setting. Since the network has access to both `a` and `y`, I assume that the knowledge of both must be utilized for training the network. In essence, since the network has access to `y`, `a` and all the possible classes (`x`), I believe the decision must be based on the values of `||y-ax||^2`. Thank you again!

Hi Learner!

I would say that I still don’t really understand your use case.

Your `x`s: Would you say conceptually that they represent classes
(categories) that, although they can be labelled or encoded with
numbers, are not really numerical, but are something like “cat”, “dog”,
“goat”, where there is not really a sense of “dog” being in between
“cat” and “goat”. Or are your `x`s fundamentally numerical, in the sense
that `2` is closer to `1` than it is to `4`?

In the first case, you would want to understand your problem as a
multi-class (presumably ten-class) classification problem, have your
model output ten logits from its final `Linear` layer, and use (something
like) `CrossEntropyLoss` as your loss criterion.

In the second case, you would want to understand your problem as
a regression problem, have your final `Linear` layer output a single
predicted numerical value, and use (something like) `MSELoss` as
your loss criterion.
The underlying question is what, conceptually, do your `x`s represent?

You talk about `x` being a “representation of one of the 10 classes,” so
I tend to think that `x`, although perhaps a number or set of numbers,
encodes a non-numerical category.

Understood. 4 is not the batch size.

So it sounds to me like you have one `x` that is encoded somehow with
four numbers, and four randomly chosen `a`s and `b`s that you use
with the four pieces of `x` to produce four `y`s. The four `y`s and four
`a`s are together input to your network, so, indeed, one input sample
consists of eight numbers. (The four `b`s are not input to your model,
so they are in some sense “hidden” or noise.)

That’s fine. The question – to repeat – is what is the conceptual
meaning of `x` (regardless of how it is encoded), so should you be
performing a (ten-class) classification, or a single-predicted-value
regression (or something else)?

Best.

K. Frank

@KFrank, okay, let me tell you my exact use case.

I have a (fixed) encoder that maps an `m`-bit vector, say `z`, to an `n`-bit vector `x` (`n = 4` in this case, with `n > m`). This is then passed through a box that causes random distortions, namely the multiplicative distortion `a` and the additive distortion `b`. `a` and `b` are random, but the statistics of their distributions are known. The network that I am designing needs to be able to recover the original `m`-bit vector (`z`) upon seeing `y` (`y=ax+b`) and `a`. In the literature, for such a setting, it is optimal to choose the `x` (and hence `z`, because there is a fixed one-to-one mapping through the encoder) that minimizes the squared norm of `y-ax`. I would like to implement this loss function to train the network to output `x` (or the class of `x`), so that I reach the optimum performance. My implementation through the neural network may be a roundabout approach to achieving what I want, because I am still a novice.

There can be at most `2^m` classes for `z` and therefore `x`. These classes are assumed to be known to the network. Its job is then to predict the class which gives the minimum loss function value.

Thank you again for replying. Please let me know if you need more clarifications.

Hi Learner!

So what you want to predict (and what you use as ground truth for
training) is a bit vector, `z`, of length `m` (say, `m = 3`).

Unless `z` has some additional structure / meaning (that you have not
shared with us, e.g., being the binary representation of an integer), then
you should understand your problem as being a so-called multi-label,
multi-class classification problem. This means that you have `m`
classes, any number of which (including none or all of them) can
be “present” in any sample. That is, any of the `m` bits can be
independently on. (“Multi-class” means you have the `m` classes;
“multi-label” means that the classes are not exclusive, so that any
given sample can carry a label for multiple classes at a time.)

The most common approach to such a problem is to have your
final `Linear` layer output `m` logits (one for how strongly each of
the `m` bits is predicted to be “on”). Then use `BCEWithLogitsLoss`
to compare your prediction with your ground-truth `z`. The output
of your model will be of type `float32` and have shape `[nBatch, m]`.
Your ground-truth `target` will have the same shape and also be of
type `float32` and will have value `1.0` whenever the corresponding
bit of `z` is set.
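A minimal sketch of one training step in this multi-label setup. The small two-layer model, the Adam optimizer, the learning rate and the random placeholder data are all assumptions for illustration; only the `m` output logits, the `float32` target of shape `[nBatch, m]` and `BCEWithLogitsLoss` come from the description above.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
m, n, n_batch = 3, 4, 32

# Stand-in network: 2 * n inputs (y and a), m output logits, no final Sigmoid.
model = nn.Sequential(nn.Linear(2 * n, 64), nn.ReLU(), nn.Linear(64, m))
criterion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Placeholder data; in practice y, a and z come from the encoder and the distortion.
y = torch.rand(n_batch, n)
a = torch.rand(n_batch, n)
z = torch.randint(0, 2, (n_batch, m)).float()   # ground-truth bits as float32

logits = model(torch.cat([y, a], dim=1))        # shape [nBatch, m]
loss = criterion(logits, z)

optimizer.zero_grad()
loss.backward()
optimizer.step()

z_hat = (logits > 0).float()                    # logit > 0 is the same as sigmoid(logit) > 0.5
print(logits.shape, z_hat.shape)
```

At inference time each bit is decided independently by thresholding its logit at zero.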

(As an aside, regardless of exactly how you choose to approach
this problem, I would start with a simpler network – say, with
just one or two hidden layers – and then add depth / layers if I
could show that doing so yielded better performance.)

I’m assuming that this is a toy problem (given your `m` to `n` bit
“encoding” followed by the introduction of additive and multiplicative
noise). If it is a toy problem, then carry on, and use it to practice
building neural networks.

If it represents a real problem, then I suspect that it has enough
structure that there would be a more optimal specialized algorithm
for recovering (filtering) `z` from `y` and `a` (not that I would know
what such an algorithm would be).

It is true that a multi-label, `m`-class problem can be understood as
a single-label `2^m`-class problem. As a general rule, however, you
will be better off treating it as a multi-label problem (unless it has
some additional structure).
The problem, of course, is the `2^m` classes, as you lose information
about how the classes are “correlated,” and the potentially large
number of classes means you have fewer samples per class.
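To make the correspondence between the two views concrete, this sketch converts `m`-bit multi-label targets to single-label indices in `[0, 2^m)` and back. The helper names and the most-significant-bit-first convention are assumptions for illustration.

```python
import torch


def bits_to_index(bits):
    """Map (batch, m) 0/1 labels to (batch,) class indices in [0, 2**m), MSB first."""
    m = bits.shape[1]
    weights = 2 ** torch.arange(m - 1, -1, -1)      # e.g. [4, 2, 1] for m = 3
    return (bits.long() * weights).sum(dim=1)


def index_to_bits(index, m):
    """Inverse of bits_to_index: (batch,) indices to (batch, m) float 0/1 labels."""
    weights = 2 ** torch.arange(m - 1, -1, -1)
    return ((index.unsqueeze(1) & weights) != 0).float()


z = torch.tensor([[0., 1., 1.],
                  [1., 0., 0.]])
idx = bits_to_index(z)
print(idx)                      # tensor([3, 4])
print(index_to_bits(idx, 3))    # same rows as z
```

The index form would be the target for `CrossEntropyLoss` with `2^m` logits, and the bit form the target for `BCEWithLogitsLoss` with `m` logits.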

Best.

K. Frank

Hello @KFrank!

`z` is the binary representation of the class with `m` bits. As an example, for `m=3` `z` runs from `000` to `111`. So `z` has a particular structure in that it is the exhaustive list of combinations of `1`s and `0`s.

I have two questions here. One, for this to work should the activation function at the end of my network be `nn.Sigmoid` (I checked the documentation and found that `nn.Sigmoid` is added internally in the loss function)? And two, is this loss function optimal for the setting? What I mean is does it lead to a trained network, which can perform as well as `||y-ax||^2` being the loss function? I apologize if my questions do not make sense.

The relation `y=ax+b` can, in theory, produce an infinite number of training samples, because `a` and `b` can take infinitely many values under a fixed distribution with fixed statistics.

Thank you so much Frank, for taking time to make me understand the working of the loss functions. I feel I am really close to the correct architecture for this problem. Appreciate it.

Hi Learner!

Then (barring other noteworthy details) I would understand this as a
multi-label, three-class classification problem.

To illustrate what I meant by “additional structure”, consider the mapping
between a set of three bits and the integers 0, …, 7, where, e.g., “011” is
the binary representation of the integer 3.

Let’s say your ground-truth `z` is “011”, while your predicted `z` is “100”.
Would you consider this a maximally bad prediction because all three
of the bits were predicted incorrectly? Or would you consider it a pretty
good (but not perfect) prediction because the ground-truth `z` of “011”
corresponds to the integer 3, while the predicted `z`, “100”, corresponds
to 4, which is only off by 1?

I’m presuming the former, so viewing this as a multi-label classification
problem (rather than, say, a regression) is appropriate.

`BCEWithLogitsLoss` is the “go-to” loss function for the multi-label problem.
(There are other reasonable loss functions, but I would use this unless I
had good reason to think that another loss function would be better, and,
even then, I would want to demonstrate that the alternative loss function
did, in fact, perform better.)

If you use `BCEWithLogitsLoss`, then, no, you do not want a `Sigmoid`
for the output of your network.

You do not want the `Sigmoid` because, as you note, the `Sigmoid` is, in
effect, included in `BCEWithLogitsLoss`. (The numerically-less-stable
`BCELoss` does not have the `Sigmoid` built in.)
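A quick numerical check of this relationship, on random placeholder logits and targets: `BCEWithLogitsLoss` on raw logits matches `BCELoss` applied after an explicit `sigmoid`.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
logits = torch.randn(5, 3)                        # raw network outputs, no Sigmoid
target = torch.randint(0, 2, (5, 3)).float()      # 0.0 / 1.0 labels

with_logits = nn.BCEWithLogitsLoss()(logits, target)
with_sigmoid = nn.BCELoss()(torch.sigmoid(logits), target)

print(torch.allclose(with_logits, with_sigmoid, atol=1e-6))  # True
```

The two agree for moderate logits like these; the fused version is generally preferred because it stays stable when logits become large in magnitude.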

If my understanding of your problem as being a multi-label, multi-class
classification problem is correct, then, yes, `BCEWithLogitsLoss` is
reasonably likely to be your “optimal” loss function.

I’m certainly not able to write a mathematical proof showing that one loss
function or another will be better – the whole business of training realistic
neural networks is sufficiently complicated that such choices are generally
made based on rules of thumb and empirical results.

You can certainly try using both loss functions to train your network and
see which works better. But don’t not try `BCEWithLogitsLoss`, at least
as a baseline.

The beauty of this is that you don’t have to worry about overfitting – just
keep training until your model converges to an acceptable accuracy (or
reaches a plateau that could represent the best it can do).
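A sketch of training on freshly generated samples at every step, so that no sample is ever reused. Everything specific here is an assumption for illustration: the single-parity-bit encoder stands in for the fixed `m`-to-`n` encoder, the Uniform ranges for `a` and `b` are placeholders, and the model size, optimizer and step count are arbitrary.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
m, n = 3, 4


def fresh_batch(batch_size):
    """Draw new random z, encode to x (hypothetical parity code), return ([y, a], z)."""
    z = torch.randint(0, 2, (batch_size, m)).float()
    parity = z.sum(dim=1, keepdim=True) % 2           # extra bit, so n = m + 1 = 4
    x = torch.cat([z, parity], dim=1)                 # (batch_size, n)
    a = torch.rand(batch_size, n) + 0.5               # a ~ Uniform(0.5, 1.5), placeholder
    b = (torch.rand(batch_size, n) - 0.5) * 0.2       # b ~ Uniform(-0.1, 0.1), placeholder
    return torch.cat([a * x + b, a], dim=1), z


model = nn.Sequential(nn.Linear(2 * n, 32), nn.ReLU(), nn.Linear(32, m))
criterion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)

for step in range(200):
    inputs, z = fresh_batch(64)                       # new samples every step
    loss = criterion(model(inputs), z)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

with torch.no_grad():
    inputs, z = fresh_batch(1000)
    bit_accuracy = ((model(inputs) > 0).float() == z).float().mean().item()
print(f"bit accuracy on fresh data: {bit_accuracy:.3f}")
```

Because evaluation data is also drawn fresh, the reported accuracy is measured on samples the model has never seen.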

And again, you might want to experiment with the architecture of your
network, probably including some shallower (fewer layers) networks.

Best.

K. Frank

@KFrank, thank you for all your comments. I certainly learnt a lot from the discussion. I have been able to make it work, but the performance could be improved upon. I think this would come in the form of hyperparameter tuning.

I really appreciate your effort in all this. It was not just a mere reply, but a detailed and wonderful explanation of how my problem should be looked at. I am sure others who run into such problems will find your replies helpful. That said, I am able to do what I set out to do, so I am going to mark this thread as solved.

Thank you so much!