Which loss function is suitable?

I am using PyTorch and am still quite new to the library.

I have a relation between my input and output given by y = ax + b, where a and b are sampled from some distribution (say Uniform), that is, they are random. I would like to train a network to predict x upon seeing y and a. I am employing a network, named probability_network, with nn.Linear layers. There are N (say 10) classes to choose from for x.

import torch
import torch.nn as nn

class ProbabilityNetwork(nn.Module):
    def __init__(self):
        super(ProbabilityNetwork, self).__init__()
        self.fc1 = nn.Linear(8, 76)
        self.fc2 = nn.Linear(76, 150)
        self.fc3 = nn.Linear(150, 75)
        self.fc4 = nn.Linear(75, 14)
        self.fc5 = nn.Linear(14, 10)
        self.tanh = nn.Tanh()
        self.sigmoid = nn.Sigmoid()
        self.relu = nn.ReLU()
        self.softmax = nn.Softmax(dim=1)

    def forward(self, inputs):
        return self.softmax(self.fc5(self.fc4(self.relu(self.fc3(self.relu(self.fc2(self.relu(self.fc1(inputs)))))))))

probability_network = ProbabilityNetwork()

Upon seeing y, the loss function should help the network predict an x that minimizes ||y-ax||^2. All the quantities in y = ax + b are vectors (each of length 4, in this example). I have already tried the following loss function.

prob_values = probability_network(torch.cat([y, a], dim=1))  # input shape: (batch_size, 8); output shape: (batch_size, 10)
x_hat = mapping_tensor[torch.argmax(prob_values, dim=1)]  # Mapping from probability to one of 10 classes, mapping_tensor is an array of shape (10, 4)

mse_loss = nn.MSELoss()
loss = mse_loss(y, a*x_hat)

As an example, the mapping_tensor could contain binary representation of values from 0 (0000) to 9 (1001). The reason I need the binary representation of the class is that I need a vector x for the loss ||y-ax||^2. In this case, x is a 4 length vector whereas the output of the neural network is a 10 length vector.

The above setup doesn’t work. Half the values in the predicted class (written out in binary) are always wrong, implying the network is confused while training.

Further, this is not an unachievable problem. A solution to the above-mentioned loss function exists (with errors, of course, but the error rate is far lower than 50%), but it is computationally intensive. I am trying to check if the network can somehow learn to predict at lower complexity. Any help is appreciated. Thanks.

Furthermore, from an optimization standpoint, the loss function is the best possible solution (that I know of). So, changing the loss function would only lead to poorer results.

Another way to look at the problem is as follows. Suppose the network sees y and a. The network then computes |y-ax| for each class x (out of 10 possible classes), and then chooses the class with least value of the computed value. And my question is, what loss function can I use to make the network train in this fashion?
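
The decision rule described in the last paragraph can be written out directly. This is only a minimal sketch of that brute-force rule, not a trainable loss: the codebook (4-bit binary representations of 0 to 9), the distributions of a and b, and the helper name min_distance_classes are assumptions made for illustration.

```python
import torch

torch.manual_seed(0)

# Hypothetical codebook: 4-bit binary representations of 0..9, shape (10, 4).
mapping_tensor = torch.tensor(
    [[(k >> bit) & 1 for bit in (3, 2, 1, 0)] for k in range(10)],
    dtype=torch.float32,
)

def min_distance_classes(y, a, codebook):
    """Return, per sample, the index of the codeword x minimizing ||y - a*x||^2."""
    # Broadcast to (batch, n_classes, n): one residual per sample and candidate.
    residual = y[:, None, :] - a[:, None, :] * codebook[None, :, :]
    sq_dist = residual.pow(2).sum(dim=-1)   # (batch, n_classes)
    return sq_dist.argmin(dim=1)            # (batch,)

# Assumed distributions, for illustration only.
batch_size = 5
true_class = torch.randint(0, 10, (batch_size,))
x = mapping_tensor[true_class]
a = torch.rand(batch_size, 4) + 0.5         # Uniform(0.5, 1.5)
b = 0.05 * torch.randn(batch_size, 4)       # small additive noise
y = a * x + b

decided = min_distance_classes(y, a, mapping_tensor)
```

Note that argmin (like the argmax in the attempted loss above) is not differentiable, so a rule of this form can serve as a baseline or to produce labels, but gradients cannot flow through it to train a network.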

Hi Learner!

I have a couple of comments and questions, in line, below:

Does this mean that you have 10 fixed values of x (say, 0.33, 0.52,
0.57, 0.68, etc.) that you know ahead of time? Or is x continuous,
but you choose to break it up into 10 categories somehow?

If the latter, I would just predict x directly, and not (artificially) impose
10 categories on the problem.

If the former, I would still predict x directly – unless you are
purposely using 10 values / categories in order to set up a toy
classification problem with 10 categories.

Softmax and MSELoss don’t really go together. Softmax would be
used for a classification (categorical) problem with a cross-entropy
loss (although you would actually leave out the Softmax and use
pytorch’s CrossEntropyLoss that has, in effect, Softmax built in).
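
As a small illustration of the point about leaving out the Softmax, the sketch below feeds raw logits to CrossEntropyLoss and compares the result with an explicit log-softmax followed by a negative log-likelihood. The tensor shapes and random values are placeholders, not data from this thread.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

logits = torch.randn(6, 10)            # raw outputs of a final Linear layer, no Softmax
targets = torch.randint(0, 10, (6,))   # integer class labels

loss = nn.CrossEntropyLoss()(logits, targets)

# Same quantity written out: log-softmax followed by negative log-likelihood.
manual = F.nll_loss(F.log_softmax(logits, dim=1), targets)
```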

MSELoss would be used for a regression-like problem (which is what
yours seems to be), but you would then predict x directly, not pass
it through a Softmax, and not categorize it.

It looks to me as if your “vectors of length 4” make up batches of
batch size nBatch = 4, but that you are building your batch size
into your network (self.fc1 = nn.Linear(8, 76)). Pytorch models
work implicitly with batches, so, with two input features (y and a), you
would want your network to start off with Linear (2, 76). You would
then pass into your network a batch of shape
[nBatch, nFeatures = 2]. nBatch could be 4, but it could be other
values, say nBatch = 100, without changing the structure of the
network.

In short, I would have your network predict a single value of x (or more
precisely, a batch of single values of x, but pytorch does that for you
automatically), predict x directly (no Softmax or other non-linearity
following the final Linear layer).

So your network would return something like
return self.fc5 (self.relu (self.fc4 ( ... ) ) ), but with
self.fc5 = nn.Linear (14, 1).

(Note that self.fc5 (self.fc4 (...) ), without an intervening non-linearity
such as self.relu() acts as just a single Linear layer, not two.)
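
The remark about two stacked Linear layers can be checked numerically. The sketch below, using the layer sizes from the question, builds one merged Linear layer whose weight and bias are the composition of fc4 and fc5; the name merged is introduced here for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

fc4 = nn.Linear(75, 14)
fc5 = nn.Linear(14, 10)

# Compose the two affine maps: fc5(fc4(h)) = (W5 W4) h + (W5 b4 + b5).
merged = nn.Linear(75, 10)
with torch.no_grad():
    merged.weight.copy_(fc5.weight @ fc4.weight)
    merged.bias.copy_(fc5.weight @ fc4.bias + fc5.bias)

h = torch.randn(3, 75)
stacked_out = fc5(fc4(h))   # two layers, no non-linearity in between
merged_out = merged(h)      # one equivalent layer
```

The two outputs agree up to floating-point rounding, which is why the extra layer adds parameters but no expressive power.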

I would then use MSELoss as the loss criterion.


K. Frank


@KFrank, thank you so much for taking time to write this answer. Really appreciate it. I would like to answer pointwise to your questions.

My xs are known beforehand. They can be, for example, the binary representations of 0 to 9 (each representation having length 4).

The reason for 10 classes is that the loss function that I use, ||y-ax||^2, is only accurate when x is the representation of one of the 10 classes.

This is exactly why I am on the hunt for another loss function that can help me train the network.

The sizes of y, a, x and b are all (batch_size, 4).

To summarize, from an optimization perspective, it can be shown that the loss function that I have mentioned in my question is optimal under this setting. Since the network has access to both a and y, I assume the knowledge of both must be utilized for training the network. In essence, since the network has access to y, a, and all the possible classes (x), I believe the decision must be based on the values of ||y-ax||^2. Thank you again!

Hi Learner!

I would say that I still don’t really understand your use case.

Your xs: Would you say conceptually that they represent classes
(categories) that, although they can be labelled or encoded with
numbers, are not really numerical, but are something like “cat”,“dog”,
“goat”, where there is not really a sense of “dog” being in between
“cat” and “goat”. Or are your xs fundamentally numerical, in the sense
that 2 is closer to 1 than it is to 4?

In the first case, you would want to understand your problem as a
multi-class (presumably ten-class) classification problem, have your
model output ten logits from its final Linear layer, and use (something
like) CrossEntropyLoss as your loss criterion.

In the second case, you would want to understand your problem as
a regression problem, have your final Linear layer output a single
predicted numerical value, and use (something like) MSELoss as
your loss criterion.

The underlying question is what, conceptually, do your xs represent?

You talk about x being a “representation of one of the 10 classes,” so
I tend to think that x, although perhaps a number or set of numbers,
encodes a non-numerical category.

Understood. 4 is not the batch size.

So it sounds to me like you have one x that is encoded somehow with
four numbers, and four randomly chosen as and bs that you use
with the four pieces of x to produce four ys. The four ys and four
as are together input to your network, so, indeed, one input sample
consists of eight numbers. (The four bs are not input to your model,
so they are in some sense “hidden” or noise.)
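
A minimal sketch of how one such batch of input samples could be generated. The codebook, the uniform ranges for a and b, and the helper name make_batch are assumptions for illustration; only y and a are concatenated into the model input, while b stays hidden.

```python
import torch

torch.manual_seed(0)

# Hypothetical codebook: 4-bit binary representations of 0..9.
codebook = torch.tensor(
    [[(k >> bit) & 1 for bit in (3, 2, 1, 0)] for k in range(10)],
    dtype=torch.float32,
)

def make_batch(codebook, batch_size):
    n = codebook.shape[1]
    labels = torch.randint(0, codebook.shape[0], (batch_size,))
    x = codebook[labels]                          # (batch, n)
    a = torch.rand(batch_size, n) + 0.5           # assumed Uniform(0.5, 1.5)
    b = 0.2 * (torch.rand(batch_size, n) - 0.5)   # assumed Uniform(-0.1, 0.1), never shown to the model
    y = a * x + b
    inputs = torch.cat([y, a], dim=1)             # (batch, 2n): eight numbers per sample
    return inputs, labels

inputs, labels = make_batch(codebook, batch_size=32)
```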

That’s fine. The question – to repeat – is what is the conceptual
meaning of x (regardless of how it is encoded), so should you be
performing a (ten-class) classification, or a single-predicted-value
regression (or something else)?


K. Frank


@KFrank, okay, let me tell you my exact use case.

I have an encoder (fixed) that maps an m-bit vector, say z, to an n-bit vector x (n = 4 in this case, n > m). This is then passed through a box that causes random distortions, namely the multiplicative distortion a and the additive distortion b. a and b are random, but the statistics of their distributions are known. The network that I am designing needs to be able to recover the original m-bit vector z upon seeing y (y = ax + b) and a. In the literature, for such a setting, it is optimal to choose the x (and hence z, because the encoder is a fixed one-to-one mapping) that minimizes the squared norm of y - ax. I would like to implement this loss function to train the network to output x (or the class of x), so that I reach the optimum performance. My implementation through a neural network may be a roundabout approach to what I want, because I am still a novice.

There can be at most 2^m classes for z and therefore x. These classes are assumed to be known to the network. Its job is then to predict the class which gives the minimum loss function value.

Thank you again for replying. Please let me know if you need more clarifications.

Hi Learner!

Yes, this clarifies things. Your “ground truth” (your known “target” for
training) is a bit vector, z, of length m (say, m = 3).

Unless z has some additional structure / meaning (that you have not
shared with us, e.g., being the binary representation of an integer), then
you should understand your problem as being a so-called multi-label,
multi-class classification problem. This means that you have m
classes, any number of which (including none or all of them) can
be “present” in any sample. That is, any of the m bits can be
independently on. (“Multi-class” means you have the m classes;
“multi-label” means that the classes are not exclusive, so that any
given sample can carry a label for multiple classes at a time.)

The most common approach to such a problem is to have your
final Linear layer output m logits (one for how strongly each of
the m bits is predicted to be “on”). Then use BCEWithLogitsLoss
to compare your prediction with your ground-truth z. The output
of your model will be of type float32 and have shape [nBatch, m].
Your ground-truth target will have the same shape and also be of
type float32 and will have value 1.0 whenever the corresponding
bit of z is set.
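
The setup described above might look roughly like the following sketch. The hidden-layer width, the batch size, and the random stand-in inputs are assumptions; the points being illustrated are the m logits with no final activation, the float32 target of the same shape, and BCEWithLogitsLoss as the criterion.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

m, n = 3, 4  # m message bits, n coded values (values assumed for illustration)

# Shallow network: 2n inputs (y and a), m logits out, no final activation.
model = nn.Sequential(
    nn.Linear(2 * n, 64),
    nn.ReLU(),
    nn.Linear(64, m),
)
criterion = nn.BCEWithLogitsLoss()

inputs = torch.randn(16, 2 * n)             # stand-in for torch.cat([y, a], dim=1)
z = torch.randint(0, 2, (16, m)).float()    # ground-truth bits as float32

logits = model(inputs)                      # shape [nBatch, m], float32
loss = criterion(logits, z)
loss.backward()

# At inference time a logit above 0 means sigmoid(logit) > 0.5, i.e. bit "on".
z_hat = (logits.detach() > 0).float()
```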

(As an aside, regardless of exactly how you choose to approach
your problem, I would start with a shallower architecture – maybe
just one or two hidden layers – and then add depth / layers if I
could show that doing so yielded better performance.)

I’m assuming that this is a toy problem (given your m to n bit
“encoding” followed by the introduction of additive and multiplicative
noise). If it is a toy problem, then carry on, and use it to practice
building neural networks.

If it represents a real problem, then I suspect that it has enough
structure that there would be a more optimal specialized algorithm
for recovering (filtering) z from y and a (not that I would know
what such an algorithm would be).

It is true that a multi-label, m-class problem can be understood as
a single-label 2^m-class problem. As a general rule, however, you
will be better off treating it as a multi-label problem (unless it has
some special structure your network can take advantage of).

The problem, of course, is the 2^m classes, as you lose information
about how the classes are “correlated,” and the potentially large
number of classes means you have fewer samples per class.
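
To make the correspondence between the two views concrete, here is a small sketch that converts between m-bit label vectors (the multi-label view) and integer class indices in 0, ..., 2^m - 1 (the single-label view). The helper names are hypothetical, and the most-significant-bit-first ordering is an assumption.

```python
import torch

m = 3
weights = 2 ** torch.arange(m - 1, -1, -1)   # place values, most significant bit first

def bits_to_index(z):
    # (batch, m) bit vectors -> (batch,) integer class indices
    return (z.long() * weights).sum(dim=-1)

def index_to_bits(idx):
    # (batch,) integer class indices -> (batch, m) float bit vectors
    return ((idx[..., None] // weights) % 2).float()

z = torch.tensor([[0., 1., 1.], [1., 0., 0.]])
idx = bits_to_index(z)
roundtrip = index_to_bits(idx)
```

With the multi-label view the network emits m logits; with the single-label view it would need 2^m logits, one per index.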


K. Frank


Hello @KFrank!

z is the binary representation of the class with m bits. As an example, for m=3 z runs from 000 to 111. So z has a particular structure in that it is the exhaustive list of combinations of 1s and 0s.

I have two questions here. One, for this to work should the activation function at the end of my network be nn.Sigmoid (I checked the documentation and found that nn.Sigmoid is added internally in the loss function)? And two, is this loss function optimal for the setting? What I mean is does it lead to a trained network, which can perform as well as ||y-ax||^2 being the loss function? I apologize if my questions do not make sense.

The relation y=ax+b can, theoretically, produce an infinite number of training samples. This is because there are infinitely many values of a and b for a fixed distribution with fixed statistics.

Thank you so much Frank, for taking time to make me understand the working of the loss functions. I feel I am really close to the correct architecture for this problem. Appreciate it.

Hi Learner!

Then (barring other noteworthy details) I would understand this as a
multi-label, three-class classification problem.

To illustrate what I meant by “additional structure”, consider the mapping
between a set of three bits and the integers 0, …, 7, where, e.g., “011” is
the binary representation of the integer 3.

Let’s say your ground-truth z is “011”, while your predicted z is “100”.
Would you consider this a maximally bad prediction because all three
of the bits were predicted incorrectly? Or would you consider it a pretty
good (but not perfect) prediction because the ground-truth z of “011”
corresponds to the integer 3, while the predicted z, “100”, corresponds
to 4, which is only off by 1?

I’m presuming the former, so viewing this as a multi-label classification
problem (rather than, say, a regression) is appropriate.

BCEWithLogitsLoss is the “go-to” loss function for the multi-label problem.
(There are other reasonable loss functions, but I would use this unless I
had good reason to think that another loss function would be better, and,
even then, I would want to demonstrate that the alternative loss function
did, in fact, perform better.)

If you use BCEWithLogitsLoss, then, no, you do not want a Sigmoid
for the output of your network.

You do not want the Sigmoid because, as you note, the Sigmoid is, in
effect, included in BCEWithLogitsLoss. (The numerically-less-stable
BCELoss does not have the Sigmoid built in.)
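
A quick numerical check of this relationship, on placeholder random tensors: BCEWithLogitsLoss applied to raw logits gives the same value as BCELoss applied to sigmoid(logits), so adding a Sigmoid before BCEWithLogitsLoss would apply it twice.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

logits = torch.randn(8, 3)                  # raw model outputs, no Sigmoid
z = torch.randint(0, 2, (8, 3)).float()     # float32 bit targets

with_logits = nn.BCEWithLogitsLoss()(logits, z)
two_step = nn.BCELoss()(torch.sigmoid(logits), z)
```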

If my understanding of your problem as being a multi-label, multi-class
classification problem is correct, then, yes, BCEWithLogitsLoss is
reasonably likely to be your “optimal” loss function.

I’m certainly not able to write a mathematical proof showing that one loss
function or another will be better – the whole business of training realistic
neural networks is sufficiently complicated that such choices are generally
made based on rules of thumb and empirical results.

You can certainly try using both loss functions to train your network and
see which works better. But don’t not try BCEWithLogitsLoss, at least
as a baseline.

The beauty of this is that you don’t have to worry about overfitting – just
keep training until your model converges to an acceptable accuracy (or
reaches a plateau that could represent the best it can do).
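
A rough sketch of what training on freshly generated samples could look like. Everything specific here is an assumption for illustration: the stand-in encoder (three bits plus a parity bit), the distributions of a and b, the network width, the optimizer settings, and the number of steps. The point is only that each step draws a new batch, so no sample is ever reused.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
m, n = 3, 4

def encode(z):
    # Hypothetical stand-in encoder: append one parity bit (m=3 -> n=4).
    parity = z.sum(dim=1, keepdim=True) % 2
    return torch.cat([z, parity], dim=1)

def fresh_batch(batch_size):
    z = torch.randint(0, 2, (batch_size, m)).float()
    x = encode(z)
    a = torch.rand(batch_size, n) + 0.5      # assumed Uniform(0.5, 1.5)
    b = 0.1 * torch.randn(batch_size, n)     # assumed Gaussian noise
    return torch.cat([a * x + b, a], dim=1), z

model = nn.Sequential(nn.Linear(2 * n, 64), nn.ReLU(), nn.Linear(64, m))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
criterion = nn.BCEWithLogitsLoss()

losses = []
for step in range(300):
    inputs, z = fresh_batch(128)             # new samples every step
    loss = criterion(model(inputs), z)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

# Evaluate per-bit accuracy on another fresh batch.
with torch.no_grad():
    inputs, z = fresh_batch(2000)
    bit_accuracy = ((model(inputs) > 0).float() == z).float().mean().item()
```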

And again, you might want to experiment with the architecture of your
network, probably including some shallower (fewer layers) networks.


K. Frank


@KFrank, thank you for all your comments. I certainly learnt a lot from the discussion. I have been able to make it work, but the performance could be improved upon. I think this would come in the form of hyperparameter tuning.

I really appreciate your effort in all this. It was not just a mere reply, rather a detailed and wonderful explanation of how my problem should be looked at. I am sure others who run into such problems would find your replies helpful. That said, I am able to do what I set out to do, so I am going to mark this thread as solved.

Thank you so much!