I’m trying to figure out what the activation function should be for a continuous-distribution output (probability-density values per specific value bucket).
I have several candidate approaches, yet each of them produces a completely different outcome.
First of all, I’m experimenting on freshly initialized layers, and the raw output has both positive and negative values (Fig 1).
Applying a plain Softmax strongly favors one value over the others and suppresses the rest almost completely (Fig 2).
Applying Softmax twice makes the normalization more evenly distributed (which surprised me a lot, as I suspected it must be idempotent) (Fig 3).
Applying Sigmoid (which normalizes values nicely to the [0, 1] range) followed by Softmax makes the distribution more even, but at the same time hides the winning peak (Fig 4).
Another option, proposed by GPT, is to add a constant shift to the values before the Softmax to solve the negative-values issue. Still, it is not clear how big this shift should be: 0.1, 0.01, or 1.0? Theoretically the negative range is (-inf, 0], so how much should we shift then?
So all of these approaches really do lead to very different results. Which approach fits the task best?
Or maybe it doesn’t really matter much which approach to choose, and the right logic would eventually be learned during training regardless of which activation is chosen initially?
The short answer: Don’t use any activation (at the end of your model).
As I understand it, you want your model to predict a discrete probability distribution in that
you have some number of bins (which in your case seems to be fifteen) and you want to
predict a set of fifteen probabilities, one for each bin, that sums to one.
(Perhaps the fifteen probabilities you predict form a fifteen-bin histogram derived from some
continuous probability density function, but your model is predicting the histogram rather
than the underlying probability density function.)
In such a case you would want the last layer of your model to be a Linear with out_features = 15 (followed by no activation layer). Feed the output of your model into CrossEntropyLoss running in the mode where the target contains “probabilities for each
class.” That is, your target would be a batch (with perhaps a batch size of one) of vectors
of fifteen floating-point numbers that sum to one (and are between zero and one). The output
of your model (which is the input to CrossEntropyLoss) consists of unnormalized log-probabilities
that can run from -inf to inf. These log-probabilities are converted, in effect, to probabilities
inside of CrossEntropyLoss, which passes them, in effect, through a softmax layer.
(Of course, you will use typical non-linear activation layers between internal layers of your
model as you would with more or less any model.)
Actually, I was pretty surprised to get a completely new option in addition to the ones listed above :)
Yes, it looks like you follow the underlying task correctly. Although I really mis-phrased it, calling the discrete probability distribution a continuous one (which it should represent, coming from the original environment).
Do I understand you right? Did you mean exactly what you wrote:
just to pass the raw output-layer values directly to CrossEntropyLoss?
Or did you mean instead not to pass anything before the Softmax?
Could you expand a little on option 1? Do you mean the network should just learn the logic through training and, out of the whole available (-inf, inf) value space, use strictly the [0, 1] sub-range? Wouldn’t it be inefficient in that case to “waste” the float32 weights on such a tiny sub-range?
My assumption was that the network should be able to use the whole wide (-inf, inf) range, and we just scale it to [0, 1] (e.g., using Sigmoid) to fit the distribution boundaries.
These log-probabilities are converted, in effect, to probabilities
inside of CrossEntropyLoss
Do you mean by this that CrossEntropyLoss applies the Softmax on its own under the hood?
Here I’ve found an option voting for Softmax + CrossEntropy: https://youtu.be/Fv98vtitmiA?t=162 . Still, I can’t understand whether Softmax really handles negatives well, or whether they should be preprocessed anyway.
So could you please explain a bit more about WHY we should follow your approach, not only HOW?
And the follow-up question: if we don’t apply any activation normalizing the (-inf, inf) value space, how should one actually treat the raw outputs during the inference phase, since they don’t look like probabilities at all?
No. What happens here is that CrossEntropyLoss expects inputs that are unnormalized
log-probabilities that naturally range over (-inf, inf) so your model will get trained to predict
values in this range.
No, because the range expected for the input to CrossEntropyLoss is (-inf, inf).
Yes. More precisely, CrossEntropyLoss applies some version of log_softmax() internally.
I haven’t looked through the video, but based on what you’ve said, it’s wrong.
You can train with Linear followed by Softmax feeding into CrossEntropyLoss, and it will sort of
train, but it won’t train well (because you’re not feeding CrossEntropyLoss what it expects).
I’m not entirely surprised that you found this in a video because this error is not uncommon
and we see it from time to time in this forum.
Softmax, by design, handles negatives. It converts (unnormalized) log-probabilities (in (-inf, inf)) to probabilities (in (0.0, 1.0), but will “saturate” to [0.0, 1.0]).
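A quick demonstration that softmax() needs no preprocessing of negatives, and that the constant-shift idea mentioned earlier in the thread is a no-op (softmax is invariant under adding a constant to all inputs):

```python
import torch

# Softmax maps any real-valued vector, negatives included, to probabilities.
x = torch.tensor([-3.0, -1.0, 0.0, 2.0])
p = torch.softmax(x, dim=0)

print(p)        # all entries in (0, 1)
print(p.sum())  # sums to 1 -- no shifting or pre-normalization needed

# Shifting all inputs by any constant leaves the softmax output unchanged,
# so choosing a shift of 0.1 vs. 1.0 vs. 100.0 makes no difference:
p_shifted = torch.softmax(x + 100.0, dim=0)
print(torch.allclose(p, p_shifted))  # True
```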
You should do what I say because otherwise grinch-like gremlins will infect your computer
and delete all your recipes for Christmas cookies and then where will you be?
In a bit more detail, it’s often preferable to work with probabilities in “log-space,” that is to
work with log-probabilities rather than probabilities.
There are a number of reasons for this, but a big one is numerical stability. It can be hard to
represent a probability close to zero numerically – at some point the floating-point value
will underflow to zero – but using log-probabilities, you can represent probabilities that are exponentially close to zero. Similarly – because of finite precision – it’s even harder to
represent a probability close to one – the floating-point number “saturates” to one – but
again, in log-space you can get exponentially close to one.
Additionally, neural networks with gradient descent don’t naturally predict values that
satisfy constraints (such as being in the range [0.0, 1.0]). There are various ways to
enforce such constraints, but they have various issues. If you can do it, you are better
off predicting unconstrained values and then (differentiably) mapping them to your
desired constrained range.
This is what you would be doing with (the numerically-less-stable) Softmax or what CrossEntropyLoss does for you (with its numerically-more-stable internal log_softmax()).
As an aside, all that log_softmax() does is convert unnormalized log-probabilities to normalized log-probabilities. It is idempotent – that is, if you apply it twice you get the
same result as if you apply it only once. softmax() converts unnormalized log-probabilities
to (normalized) probabilities (and is not idempotent).
Take some input (a random vector, say). Apply log_softmax() to it. (Once is enough.) Now
you have a set of normalized log-probabilities. Apply softmax(). Now you have (normalized)
probabilities. Now apply log(). You will get back – up to some round-off error (or perhaps
underflow to -inf) – your normalized log-probabilities.
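The round-trip just described, as runnable code:

```python
import torch

x = torch.randn(15)                  # unnormalized log-probabilities

log_p = torch.log_softmax(x, dim=0)  # normalized log-probabilities
log_p_twice = torch.log_softmax(log_p, dim=0)
print(torch.allclose(log_p, log_p_twice))  # True -- log_softmax() is idempotent

p = torch.softmax(x, dim=0)          # (normalized) probabilities
round_trip = torch.log(p)            # back to normalized log-probabilities
print(torch.allclose(log_p, round_trip, atol=1e-5))  # True, up to round-off

# softmax() is *not* idempotent:
print(torch.allclose(p, torch.softmax(p, dim=0)))  # False in general
```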
Use your model to infer (“predict”) the unnormalized log-probabilities. What you do next
depends on what you want to do with those predictions. If your use case lets you work in
log-space and lets you use unnormalized log-probabilities, use your “raw” predictions in
their unnormalized-log-probability form. If you need them to be normalized, but can work in
log-space with its better numerical behavior, use log_softmax to convert your predicted
unnormalized log-probabilities to normalized log-probabilities. If you need (or want for reasons
of intuition about probabilities) actual probabilities, use softmax() to convert your predictions
to probabilities (but train with (unnormalized) log-probabilities fed into CrossEntropyLoss).
Now passing the raw logits into CrossEntropyLoss looks really logical.
After all, I’ve just found the exact statement in the docs for it:
The input is expected to contain the unnormalized logits for each class (which do not need to be positive or sum to 1, in general)
Still, on the other hand, I’m confused about which target values to pass then.
I don’t need an exact classification output predicting only one of the available classes. Instead I need a pure distribution representation.
On the other hand, in the training data I don’t have the distribution itself, only one particular specific outcome per input, which could just as well be any other value from the support set in another example with the very same input.
So the intended approach for the target is to take whatever distribution is predicted by the network and just re-accumulate a bit of the existing probability mass around the current (particular) outcome. (Fig 1)
Fig 1. Preparing the target distribution by re-accumulating the raw predicted probabilities around the specific outcome of the particular case
So, taking into account that the network output is just raw logits, how should one actually compose the targets for CrossEntropyLoss?
If doing it manually, should I apply log_softmax right here to normalize the output to probability space? E.g., something like this?
loss = nn.CrossEntropyLoss()
loss(logits, re_accumulate(log_softmax(logits), around=target_outcome))
And does the label_smoothing option do exactly the same under the hood? If I still need the distribution, not the classification, would it be enough to provide label_smoothing plus class indexes (the particular outcomes) as targets, or should I expand the target into a full distribution anyway?
As an aside, the term logits here is not correct and it is unfortunate that the pytorch
documentation continues to use it in this context. These are (unnormalized) log-probabilities.
So if you have a set of probabilities, {p_i}, you can compute the associated (unnormalized)
log-probabilities, {x_i}, as x_i = log (p_i) + c, where c does not depend on i. (When c = 0, the x_i are just regular log-probabilities, which in this context we call normalized
log-probabilities.)
A logit, on the other hand, is the so-called log-odds-ratio. So the logit, l, that corresponds
to a probability, p, is given by l = log (p / (1 - p)). Logits and log-probabilities do have
some similarities, but they are different. (You will see logits used with BCEWithLogitsLoss.)
(You pass a set of log-probabilities, normalized or not, through softmax() to get the
corresponding set of probabilities and you pass a logit through sigmoid() to get its
corresponding probability.)
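The distinction in numbers, using the formulas above (the value 0.8 is just an example):

```python
import torch

p = torch.tensor(0.8)

# Log-probability: x = log(p) + c (c = 0 gives the normalized log-probability).
log_p = torch.log(p)        # log(0.8) is about -0.2231
print(torch.exp(log_p))     # back to 0.8

# Logit (log-odds ratio): l = log(p / (1 - p)) -- a different quantity.
logit = torch.log(p / (1 - p))  # log(4) is about 1.3863
print(torch.sigmoid(logit))     # also back to 0.8

# For a vector of log-probabilities, softmax() recovers the probabilities,
# and any constant shift c drops out:
probs = torch.tensor([0.1, 0.2, 0.7])
print(torch.softmax(torch.log(probs) + 5.0, dim=0))  # recovers [0.1, 0.2, 0.7]
```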
Could you clarify what you’re doing here? What, in detail, is your training data?
I would expect that you have a set of annotated samples.
So each sample input might be a list of data about a house, e.g., square footage, size of lot,
number of rooms, postal code, etc., and the annotation (target) might be the house’s price.
Or the input might be an image and the annotation might be an integer categorical class
label denoting classes such as cat, crow, salmon.
Or the input might be past and current weather conditions and the annotation might be a
set of probabilities for tomorrow’s precipitation – probability of dry vs. fog vs. rain vs. hail, etc.
Is your case like the second example or like the third? That is, for a single sample, is your
annotation a single class label or is it – for that single sample – a set of probabilities?
And to add some concrete detail, what is the meaning of your input data and what is the
meaning of your annotations?
It actually fits both the house-price and the weather-forecast examples.
The point here is that the training data contains only one specific outcome out of multiple really possible ones.
E.g., a house with the very same attributes costs $100k in one example and $150k in another.
Likewise, for the same environment-sensor readings, the next day’s humidity could be 39% in one specific example and 35% in another.
So the task is to predict a specific continuous attribute value, but not just some mean value around it – rather, a distribution across the defined value range with the corresponding probabilities.
So the idea is: whatever distribution the model predicts at whatever stage of training, we iteratively emphasize just the one specific particular outcome present in the data in the target distribution, in order to eventually obtain the overall distribution for similar inputs.
@KFrank, do you have in mind any specific web resource or book section that covers this topic best, just to dive into the context in more detail? So far I literally follow the idea only at the level of each individual paragraph you posted :)
You still haven’t told us concretely what your inputs and targets are (nor, for that matter,
what they mean).
So let me assume that for each training sample you have some sort of input and your
target (annotation) is a single class label (e.g., class A, class B, etc.). It may be the case
that very similar inputs don’t all have the same class label and you are interested in
predicting somehow the distribution of class labels that correspond to some set of very
similar inputs.
Perhaps there is some way of using ensembles of inputs as some sort of aggregate
input and training your model to predict the distribution of class labels associated with
such an ensemble, but I don’t know how one might do this (and I don’t have any references
to suggest).
However, even if you train your model using one single class label per input, your model
will still learn something about the distribution you seek. Specifically, understand the output
of your final Linear layer as being unnormalized log-probabilities. If, when you train, many
very similar inputs have several different class labels, your model will learn this structure.
That is, at inference time your model will predict for a similar input a set of log-probabilities
that correspond to probabilities that are spread out over the associated class labels.
Conversely, if some other set of very similar training inputs all have the same class label,
your model will learn that structure as well so that at inference time your model will predict
for a similar input a set of probabilities that is close to one for that specific label and close
to zero for all of the others.
That is, when you train on your training set with single class labels, your model will learn
its “ensemble” structure – even though individual training samples are not ensembles – and
predict useful sets of probabilities, rather than just single class labels.
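A toy illustration of this effect, with made-up numbers: near-identical inputs carry different single (integer) class labels in a 70% / 30% split, the model trains on those hard labels with CrossEntropyLoss, and at inference time softmax() over its raw output approximately recovers the label frequencies.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy model: 3 classes, no final activation.
model = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 3))
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Nearly identical inputs with *different* single class labels
# (one observed outcome per sample), split 70% class 0, 30% class 1.
inputs = torch.ones(100, 4) + 0.01 * torch.randn(100, 4)
labels = torch.tensor([0] * 70 + [1] * 30)

for _ in range(500):
    optimizer.zero_grad()
    criterion(model(inputs), labels).backward()
    optimizer.step()

with torch.no_grad():
    probs = torch.softmax(model(torch.ones(1, 4)), dim=1)
print(probs)  # spread roughly like the 70% / 30% label frequencies
```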
I don’t have any specific recommendation for where this is all tied together (but there might
be something out there …).
Pytorch’s documentation does explain the calculations performed by Softmax and Sigmoid.
Wikipedia also has entries for “Logit” and “Softmax” and those give a good summary of the
basic facts.