I have a classification model with 3 hidden layers that use ReLU activations; the final layer of the model is a linear layer. At prediction time, the model produces extremely large values, like
1.418e4. I use a function to perform the
softmax operation and obtain the class probabilities as follows:
y = torch.zeros_like(x, dtype=torch.float64)
for i in range(x.shape[0]):
    x_temp = x[i, :] - torch.max(x[i, :])  # numerical stability
    z = torch.exp(x_temp)
    y[i, :] = z / torch.sum(z)
Since the range of values from the linear layer is very wide (spanning over 4 orders of magnitude), the output of the
softmax operation is effectively one-hot encoded instead of continuous-valued probabilities. Since this is the last layer of the model, I am not using
BatchNormalization on it.
Is there any way to prevent the model from producing extremely large numbers as outputs?
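To see the saturation concretely, here is a small sketch (the specific numbers are illustrative, not taken from my model): once one logit exceeds the others by hundreds or more, even a numerically stable softmax returns an exactly one-hot row.

```python
import torch

# Illustrative logits of the magnitude described above: the largest logit
# dominates by ~1.4e4, so the other exponentials underflow to zero.
logits = torch.tensor([[1.418e4, 1.0, 2.0]], dtype=torch.float64)
probs = torch.softmax(logits, dim=1)
# exp(1.0 - 1.418e4) underflows to 0.0 in float64, so probs is one-hot.
```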
Based on the information you have provided, this is not necessarily a problem.
Having the predicted probabilities be, in effect, one-hot encoded just
means that your model is making highly certain predictions. Is there
any reason that your model shouldn’t make highly certain predictions?
Depending on the character of your data and how you train, this could
be the expected outcome of successful training.
(As an aside, why do you write your own
softmax() function instead of using pytorch's built-in torch.softmax()?)
Thank you for the reply @KFrank. I think the model can make highly certain predictions, but not all the time. The data used for training is synthetically generated, and I know that there are some samples where even the ground truth has some uncertainty associated with it. What is troubling me is that the network is confident for all samples, even after training 20 models with different initialisations and train-test splits.
I wrote my own
softmax() because
torch.Softmax() was producing values that were a bunch of ones and something like
2.712 for each example, which violates everything we know about probability distributions. When I wrote my own
softmax(), I found out that the bunch of ones became zeros and the
2.712 became 1, and I ended up with one-hot encoded outputs. The classification was correct, but the output of
torch.Softmax() was weird.
torch.softmax() works for me – and agrees with your softmax():
>>> import torch
>>> _ = torch.manual_seed (2022)
>>> def softmax(x):
...     y = torch.zeros_like(x, dtype=torch.float64)
...     for i in range(x.shape[0]):
...         x_temp = x[i, :] - torch.max(x[i, :])  # numerical stability
...         z = torch.exp(x_temp)
...         y[i, :] = z / torch.sum(z)
...     return y
>>> t = torch.randn (3, 5, dtype = torch.float64)
>>> r1 = softmax (t)
>>> r2 = torch.softmax (t, dim = 1)
>>> torch.allclose (r1, r2)
True
Double-check that you don’t just have a bug somewhere, but if you
still see
torch.softmax() return bogus values, could you tell us
what version of pytorch you are using and post a complete, runnable
script that reproduces the issue?
You don’t say how you are training your classifier or what loss function
you are using, but if you’re using
CrossEntropyLoss, please be sure
that you are passing the output of your final
Linear layer directly to
CrossEntropyLoss without passing it through
softmax() or anything
similar. Doing so would “weaken” the certainty of the model’s predictions,
so in order to be feeding moderately certain predictions to your loss
function, you would actually be training your model to make very certain
predictions.
Thank you for the reply @KFrank. I am using pytorch version
'1.10.2'. I am using an assumed density filtering version of a fully connected neural network inspired from this link: deep_uncertainty_estimation. Every layer takes in two arguments (mean, variance) and produces two outputs (mean, variance).
For the purpose of training the model, the difference from a regular neural network is the
ReLU layer here: relu and
BatchNormalization layer here: bn (I am using a variant for 1D BN). Here is my network architecture:
def __init__(self, input_dim, output_dim, layer_sizes, dropout_prob=0.2):
    super().__init__()
    self.input_dim = input_dim
    self.output_dim = output_dim
    layers = []
    for i in range(1, len(layer_sizes)):
        layers.append(adf.Linear(in_features=layer_sizes[i - 1], out_features=layer_sizes[i]))
    self.layers = adf.Sequential(*layers)

def forward(self, x_mean, x_var):
    x_mean, x_var = self.layers(x_mean, x_var)
    return x_mean, x_var
I am using
CrossEntropyLoss() and there is no
torch.nn.Softmax() layer in the model. I use
softmax() only at the time of evaluation. I actually made the mistake of using
torch.nn.Softmax() along with
CrossEntropyLoss() and then removed the layer later on.
In a different version of the model, I had a custom
Softmax() layer which was activated only when
self.training was
False, so that the unnormalized outputs of the last
Linear layer are fed to
CrossEntropyLoss() during training and the network produces class probabilities instead of logits at the time of evaluation. That is when I found the weird values output by the
Softmax() layer.
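For reference, a minimal sketch of such an eval-only softmax head (the EvalSoftmax name is my own, not the code from my model): it is the identity in training mode, so the loss sees raw logits, and applies softmax only in eval mode.

```python
import torch

class EvalSoftmax(torch.nn.Module):
    # Identity while training (CrossEntropyLoss receives raw logits);
    # softmax in eval mode (the model emits class probabilities).
    def forward(self, x):
        if self.training:
            return x
        return torch.softmax(x, dim=1)

head = EvalSoftmax()
logits = torch.tensor([[1.0, 2.0, 0.5]])

head.train()
out_train = head(logits)   # logits pass through unchanged

head.eval()
out_eval = head(logits)    # each row now sums to 1
```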
Hope this helps.