# How to prevent very large values in final Linear layer?

I have a classification model with 3 hidden layers with ReLU as activation. The final layer of the model is a linear layer. At the time of prediction, the model produces extremely large values, like `1.418e4`. I use a function to perform the `softmax` operation and obtain the class probabilities as follows:

``````
import torch

def softmax(x):
    # Row-wise softmax over a 2D tensor of shape (batch, classes).
    y = torch.zeros_like(x, dtype=torch.float64)
    for i in range(x.shape[0]):
        x_temp = x[i, :] - torch.max(x[i, :])  # numerical stability
        z = torch.exp(x_temp)
        y[i, :] = z / torch.sum(z)
    return y
``````

Since the outputs of the linear layer span a very wide range (over 4 orders of magnitude), the `softmax` operation produces what is effectively a one-hot encoded output instead of continuous-valued probabilities. Since this is the last layer of the model, I am not using `BatchNormalization` after it.
Is there any way to prevent the model from producing extremely large numbers as outputs?
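
To make the saturation concrete, here is a minimal sketch using made-up logits (the values below are hypothetical, chosen only to match the magnitudes mentioned above). When the gap between the largest logit and the rest is in the thousands, `exp()` of the shifted values underflows to zero and the result is exactly one-hot.

```python
import torch

# Hypothetical logits for a single sample; not taken from the actual model.
large_logits = torch.tensor([[1.418e4, 9.3e3, 1.2e1, -4.0e2]])
large_probs = torch.softmax(large_logits, dim=1)

# For contrast, hypothetical logits of moderate size.
small_logits = torch.tensor([[2.0, 1.0, 0.1, -1.0]])
small_probs = torch.softmax(small_logits, dim=1)

print(large_probs)  # all probability mass on the first class
print(small_probs)  # mass spread over several classes
```

Both outputs are valid probability distributions; the first one just has all of its mass on a single class.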

Hi Laya!

Based on the information you have provided, this is not necessarily
a problem.

Having the predicted probabilities be, in effect, one-hot encoded just
means that your model is making highly certain predictions. Is there
any reason that your model shouldn’t make highly certain predictions?

Depending on the character of your data and how you train, this could
be the expected outcome of successful training.

(As an aside, why do you write your own `softmax()` function instead
of using `torch.softmax()`?)

Best.

K. Frank

Thank you for the reply @KFrank. I think the model can make highly certain predictions, but not all the time. The data used for training is synthetically generated, and I know that there are some samples where even the ground truth has some uncertainty associated with it. What is troubling me is that the network is confident for all samples, even after training 20 models with different initialisations and train-test splits.

I wrote my own `softmax()` because `torch.nn.Softmax()` was producing outputs that consisted of a bunch of ones and one value like `2.712` for each example, which cannot be a valid probability distribution. When I wrote my own `softmax()`, the ones became zeros and the `2.712` became `1`, so I ended up with one-hot encoded outputs. The classification was correct, but the output of `torch.nn.Softmax()` was weird.

Hi Laya!

`torch.softmax()` works for me – and agrees with your `softmax()`
version:

``````
>>> import torch
>>> torch.__version__
'1.10.2'
>>> _ = torch.manual_seed (2022)
>>> def softmax(x):
...     y = torch.zeros_like(x, dtype=torch.float64)
...     for i in range(x.shape[0]):
...         x_temp = x[i, :] - max(x[i, :]) # numerical stability
...         z = torch.exp(x_temp)
...         y[i, :] = z / torch.sum(z)
...     return y
...
>>> t = torch.randn (3, 5, dtype = torch.float64)
>>> r1 = softmax (t)
>>> r2 = torch.softmax (t, dim = 1)
>>> torch.allclose (r1, r2)
True
``````

Double-check that you don’t just have a bug somewhere, but if you
are seeing `torch.softmax()` return bogus values, could you tell us
what version of pytorch you are using and post a complete, runnable
script that reproduces the issue?

You don’t say how you are training your classifier or what loss function
you are using, but if you’re using `CrossEntropyLoss`, please be sure
that you are passing the output of your final `Linear` layer directly to
`CrossEntropyLoss` without passing it through `softmax()` or anything
similar. An extra `softmax()` would “weaken” the certainty of the predictions
that reach the loss function, so in order to feed even moderately certain
predictions to the loss, your model would actually be trained to make very
certain predictions.
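
As a rough illustration of this point, the sketch below uses made-up, very confident logits (the tensors and sizes are assumptions for demonstration only). Fed directly to `CrossEntropyLoss`, they give a loss of essentially zero. Passed through `softmax()` first, the inputs are squeezed into `[0, 1]`, so the loss cannot drop below `log(1 + (C - 1) * exp(-1))` for `C` classes, no matter how confident the model is.

```python
import math
import torch

loss_fn = torch.nn.CrossEntropyLoss()

# Hypothetical batch: 4 samples, 3 classes, confident and correct logits.
num_classes = 3
target = torch.tensor([0, 1, 2, 0])
logits = torch.full((4, num_classes), -20.0)
logits[torch.arange(4), target] = 20.0

# Intended usage: raw logits go straight into CrossEntropyLoss.
loss_direct = loss_fn(logits, target)

# Mistaken usage: softmax first, then CrossEntropyLoss applies
# log_softmax again on values that are confined to [0, 1].
loss_double = loss_fn(torch.softmax(logits, dim=1), target)

# Smallest loss reachable with the extra softmax in place.
floor = math.log(1.0 + (num_classes - 1) * math.exp(-1.0))

print(loss_direct.item(), loss_double.item(), floor)
```

With the extra `softmax()` in place, the optimizer can only approach that floor by pushing the logits further and further apart, which is one way to end up with extremely large outputs.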

Best.

K. Frank

Thank you for the reply @KFrank. I am using PyTorch version `1.10.2`. My model is an assumed density filtering version of a fully connected neural network, inspired by this link: deep_uncertainty_estimation. Every layer takes in two arguments (mean, variance) and produces two outputs (mean, variance).

For the purpose of training the model, the differences from a regular neural network are the `ReLU` layer here: relu and the `BatchNormalization` layer here: bn (I am using a variant for 1D BN). Here is my network architecture:

``````
class FCModel(torch.nn.Module):
    def __init__(self, input_dim, output_dim, layer_sizes, dropout_prob=0.2):
        super().__init__()
        self.input_dim = input_dim
        self.output_dim = output_dim
        layers = []
        for i in range(1, len(layer_sizes)):
            ...  # (listing truncated here)
``````

I am using `CrossEntropyLoss()` and there is no `torch.nn.Softmax()` layer in the model. I use `softmax()` only at the time of evaluation. I actually made the mistake of using `torch.nn.Softmax()` along with `CrossEntropyLoss()` and then removed the layer later on.
In a different version of the model, I had a custom `Softmax()` layer which was activated only when `self.training` was `False` so that the unnormalized outputs of the last `Linear` layer are fed to `CrossEntropyLoss()` and the network produces class probabilities instead of logits at the time of evaluation. That is when I found weird values output by the `torch.nn.Softmax()` layer.
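
For reference, a layer of the kind described above could look roughly like the following. This is only a sketch: the class name `EvalOnlySoftmax` is hypothetical, and it handles a single logits tensor rather than the (mean, variance) pairs used in the actual model.

```python
import torch

class EvalOnlySoftmax(torch.nn.Module):
    """Hypothetical head: logits while training, probabilities in eval mode."""

    def __init__(self, dim=1):
        super().__init__()
        self.dim = dim

    def forward(self, x):
        if self.training:
            # Pass raw logits through so CrossEntropyLoss sees them unchanged.
            return x
        return torch.softmax(x, dim=self.dim)

head = EvalOnlySoftmax(dim=1)
x = torch.randn(3, 5)

head.train()
train_out = head(x)   # identical to x

head.eval()
eval_out = head(x)    # each row sums to 1
```

Whether the layer returns logits or probabilities depends on `model.train()` / `model.eval()` having been called, so it is worth checking the mode when the outputs look unexpected.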