I have a classification model with 3 hidden layers that use ReLU activations; the final layer of the model is a linear layer. At prediction time, the model produces extremely large values, like
1.418e4. I use a function to perform the
softmax operation and obtain the class probabilities as follows:
y = torch.zeros_like(x, dtype=torch.float64)
for i in range(x.shape[0]):
    x_temp = x[i, :] - torch.max(x[i, :])  # numerical stability
    z = torch.exp(x_temp)
    y[i, :] = z / torch.sum(z)
Since the range of values from the linear layer is very wide (spanning over 4 orders of magnitude), the output of the
softmax operation is effectively one-hot encoded instead of continuous-valued probabilities. Since this is the last layer of the model, I am not using
BatchNormalization on it.
Is there any way to prevent the model from producing extremely large numbers as outputs?
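To see the saturation concretely, here is a small sketch (the specific numbers are illustrative, not taken from my model): once one logit exceeds the others by hundreds or more, even a numerically stable softmax returns an exactly one-hot row.

```python
import torch

# Illustrative logits of the magnitude described above: the largest logit
# dominates by ~1.4e4, so the other exponentials underflow to zero.
logits = torch.tensor([[1.418e4, 1.0, 2.0]], dtype=torch.float64)
probs = torch.softmax(logits, dim=1)
# exp(1.0 - 1.418e4) underflows to 0.0 in float64, so probs is one-hot.
```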
Based on the information you have provided, this is not necessarily a problem.
Having the predicted probabilities be, in effect, one-hot encoded just
means that your model is making highly certain predictions. Is there
any reason that your model shouldn’t make highly certain predictions?
Depending on the character of your data and how you train, this could
be the expected outcome of successful training.
(As an aside, why do you write your own
softmax() function instead of using pytorch's built-in torch.softmax()?)
Thank you for the reply @KFrank. I think the model can make highly certain predictions, but not all the time. The data used for training is synthetically generated, and I know that there are some samples where even the ground truth has some uncertainty associated with it. What is troubling me is that the network is confident for all samples, even after training 20 models with different initialisations and train-test splits.
I wrote my own
softmax() because
torch.Softmax() was producing values that were a bunch of ones and something like
2.712 for each example, which violates everything we know about probability distributions. When I wrote my own
softmax(), I found out that the bunch of ones became zeros and the
2.712 became 1, and I ended up with one-hot encoded outputs. The classification was correct, but the output of
torch.Softmax() was weird.
torch.softmax() works for me – and agrees with your softmax():
>>> import torch
>>> _ = torch.manual_seed (2022)
>>> def softmax(x):
...     y = torch.zeros_like(x, dtype=torch.float64)
...     for i in range(x.shape[0]):
...         x_temp = x[i, :] - torch.max(x[i, :])  # numerical stability
...         z = torch.exp(x_temp)
...         y[i, :] = z / torch.sum(z)
...     return y
>>> t = torch.randn (3, 5, dtype = torch.float64)
>>> r1 = softmax (t)
>>> r2 = torch.softmax (t, dim = 1)
>>> torch.allclose (r1, r2)
True
Double-check that you don’t just have a bug somewhere, but if you
still see
torch.softmax() return bogus values, could you tell us
what version of pytorch you are using and post a complete, runnable
script that reproduces the issue?
You don’t say how you are training your classifier or what loss function
you are using, but if you’re using
CrossEntropyLoss, please be sure
that you are passing the output of your final
Linear layer directly to
CrossEntropyLoss without passing it through
softmax() or anything
similar. Doing so would “weaken” the certainty of the model’s predictions,
so in order to be feeding moderately certain predictions to your loss
function, you would actually be training your model to make very certain
predictions.
Thank you for the reply @KFrank. I am using pytorch version
'1.10.2'. I am using an assumed density filtering version of a fully connected neural network inspired from this link: deep_uncertainty_estimation. Every layer takes in two arguments (mean, variance) and produces two outputs (mean, variance).
For the purpose of training the model, the difference from a regular neural network is the
ReLU layer here: relu and
BatchNormalization layer here: bn (I am using a variant for 1D BN). Here is my network architecture:
def __init__(self, input_dim, output_dim, layer_sizes, dropout_prob=0.2):
    super().__init__()
    self.input_dim = input_dim
    self.output_dim = output_dim
    layers = []
    for i in range(1, len(layer_sizes)):
        layers.append(adf.Linear(in_features=layer_sizes[i - 1], out_features=layer_sizes[i]))
    self.layers = adf.Sequential(*layers)

def forward(self, x_mean, x_var):
    x_mean, x_var = self.layers(x_mean, x_var)
    return x_mean, x_var
I am using
CrossEntropyLoss() and there is no
torch.nn.Softmax() layer in the model. I use
softmax() only at the time of evaluation. I actually made the mistake of using
torch.nn.Softmax() along with
CrossEntropyLoss() and then removed the layer later on.
In a different version of the model, I had a custom
Softmax() layer which was activated only when
self.training was
False, so that the unnormalized outputs of the last
Linear layer are fed to
CrossEntropyLoss() during training and the network produces class probabilities instead of logits at the time of evaluation. That is when I found the weird values output by the
Softmax() layer.
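For reference, a minimal sketch of such an eval-only softmax head (the EvalSoftmax name is my own, not the code from my model): it is the identity in training mode, so the loss sees raw logits, and applies softmax only in eval mode.

```python
import torch

class EvalSoftmax(torch.nn.Module):
    # Identity while training (CrossEntropyLoss receives raw logits);
    # softmax in eval mode (the model emits class probabilities).
    def forward(self, x):
        if self.training:
            return x
        return torch.softmax(x, dim=1)

head = EvalSoftmax()
logits = torch.tensor([[1.0, 2.0, 0.5]])

head.train()
out_train = head(logits)   # logits pass through unchanged

head.eval()
out_eval = head(logits)    # each row now sums to 1
```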
Hope this helps.