Inconsistency with softmax and cross entropy in PyTorch

I have been running some checks on the softmax, log softmax and negative log likelihood in PyTorch, and I have seen some inconsistencies. As an example, suppose a logit output for the CIFAR-100 dataset in which one of the classes has a very high logit compared with the rest. In that case the softmax function outputs probability 1 for that class and 0 for the rest, so we should expect a cross-entropy error of 0 when that class is the true label, and a very large (in the limit, infinite) error when it is not, since the cross-entropy computes -log(softmax[class]).
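
In other words, writing p = softmax(x) for the logits x and y for the true class, the expectation is:

$$\mathrm{CE}(x, y) = -\log p_y, \qquad p_y = 1 \;\Rightarrow\; \mathrm{CE} = 0, \qquad p_y \to 0 \;\Rightarrow\; \mathrm{CE} \to \infty.$$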

However, I have realized that if I perform a log_softmax operation from the nn module (where I should get 0 where the softmax is 1, and minus infinity, or at least a very negative value, for the rest, since I assume the logarithm of exactly 0 is avoided), I get an inconsistency. In this case the log_softmax outputs 0 for the class with high probability (as expected) but returns different, very negative numbers for the rest. This is inconsistent for two reasons:

-first: If one class has probability 1 and the rest 0, we should expect that class to have a log_softmax of 0 and the rest to all have the same log probability.

-second: If we assume that the output of nn.Softmax is only rounded to 1 (but we really have 0.999999 for that class and tiny values such as 0.000000001 or 0.0000000000009 for the rest), we could not have an exact 0 in the log_softmax output; we should expect a value near zero. Here are some of the outputs:

LOGIT SPACE:

[-151881.58 -53958.38 382600.28 -208273.06 -682387.7
313643.06 -174599.31 314737.03 -47761.547 210986.7
-121455.92 65831.29 253933.14 107649.18 -179261.78
-9338.262 -226704.14 -197389.72 -88550.125 -225601.8
12020.757 305235.8 31988.535 -133836.75 -124994.27
124390.14 67518.836 -231378.08 311258. 92127.34
255807.5 531698. -64797.055 -234956.02 145733.86
383663.34 157211.12 410751.75 -307850.53 119320.98
-494586.7 -71108.56 -217024.64 -343667.8 182377.83
-196660.45 378547.53 -226750.02 229103.94 -76420.19
89305.65 800864.4 284610.66 -144088.16 -356096.2
87200.52 -347407.84 -244253.73 -133480.6 219508.03
-145519.03 62401.516 -79842.984 -94347.93 -371417.62
412408.22 -26637.191 120584.336 -247938.69 -58618.914
15230.674 176264.03 -91443.67 150178.55 516807.47
-144580.42 101580.055 302416.16 279529.4 -202979.7
200805.12 -81993.945 72215.734 -25153.984 -8138.0186
339307.25 -78513.84 403537. -385725.25 319416.94
-292361.7 23827.395 -386195.25 126718.26 169128.44
777514.5 473938.72 126203.87 99491.91 -239480.5 ]

OUTPUT FROM nn.SOFTMAX

[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0.]

OUTPUT OF LOG SOFTMAX

[[ -952745.94 -854822.75 -418264.1 -1009137.44 -1483252.
-487221.3 -975463.7 -486127.34 -848625.94 -589877.7
-922320.3 -735033.06 -546931.25 -693215.2 -980126.1
-810202.6 -1027568.5 -998254.1 -889414.5 -1026466.2
-788843.6 -495628.56 -768875.8 -934701.1 -925858.6
-676474.25 -733345.56 -1032242.44 -489606.38 -708737.
-545056.9 -269166.38 -865661.44 -1035820.4 -655130.5
-417201.03 -643653.25 -390112.62 -1108714.9 -681543.4
-1295451. -871972.94 -1017889. -1144532.2 -618486.56
-997524.8 -422316.84 -1027614.4 -571760.44 -877284.56
-711558.75 0. -516253.72 -944952.5 -1156960.5
-713663.9 -1148272.2 -1045118.1 -934345. -581356.4
-946383.4 -738462.9 -880707.4 -895212.3 -1172282.
-388456.16 -827501.56 -680280.06 -1048803. -859483.3
-785633.7 -624600.4 -892308.06 -650685.8 -284056.9
-945444.8 -699284.3 -498448.22 -521334.97 -1003844.06
-600059.25 -882858.3 -728648.6 -826018.4 -809002.4
-461557.12 -879378.25 -397327.38 -1186589.6 -481447.44
-1093226. -777037. -1187059.6 -674146.1 -631735.94
-23349.875 -326925.66 -674660.5 -701372.5 -1040344.9 ]]

As we can see, log_softmax assigns an exact 0 to one class, which is inconsistent: if the probability of the rest is 0 (and that is what nn.Softmax outputs), we should get the same log_softmax value (the log of 0) for all of those classes, not different very negative numbers.
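
The same behaviour can be reproduced with a small toy example (the logits below are hypothetical, not taken from the network):

import torch

x = torch.tensor([1000., 0., -500.])   # hypothetical large logits

# softmax: the non-maximal entries underflow to exactly 0
print(x.softmax(dim=0))                 # tensor([1., 0., 0.])

# log_softmax works in log space and keeps distinct, finite values for them
print(x.log_softmax(dim=0))             # tensor([    0., -1000., -1500.])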

I think that is the whole point of using log softmax instead of softmax, i.e., numerical stability.

If we recall the softmax formula, it involves exponentials. When we have large numbers (as in the array you mentioned), the limited numerical precision of the machine means the softmax simply destroys the precision of the numbers.

When exponentiation is involved, the log comes to the rescue and keeps the numbers from blowing up. Also, in our case softmax is mostly used in conjunction with the CrossEntropyLoss function, which needs the log likelihood (log probability). With these two reasons in mind, researchers came up with a clever trick: take the log directly, avoiding numerical precision errors.

A glimpse of softmax-cross-entropy derivation:
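
(A sketch, using the standard log-sum-exp identity; here x are the logits, y the true class and m = max_i x_i.)

$$\log \mathrm{softmax}(x)_j = x_j - \log\sum_i e^{x_i} = (x_j - m) - \log\sum_i e^{x_i - m}$$

$$\mathrm{CE}(x, y) = -\log \mathrm{softmax}(x)_y = (m - x_y) + \log\sum_i e^{x_i - m}$$

Since x_i - m <= 0, every exponential lies in [0, 1], and the log is applied analytically rather than to an already-underflowed probability.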

Hello, thanks for your reply.

That is not the point of my question. For computing the softmax, people use a trick for numerical stability, and you can get accurate softmax post-activations without using the log at all. For example, a CUDA kernel that implements this trick is:

//Softmax -> implemented so the exponentials do not saturate
__global__ void Softmax(float* E, float* N, float* auxE, long int sample_dim, long int n_vals)
{
    int thread_id_x = threadIdx.x + blockIdx.x * blockDim.x;
    if (thread_id_x < n_vals)
    {
        float C_value = 0;
        float actualCoef = 0;

        // Find the maximum logit of the sample handled by this thread.
        // Really high probability of branch divergence: the threads that satisfy the
        // condition execute first (stalling the others) and then the rest, so one warp
        // can take roughly double the time. A reduction would be a better way to get the maximum.
        float maxCoef = E[thread_id_x * sample_dim];
        for (int cA = 1; cA < sample_dim; cA++)
            if (E[thread_id_x * sample_dim + cA] > maxCoef)
                maxCoef = E[thread_id_x * sample_dim + cA];

        // No warp divergence here, as all threads execute the same instructions:
        // exponentiate the shifted logits and accumulate the normalization constant.
        for (int cA = 0; cA < sample_dim; cA++)
        {
            actualCoef = expf(E[thread_id_x * sample_dim + cA] - maxCoef);
            auxE[thread_id_x * sample_dim + cA] = actualCoef;
            C_value += actualCoef;
        }

        // Normalize to obtain the softmax probabilities.
        for (int cA = 0; cA < sample_dim; cA++)
            N[thread_id_x * sample_dim + cA] = auxE[thread_id_x * sample_dim + cA] / C_value;
    }
}

And it does not use the log anywhere. My observation is different.

The max trick that you have mentioned (in the CUDA code) helps when the logit values are moderately large (refer to the max trick here). But the example numbers that you have provided in your question are so large that even the 'max trick' fails in this case: after subtracting the maximum, the exponents are huge negative numbers, for example e^(-151881.58 - 800864.4), and their exponentials underflow to exactly 0.
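
To put rough numbers on that: exp() underflows to exactly 0 once its argument drops below roughly -745 in float64 (and roughly -104 in float32), while the shifted logits here are around -10^6. A quick check in plain Python:

import math

print(math.exp(-745))                    # about 5e-324, near the smallest subnormal double
print(math.exp(-750))                    # 0.0 -> underflow
# The shifted logit from the example, x_i - max = -151881.58 - 800864.4:
print(math.exp(-151881.58 - 800864.4))   # 0.0, so that softmax numerator is exactly 0
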

import torch
x = torch.tensor([-151881.58, -53958.38, 382600.28, -208273.06, -682387.7,
313643.06, -174599.31, 314737.03, -47761.547, 210986.7,
-121455.92, 65831.29, 253933.14, 107649.18, -179261.78,
-9338.262, -226704.14, -197389.72, -88550.125, -225601.8,
12020.757, 305235.8, 31988.535, -133836.75, -124994.27,
124390.14, 67518.836, -231378.08, 311258., 92127.34,
255807.5, 531698., -64797.055, -234956.02, 145733.86,
383663.34, 157211.12, 410751.75, -307850.53, 119320.98,
-494586.7, -71108.56, -217024.64, -343667.8, 182377.83,
-196660.45, 378547.53, -226750.02, 229103.94, -76420.19,
89305.65, 800864.4, 284610.66, -144088.16, -356096.2,
87200.52, -347407.84, -244253.73, -133480.6, 219508.03,
-145519.03, 62401.516, -79842.984, -94347.93, -371417.62,
412408.22, -26637.191, 120584.336, -247938.69, -58618.914,
15230.674, 176264.03, -91443.67, 150178.55, 516807.47,
-144580.42, 101580.055, 302416.16, 279529.4, -202979.7,
200805.12, -81993.945, 72215.734, -25153.984, -8138.0186,
339307.25, -78513.84, 403537., -385725.25, 319416.94,
-292361.7, 23827.395, -386195.25, 126718.26, 169128.44,
777514.5, 473938.72, 126203.87, 99491.91, -239480.5])

# normal softmax
print(x.softmax(dim=0))

# softmax with max trick
print((x - torch.max(x)).softmax(dim=0))

Also, this trick is already implemented in PyTorch (for example, here). Regardless of this, the softmax of the (large) numbers in your example is impossible to compute exactly, even on the CPU with double precision; only the log-space values remain representable.
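
For completeness, a quick check that the log-space path stays finite even at these magnitudes (using just three of the logits from the example; F.cross_entropy is the functional form of nn.CrossEntropyLoss and applies log_softmax internally):

import torch
import torch.nn.functional as F

# three of the logits from the example above
x = torch.tensor([-151881.58, 800864.4, 382600.28])

# log_softmax never leaves log space, so the non-maximal classes keep distinct finite values
print(x.log_softmax(dim=0))                    # roughly tensor([-952745.9, 0.0, -418264.1])

# cross_entropy is -log_softmax picked at the target class: huge but finite, not inf
target = torch.tensor([0])                     # deliberately a wrong target class
print(F.cross_entropy(x.unsqueeze(0), target)) # roughly 9.5e5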