Train loss/accuracy is constant

Hi, I am using a custom softmax and loss function, but my train/test accuracy and loss stay constant. Can anyone please point out what the issue might be?

Model

import torch as th
import torch.nn as nn

class FemnistNet(nn.Module):
    def __init__(self):
        super(FemnistNet, self).__init__()
        self.conv1 = nn.Conv2d(1, 32, kernel_size=5, stride=1, padding=2) ##output shape (batch, 32, 28, 28)
        th.nn.init.xavier_uniform_(self.conv1.weight)

        self.pool1 = nn.MaxPool2d(2, stride=2, ) ## output shape (batch, 32, 14, 14)
        
        self.conv2 = nn.Conv2d(32, 64, kernel_size=5, stride=1, padding=2) ##output shape (batch, 64, 14, 14)
        th.nn.init.xavier_uniform_(self.conv2.weight)

        self.pool2 = nn.MaxPool2d(2, stride=2) ## output shape (batch, 64, 7, 7)
        
        self.fc1 = nn.Linear(3136, 2048)
        th.nn.init.xavier_uniform_(self.fc1.weight)
        
        self.fc2 = nn.Linear(2048 ,62)
        th.nn.init.xavier_uniform_(self.fc2.weight)

    def my_softmax(self, x):
        # subtract the per-row max before exponentiating for numerical stability
        max_el = x.max(dim=1, keepdim=True)[0]
        result = th.exp(x - max_el) / th.sum(th.exp(x - max_el), dim=1, keepdim=True)
        return result

    def forward(self, x):
        x = x.view(-1, 1, 28, 28)
        x = self.conv1(x)
        x = th.nn.functional.relu(x)

        x = self.pool1(x)

        x=self.conv2(x)
        x = th.nn.functional.relu(x)
        
        x = self.pool2(x)
        
        x = x.flatten(start_dim=1)
        
        x = self.fc1(x)
        l1_activations = th.nn.functional.relu(x)
        softmax_input = self.fc2(l1_activations)
        x = self.my_softmax(softmax_input)

        grad_self_ = None
        return x, l1_activations, grad_self_

Loss function

def cross_entropy_with_logits(softmax_logits, targets, batch_size):
    eps = PlaceHolder().on(th.tensor(1e-7), wrap = False)
    return -(targets * th.log(softmax_logits+eps)).sum() / batch_size
Training logs

Round: 1 ---------train loss: tensor([963.06024170])  acc:  tensor([0.])  gradient: tensor(3.11035156)
Round: 2 ---------train loss: tensor([983.20312500])  acc:  tensor([0.])  gradient: tensor(0.)
Round: 3 ---------train loss: tensor([983.20312500])  acc:  tensor([0.])  gradient: tensor(0.)
Round: 4 ---------train loss: tensor([983.20312500])  acc:  tensor([0.])  gradient: tensor(0.)
Round: 5 ---------train loss: tensor([983.20312500])  acc:  tensor([0.])  gradient: tensor(0.)
Round: 6 ---------train loss: tensor([983.20312500])  acc:  tensor([0.])  gradient: tensor(0.)
Round: 7 ---------train loss: tensor([983.20312500])  acc:  tensor([0.])  gradient: tensor(0.)
Round: 8 ---------train loss: tensor([983.20312500])  acc:  tensor([0.])  gradient: tensor(0.)
Round: 9 ---------train loss: tensor([983.20312500])  acc:  tensor([0.])  gradient: tensor(0.)

Here you can see that the train loss and accuracy are constant. Moreover, I have printed a few values of the gradient of the last layer; after the first round it is constant, i.e. 0. Any pointer will be helpful. @ptrblck

I would recommend comparing your custom approach to the native implementation of nn.CrossEntropyLoss and checking if you are running into numerical issues.
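Something along these lines (a rough sketch, assuming one-hot targets and leaving out the PySyft PlaceHolder wrapper) should show whether your loss agrees with the native one on the same logits:

import torch as th
import torch.nn.functional as F

logits = th.randn(8, 62)                          # random logits: batch of 8, 62 classes
targets = th.randint(0, 62, (8,))                 # class indices
targets_onehot = F.one_hot(targets, num_classes=62).float()

model = FemnistNet()
probs = model.my_softmax(logits)
custom = -(targets_onehot * th.log(probs + 1e-7)).sum() / logits.shape[0]
native = F.cross_entropy(logits, targets)         # log_softmax + NLL on the raw logits

print(custom.item(), native.item())               # should agree up to the 1e-7 eps

If the two values drift apart for large logits, that would point to the explicit softmax followed by log being the numerical problem, since F.cross_entropy works in log-space internally.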

@ptrblck thanks for the reply. The issue was resolved, but I observe a weird accuracy drop after a certain round. Please see the attached plot.

The accuracy and loss around the drops are as follows:

Round 187: accuracy 0.66 loss : 0.018870125
Round 188: accuracy 0.1525 loss: 12.478773
Round 289: accuracy 0.68 loss: 0.014562485
Round 290: accuracy 0.3 loss : 5.9329453

Weights and gradients are being updated normally. How can I debug this? Do I need to change the model, or something else?

EDIT
The above plot was produced with a learning rate of 0.0003. I have checked that if I use a learning rate of 0.0001, the accuracy goes up to 0.67 instead of dropping.

@ptrblck Adding some information: I am training the model on mobile using the TorchScript functions. Could mobile precision be playing a role that disturbs the model weights?

I’m not aware of training limitations on mobile (and thought training was disabled by default there), but your loss curve looks as if the learning rate might be too high and a “bad” update is throwing the parameters off.
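If that is what's happening, lowering the learning rate or clipping the gradient norm before the optimizer step should help. A rough sketch of such a step (I don't know what your PySyft/TorchScript plan allows, so model, optimizer, images, and targets_onehot here are placeholders):

import torch as th

def train_step(model, optimizer, images, targets_onehot, max_norm=1.0):
    optimizer.zero_grad()
    probs, _, _ = model(images)
    loss = -(targets_onehot * th.log(probs + 1e-7)).sum() / images.shape[0]
    loss.backward()
    # cap the global gradient norm so a single outlier batch cannot blow up the weights
    th.nn.utils.clip_grad_norm_(model.parameters(), max_norm=max_norm)
    optimizer.step()
    return loss.item()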

Thanks @ptrblck for your reply. Currently, the learning rate is 0.0003. What should it be, or what would be an ideal learning rate for smooth training on mobile? I am using PySyft and TorchScript to enable training on mobile devices. Moreover, I have also observed that in the round where the accuracy drops, the gradients increase during that training round. Please see the logs below:

--------- before training test accuracy tensor([0.67500001]) loss tensor([1.83469057], grad_fn=)

Batch: 1 ---------train acc: tensor([1.])  loss:  tensor([0.00270963])  gradient: tensor(0.03910828)
Batch: 2 ---------train acc: tensor([1.])  loss:  tensor([0.00354141])  gradient: tensor(0.05355072)
Batch: 3 ---------train acc: tensor([1.])  loss:  tensor([0.00330056])  gradient: tensor(0.04891968)
Batch: 4 ---------train acc: tensor([1.])  loss:  tensor([0.00275047])  gradient: tensor(0.03326416)
Batch: 5 ---------train acc: tensor([1.])  loss:  tensor([0.00342539])  gradient: tensor(0.03829193)
Batch: 6 ---------train acc: tensor([1.])  loss:  tensor([0.00627712])  gradient: tensor(0.08408356)
Batch: 7 ---------train acc: tensor([1.])  loss:  tensor([0.00305835])  gradient: tensor(0.05062103)
Batch: 8 ---------train acc: tensor([1.])  loss:  tensor([0.00265297])  gradient: tensor(0.04867554)
Batch: 9 ---------train acc: tensor([1.])  loss:  tensor([0.00738869])  gradient: tensor(0.16879272)
Batch: 10 ---------train acc: tensor([1.])  loss:  tensor([0.00872648])  gradient: tensor(0.20343018)
Batch: 11 ---------train acc: tensor([1.])  loss:  tensor([0.00662415])  gradient: tensor(0.15134430)
Batch: 12 ---------train acc: tensor([1.])  loss:  tensor([0.00370510])  gradient: tensor(0.04435730)
Batch: 13 ---------train acc: tensor([1.])  loss:  tensor([0.00247839])  gradient: tensor(0.04190826)
Batch: 14 ---------train acc: tensor([1.])  loss:  tensor([0.00364244])  gradient: tensor(0.10583496)
Batch: 15 ---------train acc: tensor([1.])  loss:  tensor([0.00436443])  gradient: tensor(0.05145264)
Batch: 16 ---------train acc: tensor([1.])  loss:  tensor([0.00194083])  gradient: tensor(0.03260040)
Batch: 17 ---------train acc: tensor([1.])  loss:  tensor([0.01361409])  gradient: tensor(0.55578613)
Batch: 18 ---------train acc: tensor([1.])  loss:  tensor([0.00511362])  gradient: tensor(0.04153442)
Batch: 19 ---------train acc: tensor([1.])  loss:  tensor([0.00864908])  gradient: tensor(0.23938751)
Batch: 20 ---------train acc: tensor([1.])  loss:  tensor([0.00423488])  gradient: tensor(0.05925751)
Batch: 21 ---------train acc: tensor([1.])  loss:  tensor([0.00650160])  gradient: tensor(0.08049774)
Batch: 22 ---------train acc: tensor([1.])  loss:  tensor([0.00435914])  gradient: tensor(0.05432129)
Batch: 23 ---------train acc: tensor([1.])  loss:  tensor([0.00497006])  gradient: tensor(0.13050842)
Batch: 24 ---------train acc: tensor([1.])  loss:  tensor([0.00646069])  gradient: tensor(0.16541290)
Batch: 25 ---------train acc: tensor([1.])  loss:  tensor([0.00413794])  gradient: tensor(0.08065033)
Batch: 26 ---------train acc: tensor([1.])  loss:  tensor([0.00315518])  gradient: tensor(0.07088470)
Batch: 27 ---------train acc: tensor([1.])  loss:  tensor([0.00479086])  gradient: tensor(0.06172562)
Batch: 28 ---------train acc: tensor([1.])  loss:  tensor([0.00254913])  gradient: tensor(0.03941345)
Batch: 29 ---------train acc: tensor([1.])  loss:  tensor([0.00545857])  gradient: tensor(0.10971069)
Batch: 30 ---------train acc: tensor([1.])  loss:  tensor([0.00321923])  gradient: tensor(0.04776001)
Batch: 31 ---------train acc: tensor([1.])  loss:  tensor([0.00573995])  gradient: tensor(0.17864990)
Batch: 32 ---------train acc: tensor([1.])  loss:  tensor([0.00584865])  gradient: tensor(0.05651093)
Batch: 33 ---------train acc: tensor([1.])  loss:  tensor([0.00199604])  gradient: tensor(0.01823425)
Batch: 34 ---------train acc: tensor([1.])  loss:  tensor([0.00447120])  gradient: tensor(0.05294037)
Batch: 35 ---------train acc: tensor([1.])  loss:  tensor([0.00366933])  gradient: tensor(0.05814362)
Batch: 36 ---------train acc: tensor([1.])  loss:  tensor([0.00520886])  gradient: tensor(0.07445526)
Batch: 37 ---------train acc: tensor([1.])  loss:  tensor([0.00521563])  gradient: tensor(0.14889526)
Batch: 38 ---------train acc: tensor([1.])  loss:  tensor([0.00408727])  gradient: tensor(0.07069397)
Batch: 39 ---------train acc: tensor([1.])  loss:  tensor([0.00562172])  gradient: tensor(0.15454865)
Batch: 40 ---------train acc: tensor([1.])  loss:  tensor([0.00431792])  gradient: tensor(0.03463364)
Batch: 41 ---------train acc: tensor([1.])  loss:  tensor([0.00484309])  gradient: tensor(0.06002045)
Batch: 42 ---------train acc: tensor([1.])  loss:  tensor([0.00453356])  gradient: tensor(0.06797791)
Batch: 43 ---------train acc: tensor([1.])  loss:  tensor([0.00888296])  gradient: tensor(0.28428650)
Batch: 44 ---------train acc: tensor([1.])  loss:  tensor([0.00492495])  gradient: tensor(0.12075806)
Batch: 45 ---------train acc: tensor([1.])  loss:  tensor([0.00864079])  gradient: tensor(0.15300751)
Batch: 46 ---------train acc: tensor([1.])  loss:  tensor([0.01277762])  gradient: tensor(0.43139648)
Batch: 47 ---------train acc: tensor([1.])  loss:  tensor([0.01564943])  gradient: tensor(0.52875519)
Batch: 48 ---------train acc: tensor([1.])  loss:  tensor([0.00324077])  gradient: tensor(0.13037872)
Batch: 49 ---------train acc: tensor([1.])  loss:  tensor([0.00372944])  gradient: tensor(0.03945160)
Batch: 50 ---------train acc: tensor([1.])  loss:  tensor([0.00207107])  gradient: tensor(0.02909088)
Batch: 51 ---------train acc: tensor([1.])  loss:  tensor([0.01427121])  gradient: tensor(0.67459106)
Batch: 52 ---------train acc: tensor([1.])  loss:  tensor([0.00931694])  gradient: tensor(0.40195465)
Batch: 53 ---------train acc: tensor([1.])  loss:  tensor([0.00768969])  gradient: tensor(0.34389496)
Batch: 54 ---------train acc: tensor([1.])  loss:  tensor([0.00310259])  gradient: tensor(0.06057739)
Batch: 55 ---------train acc: tensor([1.])  loss:  tensor([0.00399042])  gradient: tensor(0.05764771)
Batch: 56 ---------train acc: tensor([1.])  loss:  tensor([0.02026040])  gradient: tensor(0.65161133)
Batch: 57 ---------train acc: tensor([1.])  loss:  tensor([0.03766305])  gradient: tensor(1.19549561)
Batch: 58 ---------train acc: tensor([0.75000000])  loss:  tensor([2.40318584])  gradient: tensor(21.35678864)
Batch: 59 ---------train acc: tensor([0.10000000])  loss:  tensor([29.59458923])  gradient: tensor(64.19202423)
Batch: 60 ---------train acc: tensor([0.15000001])  loss:  tensor([50.46080399])  gradient: tensor(70.89942169)
Batch: 61 ---------train acc: tensor([0.05000000])  loss:  tensor([106.49205017])  gradient: tensor(111.05201721)
Batch: 62 ---------train acc: tensor([0.])  loss:  tensor([68.02191925])  gradient: tensor(131.00369263)
Batch: 63 ---------train acc: tensor([0.10000000])  loss:  tensor([62.87154007])  gradient: tensor(136.53169250)
Batch: 64 ---------train acc: tensor([0.05000000])  loss:  tensor([130.08886719])  gradient: tensor(96.51695251)
Batch: 65 ---------train acc: tensor([0.05000000])  loss:  tensor([114.15641785])  gradient: tensor(181.31959534)
Batch: 66 ---------train acc: tensor([0.])  loss:  tensor([52.26078415])  gradient: tensor(155.91592407)
Batch: 67 ---------train acc: tensor([0.10000000])  loss:  tensor([80.25596619])  gradient: tensor(162.66026306)
Batch: 68 ---------train acc: tensor([0.05000000])  loss:  tensor([53.15277100])  gradient: tensor(99.49279785)
Batch: 69 ---------train acc: tensor([0.])  loss:  tensor([36.78563690])  gradient: tensor(154.07228088)
Batch: 70 ---------train acc: tensor([0.10000000])  loss:  tensor([27.16488647])  gradient: tensor(71.97927856)
Batch: 71 ---------train acc: tensor([0.15000001])  loss:  tensor([13.00922966])  gradient: tensor(45.71361542)
Batch: 72 ---------train acc: tensor([0.15000001])  loss:  tensor([18.45727348])  gradient: tensor(37.72000885)
Batch: 73 ---------train acc: tensor([0.15000001])  loss:  tensor([13.17590809])  gradient: tensor(43.50273895)
Batch: 74 ---------train acc: tensor([0.25000000])  loss:  tensor([11.23884583])  gradient: tensor(50.67826080)
Batch: 75 ---------train acc: tensor([0.34999999])  loss:  tensor([7.64933777])  gradient: tensor(15.79640198)
Batch: 76 ---------train acc: tensor([0.20000000])  loss:  tensor([7.19791555])  gradient: tensor(20.78481293)
Batch: 77 ---------train acc: tensor([0.20000000])  loss:  tensor([5.78147602])  gradient: tensor(21.11981201)
Batch: 78 ---------train acc: tensor([0.10000000])  loss:  tensor([4.49391651])  gradient: tensor(13.54453659)
Batch: 79 ---------train acc: tensor([0.40000001])  loss:  tensor([3.37770414])  gradient: tensor(6.92783737)
Batch: 80 ---------train acc: tensor([0.10000000])  loss:  tensor([3.86443257])  gradient: tensor(11.98369217)
Batch: 81 ---------train acc: tensor([0.25000000])  loss:  tensor([2.39261270])  gradient: tensor(6.18291473)
Batch: 82 ---------train acc: tensor([0.15000001])  loss:  tensor([3.77315140])  gradient: tensor(2.45497513)
Batch: 83 ---------train acc: tensor([0.25000000])  loss:  tensor([3.26987696])  gradient: tensor(4.19144344)
Batch: 84 ---------train acc: tensor([0.30000001])  loss:  tensor([2.87720037])  gradient: tensor(2.54123211)
Batch: 85 ---------train acc: tensor([0.30000001])  loss:  tensor([2.53983879])  gradient: tensor(5.45108795)
Batch: 86 ---------train acc: tensor([0.25000000])  loss:  tensor([2.51650977])  gradient: tensor(4.41087151)
Batch: 87 ---------train acc: tensor([0.34999999])  loss:  tensor([2.45417166])  gradient: tensor(1.67459679)
Batch: 88 ---------train acc: tensor([0.50000000])  loss:  tensor([1.86572969])  gradient: tensor(2.96950340)
Batch: 89 ---------train acc: tensor([0.55000001])  loss:  tensor([2.09055948])  gradient: tensor(4.55820084)
Batch: 90 ---------train acc: tensor([0.34999999])  loss:  tensor([3.00596380])  gradient: tensor(2.44248676)
Batch: 91 ---------train acc: tensor([0.44999999])  loss:  tensor([2.11242151])  gradient: tensor(5.73801613)
Batch: 92 ---------train acc: tensor([0.34999999])  loss:  tensor([2.84348130])  gradient: tensor(3.78480530)
Batch: 93 ---------train acc: tensor([0.40000001])  loss:  tensor([2.47886109])  gradient: tensor(8.25905991)
Batch: 94 ---------train acc: tensor([0.30000001])  loss:  tensor([2.56659913])  gradient: tensor(6.80049515)
Batch: 95 ---------train acc: tensor([0.34999999])  loss:  tensor([2.97527266])  gradient: tensor(8.28206253)
Batch: 96 ---------train acc: tensor([0.40000001])  loss:  tensor([1.94514394])  gradient: tensor(3.13015175)
Batch: 97 ---------train acc: tensor([0.34999999])  loss:  tensor([2.78319120])  gradient: tensor(4.80315018)
Batch: 98 ---------train acc: tensor([0.55000001])  loss:  tensor([1.99790549])  gradient: tensor(2.43188858)
Batch: 99 ---------train acc: tensor([0.25000000])  loss:  tensor([2.75613332])  gradient: tensor(3.43581772)
Batch: 100 ---------train acc: tensor([0.44999999])  loss:  tensor([1.77791047])  gradient: tensor(2.70160103)

--------- after training test accuracy tensor([0.31000000]) loss tensor([2.67179203], grad_fn=)

You can see that after batch 56 the gradients increase. Before training this round, the accuracy is 0.675, and after training it goes down to 0.31. What happened after batch 56? As far as the learning rate is concerned, training worked perfectly until round 215. What happened in this round? Any pointer will be appreciated. Thanks.
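For debugging, one option would be to log per-parameter gradient norms right after backward() to see which layer explodes first (a minimal sketch on the plain-PyTorch side, outside the TorchScript plan):

def log_grad_norms(model):
    # print the gradient norm of every parameter to see which layer blows up first
    for name, param in model.named_parameters():
        if param.grad is not None:
            print(f"{name}: {param.grad.norm().item():.4f}")

If, for example, fc2's gradients spike before the earlier layers, that would point at the softmax/loss saturating on a few hard batches rather than at the convolutional part of the model.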