Why is my loss getting lost?

Hey, I’m new to PyTorch and don’t yet understand everything that happens in the background. For some reason, loss.item() becomes nan after a couple of iterations when I activate interStep. Why does this happen?

I’d be glad for any help.

class interStep:
    def __init__(self, named_params):
        self.named_params = list(named_params)

        # parameter names to filter out (skip BatchNorm layers)
        to_remove = ['bn']
        filter_map = lambda x: not any(name in x[0] for name in to_remove)

        # keep only the weight parameters
        self.weights = [(n, p) for n, p in self.named_params if 'weight' in n]
        self.weights = list(filter(filter_map, self.weights))
        self.n_layers = len(self.weights)

    def _get_weight(self, i):
        _, param = self.weights[i]
        return param

    def step(self, st_dev_activations, st_dev_errors):
        # rescale adjacent weight matrices by the measured std values
        for i in range(1, self.n_layers - 1):
            self._get_weight(i - 1).data.mul_(st_dev_errors)
            self._get_weight(i).data.mul_(st_dev_activations)


def train(interStep):
    for i, (input, target) in enumerate(train_loader):
        output = model(input)
        loss = criterion(output, target)

        if interStep is not None:
            sdactivations = torch.std(output)
            output = torch.div(output, sdactivations)
            sdloss = torch.std(loss)
            loss = torch.div(loss, sdloss)
            interStep.step(sdactivations, sdloss)

        print(loss.item())
        loss = torch.mean(loss)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

Is your standard deviation finite? Could you run it and print out sdactivations and sdloss at each iteration? If your standard deviation isn’t finite (perhaps from a too-small batch size), you could be dividing by zero.
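One way to guard against that case is to clamp the standard deviation to a small floor before dividing. A minimal sketch; `safe_std` and the `eps` value are hypothetical names, not from your code:

```python
import torch

def safe_std(t, eps=1e-8):
    # Clamp the standard deviation to a small floor so a zero-variance
    # batch cannot trigger a division by zero downstream.
    return torch.clamp(torch.std(t), min=eps)

x = torch.ones(4)           # zero-variance tensor
print(torch.std(x).item())  # 0.0 -- dividing by this produces inf/nan
print(safe_std(x).item())   # clamped to eps instead
```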

@AlphaBetaGamma96 Here is an output sample:

sdactivations: 0.9688094258308411, sdloss: 0.8842137455940247
lossItem: 3.0881900787353516
Epoch: [0][0/160]	Time 0.307 (0.307)	Data 0.256 (0.256)	Speed 834.981 (834.981)	Loss 3.0882 (3.0882)	Acc@1 8.594 (8.594)	Acc@5 46.875 (46.875)	Count 256
sdactivations: 1.056261420249939, sdloss: 0.9665254354476929
lossItem: 2.8299927711486816
sdactivations: 1.2517436742782593, sdloss: 1.4695396423339844
lossItem: 2.0257301330566406
sdactivations: 1.635145664215088, sdloss: 2.187786817550659
lossItem: 1.4797642230987549
sdactivations: 2.0124547481536865, sdloss: 4.278054237365723
lossItem: 0.874698281288147
sdactivations: 2.4762659072875977, sdloss: 5.951650142669678
lossItem: 0.6256335377693176
sdactivations: 2.857673406600952, sdloss: 4.7043633460998535
lossItem: 0.8196429014205933
sdactivations: 3.238481283187866, sdloss: 5.849048614501953
lossItem: 0.640394926071167
sdactivations: 3.587682008743286, sdloss: 6.391217231750488
lossItem: 0.6264476776123047
sdactivations: 3.893690347671509, sdloss: 4.930907249450684
lossItem: 0.7258576154708862
sdactivations: 4.203991413116455, sdloss: 6.16979455947876
lossItem: 0.5993886590003967
sdactivations: 4.498993396759033, sdloss: 9.638185501098633
lossItem: 0.5030274987220764
sdactivations: 3.3001585006713867, sdloss: 3.1007816791534424
lossItem: 2.634819507598877
sdactivations: nan, sdloss: nan
lossItem: nan

As they all become nan at the same time, I think this should not be the issue.

Indeed. Given that all of them become nan at the same time, it seems that something else is going on. Does the nan always occur at the same iteration? Also, what loss function are you using? It might be useful to print out the output and target as well, since one of them is presumably producing the nan loss value!
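To catch the offending tensor as early as possible, you could wrap those prints in a small helper and call it on the output, target, and loss inside the training loop. A sketch; `report_nonfinite` is a hypothetical name:

```python
import torch

def report_nonfinite(name, t):
    # Returns True (and prints a warning) if any entry is nan or inf,
    # so the training loop can dump the batch that caused it.
    if not torch.isfinite(t).all():
        print(f"{name} is no longer finite: {t}")
        return True
    return False

report_nonfinite("loss", torch.tensor([0.5, float('nan')]))  # warns, returns True
report_nonfinite("loss", torch.tensor([0.5, 1.2]))           # silent, returns False
```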

My loss function is:

criterion = nn.CrossEntropyLoss(reduction='none')
Iteration 0
Files already downloaded and verified
Files already downloaded and verified
Files already downloaded and verified
------------------------------------------------------------------------------------------------------------
Iteration num: 0
tensor([6, 6, 3, 0, 5, 2, 8, 2, 8, 0, 1, 1, 1, 8, 6, 6, 2, 9, 8, 8, 6, 8, 1, 3,
        7, 1, 1, 0, 3, 3, 9, 2, 4, 2, 6, 5, 9, 7, 4, 0, 3, 2, 7, 3, 5, 9, 0, 6,
        6, 0, 7, 4, 0, 4, 1, 7, 1, 1, 8, 5, 4, 4, 1, 7, 2, 5, 6, 8, 8, 6, 9, 2,
        4, 6, 1, 8, 1, 8, 9, 0, 9, 1, 4, 4, 0, 7, 4, 1, 2, 2, 7, 5, 1, 6, 0, 3,
        9, 4, 2, 5, 8, 7, 8, 1, 2, 1, 0, 2, 6, 9, 5, 5, 3, 7, 0, 3, 5, 9, 4, 5,
        3, 2, 8, 7, 7, 4, 6, 6, 9, 0, 6, 1, 6, 4, 1, 2, 1, 8, 9, 0, 6, 0, 8, 2,
        2, 5, 0, 7, 8, 3, 8, 1, 2, 4, 2, 2, 5, 0, 9, 6, 0, 9, 4, 7, 1, 3, 1, 0,
        6, 9, 2, 0, 9, 7, 1, 4, 3, 3, 9, 4, 9, 8, 7, 1, 9, 2, 3, 8, 7, 6, 3, 4,
        8, 6, 7, 4, 8, 0, 0, 9, 4, 1, 9, 6, 7, 9, 0, 7, 6, 4, 3, 6, 3, 0, 4, 7,
        1, 4, 0, 3, 6, 4, 6, 2, 0, 1, 9, 5, 4, 9, 4, 3, 1, 1, 0, 4, 4, 1, 3, 9,
        8, 4, 7, 9, 0, 3, 3, 5, 4, 6, 5, 1, 0, 6, 3, 8], device='cuda:0')
tensor([[ 0.4823,  0.2964,  1.1482,  ...,  0.2172, -0.4138, -1.0365],
        [-0.7772,  0.2143,  0.5901,  ...,  1.0894,  0.0747, -0.1231],
        [-0.6043, -0.5650,  0.5918,  ...,  0.3752, -0.4515, -2.1144],
        ...,
        [-1.2358, -1.7264, -0.0189,  ...,  1.6880, -1.1934, -1.3710],
        [-1.6906,  0.5091,  0.9523,  ..., -0.3808, -0.3868, -0.5108],
        [-1.3468,  1.0241, -0.0880,  ...,  1.0244,  0.3306, -1.9983]],
       device='cuda:0', grad_fn=<MmBackward>)
sdactivations: 1.0097033977508545, sdloss: 1.0652413368225098
lossItem: 2.6301562786102295
Epoch: [0][0/160]	Time 0.413 (0.413)	Data 0.377 (0.377)	Speed 620.382 (620.382)	Loss 2.6302 (2.6302)	Acc@1 8.203 (8.203)	Acc@5 46.875 (46.875)	Count 256
------------------------------------------------------------------------------------------------------------
Iteration num: 1
tensor([3, 1, 5, 7, 2, 5, 5, 8, 4, 3, 9, 7, 3, 4, 2, 6, 7, 2, 2, 0, 8, 1, 4, 9,
        7, 7, 8, 8, 5, 6, 6, 8, 0, 4, 2, 4, 5, 8, 3, 6, 9, 7, 8, 9, 6, 7, 1, 8,
        8, 1, 0, 8, 9, 2, 7, 1, 2, 0, 2, 4, 8, 8, 2, 9, 0, 1, 2, 6, 9, 9, 8, 5,
        1, 0, 1, 5, 0, 9, 6, 8, 5, 9, 1, 0, 0, 9, 4, 1, 1, 7, 1, 1, 6, 7, 6, 5,
        3, 8, 8, 0, 0, 4, 9, 1, 0, 1, 7, 3, 2, 4, 4, 6, 8, 1, 1, 8, 3, 3, 1, 9,
        6, 0, 1, 5, 9, 3, 8, 4, 1, 6, 1, 1, 4, 9, 8, 7, 3, 0, 3, 4, 1, 9, 2, 1,
        7, 8, 7, 6, 8, 2, 1, 3, 8, 9, 3, 5, 2, 1, 3, 9, 8, 7, 8, 8, 3, 7, 8, 3,
        7, 0, 1, 0, 8, 1, 9, 0, 9, 1, 8, 7, 3, 5, 4, 9, 5, 3, 1, 6, 9, 1, 3, 2,
        5, 5, 8, 8, 1, 0, 1, 3, 2, 9, 7, 6, 1, 4, 2, 3, 0, 7, 2, 4, 9, 2, 2, 7,
        7, 2, 2, 0, 2, 5, 6, 9, 2, 5, 0, 2, 1, 2, 9, 5, 1, 7, 4, 3, 6, 4, 4, 5,
        6, 2, 4, 8, 2, 3, 1, 3, 5, 7, 8, 8, 4, 1, 5, 1], device='cuda:0')
tensor([[ 2.4735, -1.9027,  0.9066,  ..., -0.4060,  0.5519, -1.6199],
        [-1.7410, -0.5936,  0.3711,  ..., -0.0734, -2.2461, -1.0728],
        [-0.6628, -1.2386,  0.8018,  ..., -0.1567, -1.2360, -1.5239],
        ...,
        [-1.0523, -2.6707,  0.6930,  ...,  0.6356, -1.0273, -1.4660],
        [ 1.9683,  0.0360,  0.0793,  ...,  0.6470, -0.6063, -0.2374],
        [ 0.3069, -1.1807,  0.6074,  ...,  0.2922,  0.0501, -0.9813]],
       device='cuda:0', grad_fn=<MmBackward>)
sdactivations: 1.083749532699585, sdloss: 1.0792979001998901
lossItem: 2.7319676876068115
------------------------------------------------------------------------------------------------------------
Iteration num: 2
tensor([9, 9, 6, 9, 8, 8, 5, 4, 0, 3, 0, 3, 4, 8, 2, 2, 2, 9, 2, 7, 7, 8, 3, 0,
        5, 1, 6, 0, 4, 2, 7, 9, 0, 0, 8, 9, 7, 2, 2, 3, 3, 5, 6, 9, 1, 6, 5, 5,
        3, 5, 0, 4, 5, 1, 4, 1, 9, 6, 5, 3, 0, 3, 7, 6, 4, 3, 7, 3, 2, 3, 9, 3,
        9, 6, 6, 1, 4, 8, 7, 3, 1, 3, 4, 2, 6, 1, 6, 5, 9, 5, 3, 1, 2, 6, 4, 8,
        2, 2, 0, 9, 8, 2, 4, 1, 5, 6, 4, 8, 8, 7, 9, 8, 8, 3, 1, 2, 0, 0, 5, 8,
        1, 2, 6, 8, 1, 1, 0, 4, 8, 9, 3, 9, 6, 2, 4, 3, 4, 9, 8, 3, 7, 8, 3, 1,
        1, 1, 7, 2, 3, 4, 0, 8, 5, 9, 1, 8, 4, 2, 7, 0, 1, 1, 3, 8, 1, 5, 4, 2,
        4, 4, 9, 2, 0, 6, 2, 3, 5, 5, 5, 2, 2, 0, 7, 5, 1, 3, 9, 9, 9, 4, 5, 4,
        9, 6, 8, 5, 2, 4, 6, 4, 5, 9, 7, 0, 9, 5, 6, 4, 5, 7, 5, 9, 7, 4, 9, 9,
        7, 8, 0, 7, 7, 8, 1, 8, 2, 6, 6, 1, 6, 8, 8, 2, 5, 7, 7, 3, 4, 7, 1, 8,
        5, 5, 7, 7, 9, 2, 0, 3, 1, 8, 6, 5, 3, 5, 2, 2], device='cuda:0')
tensor([[-2.0396, -0.5301,  0.8984,  ..., -0.0925,  0.3624, -2.6404],
        [-1.1958,  0.9594,  0.7785,  ...,  0.4484, -0.2216, -2.3675],
        [-1.0618, -0.5347,  0.1386,  ..., -0.2420, -0.6653, -1.2504],
        ...,
        [ 0.3692, -1.9406, -0.1676,  ...,  0.6972,  0.2905, -1.9543],
        [ 0.1557, -0.1204, -0.0210,  ...,  1.1594, -1.2747, -1.6906],
        [-3.9985, -3.0479,  2.0096,  ..., -0.5583,  1.6112, -2.3361]],
       device='cuda:0', grad_fn=<MmBackward>)
sdactivations: 1.2101424932479858, sdloss: 1.3503553867340088
lossItem: 2.141524314880371
------------------------------------------------------------------------------------------------------------
Iteration num: 3
tensor([9, 0, 9, 0, 4, 5, 5, 0, 7, 7, 7, 0, 3, 1, 1, 2, 5, 4, 9, 1, 7, 0, 5, 6,
        2, 9, 2, 3, 4, 1, 9, 4, 1, 4, 1, 8, 7, 5, 5, 7, 9, 9, 9, 4, 7, 7, 5, 8,
        4, 6, 6, 8, 0, 6, 2, 0, 2, 7, 5, 0, 8, 4, 1, 8, 2, 9, 4, 1, 8, 2, 2, 7,
        8, 7, 5, 4, 4, 4, 9, 7, 9, 0, 9, 8, 2, 2, 4, 7, 6, 4, 9, 7, 1, 1, 4, 7,
        7, 4, 5, 2, 1, 4, 5, 2, 0, 9, 4, 3, 8, 3, 7, 1, 6, 0, 2, 8, 0, 6, 1, 5,
        2, 2, 6, 2, 0, 9, 0, 1, 6, 5, 4, 1, 7, 9, 1, 4, 9, 0, 9, 2, 0, 4, 9, 9,
        8, 4, 3, 4, 3, 0, 8, 9, 2, 0, 9, 3, 5, 5, 5, 0, 3, 7, 3, 9, 7, 4, 3, 0,
        4, 2, 3, 1, 3, 5, 4, 2, 4, 9, 9, 5, 6, 0, 5, 9, 3, 4, 3, 2, 5, 1, 9, 3,
        2, 2, 4, 2, 7, 2, 7, 9, 1, 7, 5, 1, 1, 5, 0, 0, 2, 5, 0, 9, 7, 3, 0, 7,
        7, 9, 2, 9, 9, 4, 3, 5, 4, 2, 6, 9, 7, 0, 6, 5, 4, 7, 9, 0, 3, 8, 6, 0,
        5, 3, 5, 4, 8, 5, 5, 8, 7, 8, 4, 4, 3, 8, 1, 2], device='cuda:0')
tensor([[ 0.6528, -0.4094,  0.7019,  ...,  0.2155, -1.9526, -2.3808],
        [-0.5322, -1.1375, -0.5830,  ...,  1.2385, -0.1592, -1.0004],
        [-1.0041, -0.2019,  1.3646,  ..., -0.2524, -0.2368, -2.4263],
        ...,
        [-1.1273, -0.6086,  2.0380,  ...,  1.7105, -1.1619, -1.7979],
        [-0.3133,  0.6419,  1.6696,  ...,  0.2661,  0.4561, -0.3616],
        [-0.8983, -0.9374, -0.1871,  ..., -0.2925, -0.6359, -3.0607]],
       device='cuda:0', grad_fn=<MmBackward>)
sdactivations: 1.4503097534179688, sdloss: 1.783054232597351
lossItem: 1.7175476551055908
------------------------------------------------------------------------------------------------------------
Iteration num: 4
tensor([0, 6, 8, 4, 6, 0, 2, 6, 2, 3, 1, 3, 9, 1, 8, 2, 6, 0, 7, 0, 2, 9, 7, 0,
        4, 7, 5, 6, 2, 8, 2, 7, 3, 0, 2, 4, 7, 2, 2, 2, 0, 3, 3, 5, 2, 3, 7, 6,
        6, 9, 0, 9, 1, 6, 2, 2, 5, 2, 7, 3, 0, 8, 8, 2, 9, 9, 5, 4, 6, 2, 8, 7,
        4, 8, 5, 6, 5, 8, 7, 4, 1, 8, 9, 2, 1, 8, 2, 7, 4, 4, 2, 2, 3, 6, 9, 2,
        3, 7, 1, 5, 5, 1, 0, 1, 6, 5, 2, 6, 4, 7, 0, 0, 0, 7, 7, 1, 8, 7, 3, 4,
        6, 5, 3, 1, 2, 1, 1, 9, 0, 3, 3, 0, 1, 8, 0, 0, 6, 2, 8, 7, 7, 1, 2, 9,
        4, 4, 3, 0, 8, 4, 5, 7, 1, 0, 8, 5, 8, 7, 2, 4, 5, 9, 8, 6, 1, 9, 8, 0,
        1, 2, 1, 0, 2, 2, 4, 7, 7, 4, 1, 7, 2, 3, 0, 4, 2, 5, 0, 6, 9, 2, 4, 7,
        0, 7, 2, 6, 3, 1, 9, 2, 8, 1, 1, 0, 8, 0, 2, 0, 2, 5, 1, 8, 5, 6, 9, 6,
        5, 6, 1, 0, 7, 6, 7, 1, 8, 5, 2, 3, 5, 4, 7, 2, 9, 6, 4, 9, 3, 4, 6, 6,
        1, 0, 0, 4, 7, 4, 8, 2, 7, 7, 5, 4, 7, 8, 5, 5], device='cuda:0')
tensor([[-1.0180, -0.8228,  0.5158,  ...,  0.3549,  0.0072, -1.9437],
        [ 0.1443, -0.7663,  1.4854,  ..., -0.7737, -1.0578, -2.4985],
        [-1.2494, -0.8026,  0.8820,  ...,  1.1737, -0.8699, -1.6476],
        ...,
        [-0.8916, -1.7457,  1.1313,  ..., -0.2323, -0.7291, -1.3231],
        [-1.6000, -0.0373,  0.0189,  ..., -0.5194, -0.4465, -1.7128],
        [-1.2338, -1.1997,  1.0362,  ...,  0.5314, -0.4950, -2.2386]],
       device='cuda:0', grad_fn=<MmBackward>)
sdactivations: 1.7753264904022217, sdloss: 3.022984266281128
lossItem: 1.0905673503875732
------------------------------------------------------------------------------------------------------------
Iteration num: 5
tensor([6, 8, 3, 8, 4, 7, 8, 1, 4, 6, 7, 8, 3, 6, 9, 5, 6, 5, 4, 1, 6, 6, 4, 8,
        8, 7, 4, 7, 1, 1, 6, 9, 2, 7, 5, 7, 0, 3, 2, 7, 3, 9, 7, 7, 1, 1, 1, 2,
        6, 6, 4, 5, 5, 0, 2, 4, 2, 3, 6, 9, 0, 8, 9, 0, 0, 6, 5, 0, 3, 4, 8, 8,
        8, 3, 0, 9, 3, 6, 3, 0, 5, 5, 0, 6, 4, 6, 4, 6, 0, 7, 3, 4, 2, 4, 7, 4,
        9, 5, 3, 6, 9, 7, 6, 2, 4, 9, 2, 4, 8, 9, 0, 9, 1, 5, 4, 9, 8, 2, 0, 6,
        8, 3, 1, 7, 5, 4, 8, 9, 4, 3, 9, 3, 7, 6, 5, 4, 1, 6, 0, 7, 9, 4, 4, 0,
        4, 5, 1, 7, 7, 9, 8, 3, 8, 3, 6, 0, 3, 0, 2, 2, 9, 1, 5, 7, 9, 4, 4, 5,
        2, 7, 0, 3, 7, 9, 0, 8, 9, 2, 6, 2, 7, 6, 8, 3, 4, 7, 4, 7, 3, 5, 7, 0,
        6, 7, 5, 1, 5, 5, 2, 2, 9, 7, 7, 0, 3, 7, 1, 8, 2, 0, 4, 6, 5, 1, 9, 0,
        5, 5, 0, 0, 0, 1, 4, 5, 4, 7, 3, 4, 1, 9, 3, 4, 9, 5, 0, 9, 0, 4, 8, 7,
        9, 3, 0, 4, 6, 1, 2, 4, 8, 6, 4, 0, 1, 6, 4, 9], device='cuda:0')
tensor([[-0.3739, -2.0318,  0.8308,  ..., -1.5842, -1.1179, -2.8031],
        [-0.9913,  0.5701,  2.1547,  ...,  1.1520, -0.5210, -2.3975],
        [-0.5808, -0.2658,  0.6696,  ...,  0.8275, -0.4249, -1.5584],
        ...,
        [-0.8302, -0.4249,  0.8844,  ..., -0.0033, -1.1271, -1.9469],
        [-1.2628, -1.2556,  1.3446,  ...,  0.0410, -0.2970, -1.7601],
        [-0.7990, -1.3058,  0.9276,  ...,  0.2087,  0.3230, -1.5558]],
       device='cuda:0', grad_fn=<MmBackward>)
sdactivations: 2.0537357330322266, sdloss: 3.4425933361053467
lossItem: 1.0173795223236084
------------------------------------------------------------------------------------------------------------
Iteration num: 6
tensor([0, 6, 5, 1, 3, 3, 4, 2, 4, 1, 5, 0, 7, 7, 3, 3, 5, 6, 0, 2, 6, 2, 6, 4,
        6, 4, 5, 2, 8, 2, 9, 3, 5, 4, 0, 5, 2, 3, 1, 3, 4, 3, 1, 1, 6, 7, 2, 5,
        5, 6, 9, 1, 8, 2, 9, 2, 4, 0, 3, 5, 5, 9, 7, 5, 0, 4, 0, 5, 7, 4, 6, 4,
        5, 7, 7, 8, 8, 7, 2, 1, 4, 3, 7, 3, 1, 6, 3, 2, 2, 6, 3, 1, 5, 8, 1, 2,
        1, 0, 8, 4, 0, 7, 3, 5, 1, 8, 8, 3, 9, 1, 3, 7, 7, 3, 7, 5, 0, 3, 5, 1,
        7, 4, 3, 0, 0, 1, 6, 8, 8, 5, 2, 7, 1, 9, 4, 0, 0, 0, 0, 5, 3, 6, 2, 6,
        7, 8, 8, 3, 5, 3, 5, 3, 1, 7, 5, 7, 3, 6, 1, 7, 3, 1, 7, 8, 7, 0, 0, 6,
        9, 8, 2, 3, 0, 1, 2, 3, 3, 9, 2, 3, 0, 9, 1, 6, 7, 5, 3, 5, 0, 1, 4, 4,
        3, 5, 2, 8, 4, 3, 9, 0, 3, 0, 6, 4, 9, 7, 4, 7, 3, 6, 8, 2, 4, 9, 0, 5,
        4, 0, 5, 4, 8, 9, 6, 1, 3, 6, 2, 0, 7, 9, 3, 0, 7, 4, 0, 7, 4, 7, 9, 9,
        9, 7, 3, 8, 1, 6, 5, 7, 9, 6, 2, 8, 2, 9, 8, 3], device='cuda:0')
tensor([[-2.3020, -1.6480,  1.7204,  ..., -0.2290, -0.7170, -2.3077],
        [-1.7064, -0.6016,  1.8866,  ...,  0.6248, -0.3797, -2.5742],
        [-1.4687,  0.2669,  1.4310,  ...,  0.5527,  0.2978, -2.1013],
        ...,
        [-0.8727, -1.0016,  0.6120,  ..., -0.7191,  0.1590, -2.0574],
        [-0.5944, -0.9997,  0.5103,  ...,  0.5657, -1.0921, -2.7526],
        [-1.6110, -0.7161,  1.2556,  ...,  0.2084,  0.8581, -1.2577]],
       device='cuda:0', grad_fn=<MmBackward>)
sdactivations: 2.4133262634277344, sdloss: 3.8655548095703125
lossItem: 0.9866995215415955
------------------------------------------------------------------------------------------------------------
Iteration num: 7
tensor([0, 1, 2, 6, 6, 7, 5, 1, 0, 7, 6, 8, 6, 4, 1, 2, 7, 9, 4, 3, 7, 1, 0, 1,
        9, 2, 5, 8, 4, 3, 0, 3, 6, 6, 6, 9, 5, 2, 8, 4, 0, 4, 6, 9, 3, 3, 2, 1,
        9, 9, 7, 7, 7, 8, 3, 6, 9, 4, 4, 8, 4, 7, 7, 2, 7, 5, 0, 1, 7, 2, 2, 0,
        1, 3, 9, 3, 8, 8, 3, 4, 8, 5, 9, 9, 9, 1, 2, 8, 7, 5, 4, 2, 9, 9, 2, 6,
        7, 9, 8, 4, 6, 8, 3, 1, 5, 0, 8, 1, 2, 4, 9, 6, 8, 0, 6, 6, 5, 3, 4, 8,
        7, 2, 6, 3, 3, 6, 7, 2, 7, 4, 4, 2, 5, 9, 4, 1, 3, 0, 2, 7, 1, 0, 3, 7,
        5, 0, 4, 6, 4, 2, 5, 1, 2, 1, 6, 5, 6, 6, 3, 6, 1, 9, 6, 4, 9, 9, 1, 7,
        2, 4, 1, 5, 8, 2, 6, 9, 3, 4, 9, 9, 9, 0, 0, 9, 9, 6, 9, 2, 4, 8, 8, 7,
        6, 1, 6, 9, 3, 7, 0, 8, 5, 2, 0, 2, 2, 7, 1, 9, 2, 9, 1, 0, 9, 6, 1, 6,
        8, 4, 3, 1, 0, 2, 8, 2, 3, 9, 8, 4, 2, 1, 7, 7, 0, 6, 5, 2, 0, 2, 9, 5,
        9, 0, 0, 4, 8, 0, 8, 5, 0, 2, 6, 6, 1, 7, 8, 8], device='cuda:0')
tensor([[ -1.2120,  -2.0005,   1.6341,  ...,   0.6033,  -0.5305,  -3.2017],
        [ -1.8019,  -0.4143,   1.4597,  ...,   0.8472,  -0.2829,  -1.6098],
        [ -2.7603,  -1.1355,   2.4807,  ...,   0.5621,  -0.0642,  -3.7420],
        ...,
        [ -1.2322,  -1.4504,   1.4420,  ...,   0.6432,   0.4821,  -2.5214],
        [-15.2949,  -7.3248,  25.2044,  ...,   1.4242,   1.4580, -13.7431],
        [ -2.4870,  -0.2798,   1.9545,  ...,  -0.7157,  -1.0314,  -2.4235]],
       device='cuda:0', grad_fn=<MmBackward>)
sdactivations: 2.6089565753936768, sdloss: 4.552063465118408
lossItem: 0.8715435266494751
------------------------------------------------------------------------------------------------------------
Iteration num: 8
tensor([6, 6, 0, 8, 0, 0, 7, 6, 2, 2, 9, 7, 6, 6, 4, 1, 0, 0, 5, 6, 5, 7, 3, 6,
        9, 9, 5, 5, 3, 4, 3, 3, 9, 9, 8, 7, 6, 2, 7, 0, 1, 6, 7, 1, 3, 2, 6, 0,
        7, 8, 3, 7, 5, 8, 9, 1, 7, 6, 1, 9, 8, 9, 7, 3, 4, 7, 9, 6, 8, 9, 5, 9,
        9, 9, 2, 8, 2, 4, 9, 7, 2, 6, 1, 5, 7, 3, 5, 1, 3, 1, 4, 8, 9, 4, 6, 3,
        8, 4, 1, 8, 8, 9, 8, 0, 1, 2, 3, 5, 0, 4, 9, 4, 0, 2, 9, 2, 8, 5, 0, 8,
        9, 6, 2, 1, 6, 0, 7, 4, 7, 2, 5, 5, 3, 8, 5, 0, 5, 0, 9, 1, 0, 8, 6, 4,
        0, 0, 1, 0, 2, 1, 8, 4, 2, 7, 0, 4, 2, 4, 2, 4, 9, 7, 6, 9, 2, 7, 7, 9,
        6, 4, 4, 0, 9, 7, 0, 5, 8, 1, 5, 5, 3, 8, 1, 7, 6, 9, 4, 1, 1, 6, 4, 2,
        9, 2, 4, 5, 7, 5, 5, 5, 0, 6, 4, 7, 2, 1, 7, 7, 2, 8, 2, 4, 9, 5, 6, 4,
        3, 2, 5, 0, 2, 3, 7, 5, 8, 7, 3, 3, 2, 1, 7, 0, 0, 9, 3, 7, 8, 8, 7, 8,
        9, 7, 3, 5, 4, 5, 8, 7, 2, 7, 8, 7, 4, 7, 6, 6], device='cuda:0')
tensor([[ -0.7124,  -0.7721,   0.6896,  ...,   0.2432,  -0.4004,  -1.7503],
        [ -0.9975,  -1.2593,   1.7826,  ...,   0.7399,  -0.7758,  -3.1522],
        [-10.9876,  -7.0715,  17.1001,  ...,   0.4595,  -2.0494, -11.1477],
        ...,
        [ -0.7248,  -0.7171,   0.8077,  ...,   0.2410,  -0.2197,  -1.9328],
        [ -0.7845,  -1.0941,   0.5888,  ...,   0.4366,  -0.3011,  -1.8252],
        [ -0.7055,  -0.7121,   0.6080,  ...,   0.2347,  -0.2328,  -1.7705]],
       device='cuda:0', grad_fn=<MmBackward>)
sdactivations: 2.9365410804748535, sdloss: 5.954787254333496
lossItem: 0.7344008088111877
------------------------------------------------------------------------------------------------------------
Iteration num: 9
tensor([3, 2, 2, 5, 5, 9, 7, 8, 9, 5, 7, 4, 9, 9, 1, 6, 4, 2, 8, 7, 1, 2, 0, 8,
        3, 4, 9, 3, 7, 3, 8, 0, 1, 6, 4, 2, 8, 9, 4, 6, 6, 7, 5, 6, 6, 8, 9, 6,
        1, 8, 3, 8, 6, 3, 1, 5, 3, 3, 5, 9, 3, 4, 2, 3, 0, 3, 8, 0, 9, 1, 5, 0,
        4, 8, 6, 0, 3, 9, 3, 9, 7, 7, 7, 8, 1, 3, 3, 7, 2, 6, 1, 7, 4, 0, 2, 6,
        7, 1, 2, 6, 1, 8, 1, 8, 5, 5, 5, 3, 6, 5, 5, 9, 5, 0, 5, 7, 8, 6, 3, 3,
        8, 4, 5, 9, 2, 1, 3, 6, 1, 2, 0, 2, 3, 2, 6, 0, 9, 5, 5, 0, 4, 9, 5, 4,
        9, 4, 3, 1, 8, 1, 0, 2, 4, 9, 1, 1, 1, 1, 3, 1, 4, 9, 9, 7, 0, 3, 1, 9,
        4, 2, 6, 6, 7, 5, 3, 0, 5, 3, 1, 1, 9, 9, 5, 4, 7, 9, 6, 7, 1, 5, 6, 9,
        1, 6, 3, 4, 7, 5, 3, 9, 9, 3, 5, 7, 8, 6, 3, 2, 1, 3, 0, 0, 9, 4, 4, 0,
        6, 5, 9, 7, 3, 2, 7, 6, 4, 1, 9, 1, 5, 6, 8, 1, 7, 6, 9, 0, 5, 2, 1, 8,
        2, 7, 1, 6, 9, 4, 1, 7, 0, 0, 6, 2, 6, 2, 7, 2], device='cuda:0')
tensor([[-3.3316, -0.7074,  3.9191,  ..., -0.1669, -1.0646, -6.2732],
        [-1.6146, -1.7145,  1.6094,  ...,  1.8238, -0.8163, -3.8408],
        [-1.2954, -1.1230,  1.4273,  ...,  1.1358, -0.4365, -3.4556],
        ...,
        [-1.2492, -0.4321,  2.2156,  ..., -0.2905, -0.5351, -2.7494],
        [-2.2683, -0.8141,  1.8049,  ..., -0.1424, -0.7121, -3.8456],
        [-3.1600, -0.8897,  2.0659,  ...,  0.1129, -1.1949, -3.0344]],
       device='cuda:0', grad_fn=<MmBackward>)
sdactivations: 3.325645923614502, sdloss: 6.283945083618164
lossItem: 0.7559524774551392
------------------------------------------------------------------------------------------------------------
Iteration num: 10
tensor([2, 4, 4, 3, 9, 5, 7, 9, 2, 8, 0, 7, 7, 8, 0, 9, 7, 7, 3, 0, 3, 4, 4, 2,
        0, 3, 5, 4, 0, 3, 2, 2, 5, 3, 3, 8, 2, 8, 1, 4, 3, 1, 1, 7, 7, 1, 3, 2,
        6, 8, 2, 9, 0, 4, 1, 7, 2, 0, 0, 1, 6, 2, 8, 4, 5, 9, 0, 4, 7, 4, 5, 6,
        3, 0, 3, 2, 4, 1, 6, 3, 7, 4, 2, 5, 1, 2, 5, 9, 4, 5, 1, 9, 8, 2, 5, 9,
        2, 1, 7, 1, 0, 0, 8, 4, 2, 3, 0, 1, 0, 3, 5, 1, 0, 2, 6, 1, 3, 5, 6, 8,
        8, 3, 1, 3, 7, 8, 2, 5, 3, 7, 4, 9, 3, 2, 8, 1, 9, 6, 9, 7, 2, 4, 2, 4,
        0, 1, 6, 0, 1, 6, 3, 0, 2, 1, 3, 7, 0, 4, 2, 2, 2, 9, 2, 8, 0, 1, 9, 2,
        3, 7, 8, 6, 8, 8, 0, 9, 8, 6, 4, 5, 8, 1, 0, 3, 7, 2, 9, 8, 7, 0, 1, 1,
        1, 3, 3, 9, 4, 6, 2, 1, 0, 6, 1, 1, 0, 4, 2, 1, 8, 8, 3, 3, 1, 7, 7, 5,
        0, 9, 2, 3, 6, 3, 7, 6, 3, 4, 1, 6, 1, 0, 0, 4, 7, 7, 8, 1, 3, 5, 3, 3,
        3, 4, 8, 4, 9, 2, 8, 4, 0, 7, 2, 1, 0, 1, 0, 0], device='cuda:0')
tensor([[ -5.6106,  -1.6667,   5.2138,  ...,   0.4415,  -0.4450,  -6.3135],
        [ -8.9125,  -3.7542,  19.5806,  ...,  -0.4064,   0.3483, -11.4086],
        [ -1.0877,  -1.1826,   0.8344,  ...,  -0.1360,  -0.6348,  -2.6532],
        ...,
        [ -0.7358,  -0.7670,   0.9107,  ...,  -0.0585,  -0.3440,  -1.8350],
        [ -4.5926,  -2.8046,   5.5985,  ...,   0.7487,  -1.1452,  -4.6941],
        [ -2.8975,  -1.7148,   2.8584,  ...,   0.5823,  -0.2585,  -4.7432]],
       device='cuda:0', grad_fn=<MmBackward>)
sdactivations: 3.765282392501831, sdloss: 7.975272178649902
lossItem: 0.5635203123092651
------------------------------------------------------------------------------------------------------------
Iteration num: 11
tensor([8, 9, 9, 8, 2, 7, 4, 2, 3, 7, 1, 1, 2, 2, 6, 7, 9, 4, 3, 1, 3, 3, 9, 8,
        5, 8, 9, 2, 9, 8, 0, 1, 4, 9, 8, 2, 6, 5, 1, 1, 4, 8, 5, 6, 8, 9, 7, 9,
        7, 6, 5, 5, 9, 3, 3, 6, 6, 4, 8, 0, 7, 3, 8, 6, 8, 1, 0, 4, 8, 1, 6, 7,
        4, 8, 5, 4, 6, 0, 7, 2, 1, 2, 1, 9, 8, 1, 0, 3, 5, 6, 5, 9, 2, 6, 1, 2,
        5, 9, 6, 7, 3, 0, 1, 1, 2, 9, 7, 5, 4, 3, 5, 1, 3, 7, 2, 2, 6, 3, 9, 2,
        0, 6, 4, 3, 4, 0, 7, 3, 4, 3, 7, 1, 4, 9, 5, 9, 7, 9, 1, 4, 2, 3, 1, 6,
        9, 4, 7, 7, 0, 6, 9, 0, 8, 3, 2, 0, 3, 8, 8, 9, 0, 6, 8, 3, 3, 7, 0, 5,
        4, 4, 9, 1, 2, 4, 4, 5, 0, 7, 2, 4, 3, 2, 6, 2, 7, 3, 8, 3, 2, 9, 3, 6,
        4, 6, 2, 1, 3, 5, 7, 3, 9, 4, 0, 6, 6, 0, 2, 5, 3, 6, 4, 5, 2, 2, 2, 6,
        4, 8, 5, 5, 4, 8, 3, 1, 4, 3, 2, 4, 2, 6, 8, 0, 7, 9, 4, 9, 4, 6, 0, 2,
        9, 2, 2, 4, 3, 3, 4, 9, 1, 1, 2, 6, 1, 9, 9, 4], device='cuda:0')
tensor([[-0.7936, -0.6890,  0.8799,  ...,  0.4396, -0.1954, -1.9341],
        [-0.8742, -0.8172,  0.9014,  ...,  0.4542, -0.2890, -1.4936],
        [-0.8265, -0.9503,  0.8600,  ...,  0.8367, -0.4419, -1.8943],
        ...,
        [-1.2959, -0.6144,  1.8038,  ..., -0.4317, -0.0171, -2.9657],
        [-0.9658, -0.5634,  1.6531,  ...,  0.4865, -0.3951, -1.8265],
        [-1.1988, -1.1405,  0.9006,  ...,  0.7609, -0.4783, -2.2097]],
       device='cuda:0', grad_fn=<MmBackward>)
sdactivations: 4.061051368713379, sdloss: 8.536210060119629
lossItem: 0.564131498336792
------------------------------------------------------------------------------------------------------------
Iteration num: 12
tensor([2, 4, 5, 7, 3, 7, 4, 2, 6, 2, 4, 7, 4, 3, 6, 6, 6, 0, 4, 0, 8, 7, 7, 3,
        5, 8, 9, 1, 8, 0, 9, 8, 6, 1, 4, 9, 4, 4, 9, 1, 6, 8, 1, 9, 2, 5, 2, 0,
        2, 3, 0, 2, 7, 4, 1, 8, 4, 2, 7, 5, 7, 6, 5, 8, 8, 1, 0, 9, 6, 9, 1, 0,
        2, 5, 3, 2, 9, 3, 2, 6, 7, 4, 4, 8, 5, 1, 0, 2, 4, 2, 8, 5, 2, 6, 7, 1,
        3, 3, 9, 7, 0, 4, 8, 5, 6, 0, 6, 9, 2, 4, 1, 7, 8, 3, 0, 6, 6, 0, 1, 6,
        5, 1, 1, 9, 8, 2, 3, 6, 6, 9, 2, 0, 1, 7, 9, 2, 3, 3, 6, 6, 9, 2, 5, 3,
        1, 9, 4, 9, 5, 9, 5, 7, 3, 1, 5, 5, 9, 8, 6, 8, 7, 1, 5, 3, 4, 6, 2, 7,
        3, 7, 0, 7, 1, 6, 7, 4, 2, 8, 0, 6, 7, 8, 0, 0, 5, 5, 5, 9, 9, 1, 4, 1,
        6, 1, 2, 6, 5, 1, 2, 1, 8, 1, 2, 1, 1, 8, 8, 9, 0, 2, 4, 4, 1, 3, 7, 8,
        7, 6, 0, 3, 4, 0, 4, 2, 3, 5, 1, 2, 4, 2, 9, 5, 6, 4, 6, 3, 8, 8, 0, 8,
        7, 6, 8, 1, 6, 5, 7, 0, 8, 7, 5, 3, 4, 3, 6, 2], device='cuda:0')
tensor([[-3.4676, -1.7467,  3.2783,  ..., -0.1886, -1.3624, -4.7153],
        [-0.5332, -0.7577,  0.7168,  ...,  0.5388, -0.2941, -1.6675],
        [-0.5332, -0.7577,  0.7168,  ...,  0.5388, -0.2941, -1.6675],
        ...,
        [-0.5332, -0.7577,  0.7168,  ...,  0.5388, -0.2941, -1.6675],
        [-0.5332, -0.7577,  0.7168,  ...,  0.5388, -0.2941, -1.6675],
        [-0.9877, -0.9210,  1.6490,  ...,  0.4779, -0.2820, -2.3817]],
       device='cuda:0', grad_fn=<MmBackward>)
sdactivations: 4.560257911682129, sdloss: 10.405665397644043
lossItem: 0.49603745341300964
------------------------------------------------------------------------------------------------------------
Iteration num: 13
tensor([0, 4, 7, 7, 7, 7, 7, 9, 6, 3, 5, 9, 2, 2, 2, 2, 5, 0, 7, 8, 9, 9, 7, 8,
        0, 9, 4, 6, 7, 2, 0, 6, 9, 0, 9, 6, 4, 5, 1, 6, 9, 2, 1, 9, 1, 8, 1, 8,
        0, 4, 7, 6, 1, 2, 0, 1, 2, 5, 7, 2, 0, 4, 3, 7, 8, 0, 6, 4, 5, 3, 1, 7,
        5, 5, 4, 8, 1, 2, 7, 4, 3, 5, 4, 3, 3, 0, 7, 1, 4, 5, 3, 5, 6, 6, 8, 1,
        8, 1, 5, 2, 9, 5, 7, 1, 2, 9, 5, 4, 1, 1, 2, 7, 6, 5, 0, 5, 1, 3, 5, 2,
        0, 9, 6, 4, 1, 5, 3, 0, 9, 1, 6, 7, 4, 9, 6, 5, 4, 5, 6, 8, 6, 1, 6, 3,
        1, 7, 8, 2, 2, 1, 0, 7, 3, 2, 1, 0, 6, 2, 0, 7, 5, 6, 7, 3, 0, 1, 3, 2,
        3, 7, 0, 5, 3, 1, 4, 8, 2, 1, 6, 0, 6, 6, 8, 9, 0, 9, 0, 2, 6, 4, 5, 9,
        3, 9, 3, 7, 4, 9, 2, 0, 5, 8, 1, 7, 9, 3, 4, 0, 8, 9, 7, 3, 3, 8, 4, 6,
        2, 5, 5, 1, 9, 5, 8, 9, 5, 6, 0, 4, 0, 6, 8, 3, 1, 5, 8, 1, 7, 1, 3, 4,
        0, 0, 7, 3, 9, 1, 7, 1, 8, 2, 4, 3, 1, 1, 4, 8], device='cuda:0')
tensor([[-4.7163, -2.5053,  8.4166,  ..., -0.2740, -1.3736, -7.8538],
        [-4.7163, -2.5053,  8.4166,  ..., -0.2740, -1.3736, -7.8538],
        [-4.7163, -2.5053,  8.4166,  ..., -0.2740, -1.3736, -7.8538],
        ...,
        [-4.7163, -2.5053,  8.4166,  ..., -0.2740, -1.3736, -7.8538],
        [-4.7163, -2.5053,  8.4166,  ..., -0.2740, -1.3736, -7.8538],
        [-4.7163, -2.5053,  8.4166,  ..., -0.2740, -1.3736, -7.8538]],
       device='cuda:0', grad_fn=<MmBackward>)
sdactivations: 4.476424217224121, sdloss: 4.369915008544922
lossItem: 2.0310254096984863
------------------------------------------------------------------------------------------------------------
Iteration num: 14
tensor([5, 3, 4, 0, 6, 3, 4, 8, 4, 6, 8, 5, 1, 9, 2, 3, 4, 6, 8, 2, 3, 1, 3, 6,
        7, 1, 3, 2, 2, 4, 5, 1, 3, 1, 9, 1, 2, 8, 9, 4, 4, 1, 1, 6, 1, 8, 5, 1,
        7, 0, 1, 4, 7, 6, 1, 7, 4, 3, 2, 8, 6, 0, 5, 5, 1, 1, 8, 1, 5, 3, 9, 2,
        0, 4, 8, 2, 7, 3, 6, 6, 0, 4, 2, 9, 6, 8, 4, 4, 7, 8, 9, 1, 6, 1, 2, 5,
        9, 1, 8, 4, 6, 4, 8, 5, 5, 8, 4, 7, 4, 0, 7, 3, 7, 7, 7, 3, 0, 6, 5, 7,
        7, 3, 9, 9, 8, 5, 5, 5, 9, 2, 0, 5, 1, 8, 4, 9, 5, 9, 5, 9, 7, 1, 4, 9,
        2, 2, 8, 2, 2, 2, 0, 0, 5, 0, 3, 0, 1, 1, 0, 5, 7, 7, 4, 8, 9, 5, 5, 0,
        8, 1, 0, 1, 4, 5, 0, 1, 4, 8, 6, 0, 7, 3, 6, 2, 5, 9, 6, 8, 8, 2, 3, 5,
        7, 8, 2, 1, 6, 0, 5, 0, 3, 7, 1, 9, 5, 0, 1, 4, 9, 3, 7, 5, 6, 4, 5, 6,
        3, 8, 3, 6, 9, 1, 7, 0, 6, 7, 8, 9, 3, 6, 0, 7, 3, 2, 2, 0, 6, 5, 6, 1,
        4, 2, 9, 1, 2, 1, 5, 2, 7, 8, 2, 2, 0, 8, 6, 9], device='cuda:0')
tensor([[nan, nan, nan,  ..., nan, nan, nan],
        [nan, nan, nan,  ..., nan, nan, nan],
        [nan, nan, nan,  ..., nan, nan, nan],
        ...,
        [nan, nan, nan,  ..., nan, nan, nan],
        [nan, nan, nan,  ..., nan, nan, nan],
        [nan, nan, nan,  ..., nan, nan, nan]], device='cuda:0',
       grad_fn=<MmBackward>)
sdactivations: nan, sdloss: nan
lossItem: nan
------------------------------------------------------------------------------------------------------------

I tried it 5 times, and it becomes nan around the 13th iteration each time. The first tensor in each iteration is the target and the second is the output.

Since the nan seems to occur around the same iteration every run, could it be something in your batch? Have you tried shuffling the data in your DataLoader? Also, one thing I noticed is that the output tensor on iteration 13 has identical rows. Could that be the issue?
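For the shuffling suggestion, the flag is on the DataLoader itself. A minimal sketch with dummy data (the dataset here is made up for illustration; with `shuffle=True`, a problematic batch should no longer surface at the same iteration on every run):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy tensors standing in for the real dataset
data = torch.randn(32, 4)
targets = torch.randint(0, 10, (32,))

# shuffle=True re-draws the batch composition every epoch
loader = DataLoader(TensorDataset(data, targets), batch_size=8, shuffle=True)

for x, y in loader:
    pass  # training step goes here
```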