PyTorch loss function exception in Google Colab

On my machine I was able to train a CNN using:

PyTorch 1.6.0 and 
Cuda compilation tools, release 11.6, V11.6.112
Build cuda_11.6.r11.6/compiler.30978841_0

I needed more memory to expand my testing, so I ported my code to Google Colab on a T4 GPU instance. Every time I run the code, I get an exception from the loss function nn.CrossEntropyLoss() during training. I make sure to reset the session between runs.

The exception trace ends inside PyTorch's NLL loss code, so I looked at the implementation:

pytorch/aten/src/ATen/native/LossNLL.cpp at commit 2981534f54d49fa3a9755c9b0855e7929c2527f0 (pytorch/pytorch on GitHub)

to see where an exception or error message might be raised, but the exceptions that can be raised there mostly guard preconditions. Those preconditions shouldn’t be violated, because I use nn.CrossEntropyLoss() with its default configuration.

I added

    import os

    # CUDA_LAUNCH_BLOCKING has to be set before any CUDA work happens,
    # hence before the torch import
    os.environ['CUDA_LAUNCH_BLOCKING'] = "1"

    import torch
    import torchvision

    from torch import nn

at the top of my code block, but I still can’t get a useful error message. When I looked at the implementation of cross entropy on my machine, I noticed that it differs from what is currently implemented in functional.py:

    if not torch.jit.is_scripting():
        tens_ops = (input, target)
        if any([type(t) is not Tensor for t in tens_ops]) and has_torch_function(tens_ops):
            return handle_torch_function(
                cross_entropy, tens_ops, input, target, weight=weight,
                size_average=size_average, ignore_index=ignore_index, reduce=reduce,
                reduction=reduction)
    if size_average is not None or reduce is not None:
        reduction = _Reduction.legacy_get_string(size_average, reduce)
    return nll_loss(log_softmax(input, 1), target, weight, None, ignore_index, None, reduction)

In the older version, it calls nll_loss and passes the log-softmax of the input as the first argument, which no longer happens in the current version:

    if size_average is not None or reduce is not None:
        reduction = _Reduction.legacy_get_string(size_average, reduce)
    return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index, label_smoothing)
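As far as I can tell, the fused call should be numerically equivalent to the old log_softmax + nll_loss decomposition, so the refactor itself shouldn’t matter. A quick sanity check with random data (shapes here just match my model) shows the two agree:

    import torch
    import torch.nn.functional as F

    logits = torch.randn(8, 25)           # batch of 8, 25 classes
    targets = torch.randint(0, 25, (8,))  # valid class indices

    fused = F.cross_entropy(logits, targets)
    manual = F.nll_loss(F.log_softmax(logits, dim=1), targets)
    print(torch.allclose(fused, manual))  # True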

Here is my training code:

def train(network, loss, loader, optimizer):
    network.train()

    for batch_index, (image_tensor, target_output) in enumerate(loader):
        image_tensor = image_tensor.float().cuda()
        target_output = target_output.cuda()

        logits = network(image_tensor)
        print(f"Logits shape: {logits.shape} of {logits}")
        #print(f"Target: {target_output.shape} of {target_output}")

        result = loss(logits, target_output)

        result.backward()
        optimizer.step()
        optimizer.zero_grad()

I also tried passing softmaxed inputs to the loss function in my Colab code block, but that didn’t make any difference. The documentation says the input should be unnormalized logits, and mine appear to be:

        [-8.3126e-01, -9.1190e-01,  8.6249e-01, -2.0465e-02, -8.8040e-01,
          5.0193e-02, -4.5052e-02,  7.8031e-02,  4.2935e-01, -1.3772e-01,
         -6.2306e-01,  9.9225e-02,  7.3910e-01,  2.2487e-01,  8.8633e-02,
          3.9571e-01,  2.3213e-01, -2.1260e-01, -3.4335e-01,  1.3615e-01,
         -5.2874e-01,  1.5202e-01, -7.1269e-02,  4.8146e-01,  3.4631e-01]

Any ideas? I cannot change the CUDA or PyTorch versions on my machine. It is old and space is becoming an issue.

Thanks for any help!

Your target most likely contains class indices that are out of range ([0, nb_classes - 1] would be expected), so double-check it.
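You could add an explicit check right before the loss calculation to get a readable error instead of the asynchronous CUDA assert (a minimal sketch; check_targets is just a helper name, not a PyTorch API):

    import torch

    def check_targets(targets: torch.Tensor, nb_classes: int) -> None:
        # Raise a readable Python error instead of a device-side assert.
        lo, hi = targets.min().item(), targets.max().item()
        if lo < 0 or hi >= nb_classes:
            raise ValueError(
                f"targets must be in [0, {nb_classes - 1}], got min={lo}, max={hi}"
            )

    # e.g. in the training loop, before result = loss(logits, target_output):
    # check_targets(target_output, logits.size(1))

Alternatively, run the failing batch on the CPU; the CPU implementation of nn.CrossEntropyLoss raises a clear IndexError for out-of-range targets.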


Thanks for the response!

I was able to get a new Colab instance just now and ran some checks. It turns out that the target does include out-of-bounds indices. Why would that happen? It doesn’t happen on my machine, and the dataset on Colab matches my local one. Here is what I print before the error occurs:

Logits shape: torch.Size([8, 25]) of tensor([[ 0.0151,  0.0844,  0.0451, -0.0229,  0.0260, -0.0169,  0.0610,  0.0382,
         -0.0026,  0.0551, -0.0470, -0.0492,  0.0333,  0.0074,  0.0731,  0.0416,
          0.0275, -0.0459, -0.0892,  0.0045, -0.0972, -0.0312,  0.0122,  0.0468,
          0.0102],
        [-0.0039,  0.1567,  0.0278,  0.0092,  0.0556,  0.0004,  0.1217,  0.0848,
         -0.0496, -0.0105, -0.0650, -0.0508, -0.0378, -0.0098,  0.0972,  0.0051,
          0.0674, -0.0691, -0.1155,  0.0150, -0.0842, -0.0309, -0.0158,  0.0422,
          0.0472],
        [-0.0005,  0.0877,  0.0594, -0.0100,  0.0744, -0.0077,  0.0556,  0.0758,
         -0.0165, -0.0194, -0.0294, -0.0602,  0.0110, -0.0065,  0.0406,  0.0388,
          0.0323, -0.0367, -0.1238, -0.0075, -0.0427, -0.0403, -0.0252,  0.0734,
          0.0309],
        [-0.0079,  0.0864,  0.0343, -0.0028,  0.0288, -0.0207,  0.0658,  0.0593,
          0.0326, -0.0038, -0.0774, -0.0244, -0.0158,  0.0162,  0.0489,  0.0347,
          0.0807, -0.0612, -0.0935, -0.0071, -0.0875, -0.0078, -0.0237,  0.0173,
          0.0240],
        [-0.0185,  0.0393,  0.0462, -0.0302,  0.0518,  0.0097,  0.0618,  0.0788,
         -0.0336,  0.0118, -0.0674, -0.0130,  0.0396,  0.0139,  0.0370,  0.0368,
          0.0216, -0.0368, -0.0863, -0.0132, -0.0775,  0.0270, -0.0112,  0.0196,
          0.0276],
        [-0.0343,  0.0708,  0.0013,  0.0171,  0.0727,  0.0008,  0.0888,  0.0667,
         -0.0466,  0.0104, -0.0821, -0.0451, -0.0173,  0.0220,  0.0713,  0.0307,
          0.0053, -0.1087, -0.1545, -0.0322, -0.0770, -0.0472, -0.0371,  0.0284,
          0.0477],
        [-0.0188,  0.0784,  0.0415, -0.0275,  0.0240,  0.0069,  0.1000,  0.0723,
         -0.0234, -0.0108, -0.0928,  0.0120,  0.0471, -0.0183,  0.0918,  0.0422,
          0.0310, -0.0797, -0.1333, -0.0159, -0.0848,  0.0143, -0.0265,  0.0639,
          0.0391],
        [-0.0160,  0.0411,  0.0625,  0.0094,  0.0911,  0.0530,  0.1495,  0.0739,
         -0.0588,  0.0484, -0.0616, -0.0560,  0.0106,  0.0077,  0.0703,  0.0132,
          0.0675, -0.1769, -0.1195, -0.0349, -0.1200,  0.0174, -0.0104,  0.0842,
         -0.0053]], device='cuda:0', grad_fn=<AddmmBackward0>)
Target: torch.Size([8]) of tensor([15,  9, 20, 18, 28, 24, 12, 26], device='cuda:0')

The shape looks right to me: there are 8 rows (since that is my batch size), each with 25 entries, which is the number of predictions I wanted to make. But the target above contains 26 and 28, which are out of range for 25 classes ([0, 24]).

Here is my code to train the CNN:

outputs = 25

train_loader = get_training_set(number_of_classes=outputs)
test_loader = get_test_set(number_of_classes=outputs)

recognizer = FaceRecognizerV11(outputs)

recognizer.to(torch.device("cuda"))

learning_rate = 0.01
loss = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(recognizer.parameters(), lr=learning_rate, momentum=0.9, weight_decay=0.0005)
epochs = 75

print("Beginning training rounds")

for e in range(epochs):
    train(recognizer, loss, train_loader, optimizer)
    test(recognizer, test_loader, loss)
    print(f"Completed round {e + 1} of training")

print("Done")

The CNN:

class FaceRecognizerV11(nn.Module):
    def __init__(self, n_classes):
        super().__init__()

        ...

        self.dnn = nn.Sequential(
            nn.Linear(1024, 4096), nn.ReLU(),
            nn.Linear(4096, 4096), nn.ReLU(),
            nn.Linear(4096, n_classes)
        )

    def forward(self, image_tensor):
        v1 = self.v1(image_tensor)
        v2 = self.v2(v1)
        v3 = self.v3(v2)
        v4 = self.v4(v3)
        return self.dnn(v4)

I made a helper function to load the correct number of images based on how many classes I want to predict:

def get_training_set(number_of_classes=0):
    train_image_loader = ImageFolder("./classification_data/train_data", transform=pil_resize_transform)

    if number_of_classes > 0:
        # count the files in the first number_of_classes directories
        # (note: os.walk returns directories in arbitrary order, while
        # ImageFolder sorts its classes alphabetically)
        directories = next(os.walk("./classification_data/train_data"))[1]

        dir_count = 0
        file_count = 0

        for dir in directories:
            if dir_count == number_of_classes:
                break

            files = next(os.walk(f"./classification_data/train_data/{dir}"))[2]
            file_count += len(files)
            dir_count += 1

        print(f"Training on {file_count} files for {number_of_classes} classes")

        # keep only the first file_count samples of the full ImageFolder
        train_subset = Subset(train_image_loader, range(0, file_count))
        return DataLoader(train_subset, batch_size=8, shuffle=True)

    return DataLoader(train_image_loader, batch_size=8, shuffle=True)

I don’t train on the entire dataset because it is fairly large, and since I am trying to learn, I want feedback as quickly as possible.

I ran my code twice on Colab just now: the first training step completes successfully, but on the second iteration of the loop I get the same error I posted originally.

Logits shape: torch.Size([8, 25]) of tensor([[-0.1255,  0.0759, -0.1150,  0.1603,  0.0724,  0.1047, -0.1426,  0.0664,
          0.0272, -0.0143, -0.0096, -0.1174, -0.0596,  0.0459,  0.0320, -0.0280,
         -0.1828, -0.0235, -0.0302, -0.0399, -0.0015,  0.0222, -0.1437,  0.1746,
          0.0822],
        [-0.0871,  0.0553, -0.0816,  0.0923,  0.0888,  0.0804, -0.0912,  0.1014,
         -0.0172, -0.0050, -0.0110, -0.0460,  0.0112,  0.0640,  0.0230, -0.0584,
         -0.0829,  0.0670, -0.0285, -0.0625,  0.0008,  0.0550, -0.1184,  0.1357,
          0.0782],
        [-0.0638,  0.0372, -0.0814,  0.1174,  0.0483,  0.0111, -0.1310,  0.0859,
          0.0242, -0.0220, -0.0253, -0.0134, -0.0573,  0.0361,  0.0266, -0.0582,
         -0.0495, -0.0031, -0.0668, -0.0588,  0.0009,  0.0561, -0.1064,  0.1379,
          0.0719],
        [-0.0992,  0.0647, -0.1112,  0.1164,  0.0308, -0.0150, -0.0795,  0.0950,
          0.0232, -0.0195, -0.0315, -0.0385, -0.0428,  0.0391, -0.0045, -0.0251,
         -0.0842,  0.0243, -0.0696, -0.0879, -0.0141,  0.0234, -0.0790,  0.1146,
          0.0256],
        [-0.0671,  0.0155, -0.0912,  0.0957,  0.0118,  0.0564, -0.1442,  0.1169,
          0.0408,  0.0200, -0.0681, -0.0240, -0.0319,  0.0412, -0.0013, -0.0427,
         -0.1007,  0.0007, -0.0449, -0.0658,  0.0297,  0.0449, -0.0994,  0.1518,
          0.0570],
        [-0.0590,  0.0193, -0.0924,  0.0565,  0.0405,  0.0223, -0.0490,  0.1009,
          0.0535,  0.0249, -0.0135, -0.0146, -0.0386,  0.0249,  0.0230, -0.0194,
         -0.0945,  0.0163, -0.0410, -0.0414,  0.0143,  0.0340, -0.0820,  0.1266,
          0.0539],
        [-0.0470,  0.0680, -0.0393,  0.0289,  0.0345,  0.0081, -0.0273,  0.0604,
          0.0018,  0.0367, -0.0272, -0.0118, -0.0163,  0.0150, -0.0177, -0.0669,
         -0.0364,  0.0126, -0.0282, -0.0310,  0.0139,  0.0238, -0.0688,  0.0839,
          0.0280],
        [-0.0768,  0.0333, -0.0688,  0.0571,  0.0643,  0.0533, -0.0821,  0.1093,
          0.0164,  0.0002,  0.0059, -0.0643, -0.0437,  0.0219, -0.0309, -0.0227,
         -0.0997, -0.0299, -0.0487, -0.0477,  0.0039,  0.0603, -0.0688,  0.1243,
          0.0600]], device='cuda:0', grad_fn=<AddmmBackward0>)
Target: torch.Size([8]) of tensor([ 5, 12, 16, 17, 23,  3,  8, 15], device='cuda:0')
====== Completed training step ======
Logits shape: torch.Size([8, 25]) of tensor([[-0.0973, -0.0115, -0.1486,  0.1848,  0.0162,  0.1735, -0.1497,  0.0244,
          0.0375, -0.0495, -0.0872, -0.0726,  0.0295,  0.0015,  0.0055,  0.0468,
          0.0407,  0.0967, -0.0670, -0.1057, -0.0734, -0.0044, -0.1296,  0.2242,
          0.0653],
        [-0.0921, -0.0143, -0.1141,  0.1603,  0.0211,  0.1760, -0.1089,  0.0246,
          0.0066, -0.0569, -0.0460, -0.1028,  0.0758, -0.0254, -0.0558,  0.0327,
          0.0198,  0.0699, -0.0698, -0.0811, -0.0742,  0.0155, -0.1050,  0.1628,
          0.0224],
        [-0.1607, -0.0140, -0.1222,  0.1382, -0.0308,  0.1994, -0.1366,  0.0672,
          0.0695, -0.0317, -0.0238, -0.0791,  0.1253, -0.0059, -0.0251,  0.0416,
          0.0348,  0.0810, -0.0543, -0.1101, -0.0832,  0.0097, -0.1652,  0.2165,
          0.0247],
        [-0.0823,  0.0317, -0.0992,  0.1124,  0.0386,  0.1233, -0.0848,  0.0384,
          0.0240, -0.0274, -0.0597, -0.0358,  0.0575, -0.0125, -0.0343,  0.0070,
          0.0005,  0.0471, -0.0644, -0.0861, -0.0565, -0.0235, -0.1466,  0.1857,
         -0.0085],
        [-0.1242,  0.0043, -0.1284,  0.1757,  0.0132,  0.1958, -0.1380,  0.0207,
         -0.0116, -0.0246, -0.0477, -0.0749,  0.0755,  0.0006, -0.0160,  0.0366,
          0.0113,  0.0806, -0.0480, -0.0598, -0.0597, -0.0116, -0.1059,  0.1659,
          0.0109],
        [-0.1300, -0.0492, -0.1283,  0.1664, -0.0068,  0.1810, -0.1465,  0.0377,
          0.0473, -0.0555, -0.0953, -0.0392,  0.1370, -0.0408, -0.0563, -0.0244,
          0.0447,  0.1308, -0.0548, -0.0945, -0.0883, -0.0067, -0.1399,  0.2268,
          0.0091],
        [-0.1834, -0.0033, -0.1602,  0.2314, -0.0280,  0.1942, -0.1943,  0.0206,
          0.0547, -0.0851, -0.0766, -0.0805,  0.1020,  0.0066, -0.0569,  0.0104,
         -0.0272,  0.0943, -0.0597, -0.1417, -0.0371, -0.0107, -0.1878,  0.2559,
          0.0331],
        [-0.1011, -0.0402, -0.1285,  0.1799,  0.0103,  0.1297, -0.1156,  0.0409,
          0.0411, -0.0323, -0.0626, -0.0635,  0.0572,  0.0094, -0.0226,  0.0734,
          0.0255,  0.1027, -0.0821, -0.0864, -0.0647, -0.0055, -0.1096,  0.2382,
          0.0116]], device='cuda:0', grad_fn=<AddmmBackward0>)
Target: torch.Size([8]) of tensor([27, 20,  9, 26,  8, 23,  6, 10], device='cuda:0')

On the first step the targets are in range, but on the second step I am getting out-of-bounds indices again.

In the past, other users have hit similar issues when they, e.g., did not realize that additional folders had been created, increasing the class count. The targets won’t change randomly on their own.


So I took a look at the data-loading portion of my code, and it turns out that Subset only holds references to a subset of the data, not the classes/counts themselves. So even though it subsets the correct number of images, it retains the total number of classes loaded by ImageFolder earlier on, which is 4000… I confirmed this behavior on both my Colab instance and my local machine.
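A minimal repro of what I saw (same directory layout as above; transform omitted for brevity):

    from torch.utils.data import Subset
    from torchvision.datasets import ImageFolder

    full = ImageFolder("./classification_data/train_data")
    subset = Subset(full, range(100))

    print(len(full.classes))  # 4000 -- Subset does not shrink the class list
    print(len(subset))        # 100  -- only the sample indices are restricted
    # Subset just stores indices into the underlying dataset, so a sample's
    # label can still be any of the 4000 original class indices.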

I was hoping I could load all of the images I have and then incrementally take slices to test on. It seems like the best and easiest way to do this is to just create a directory with only the subset of classes/images I want to test.
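Another option that looks like it would work (just a sketch; get_training_subset is a hypothetical replacement for my helper) is to filter the ImageFolder samples by label instead of slicing by position:

    from torch.utils.data import DataLoader, Subset
    from torchvision.datasets import ImageFolder

    def get_training_subset(root, number_of_classes, transform=None):
        full = ImageFolder(root, transform=transform)
        # ImageFolder assigns labels 0..N-1 in sorted class order, so keeping
        # labels < number_of_classes keeps exactly the first classes, and all
        # retained labels stay in range for a model with that many outputs.
        keep = [i for i, (_, label) in enumerate(full.samples)
                if label < number_of_classes]
        return DataLoader(Subset(full, keep), batch_size=8, shuffle=True)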

I still don’t know why my machine doesn’t run into these indexing issues. I have trained my CNN on it dozens of times and it has never been a problem; this only started happening when I ported everything to Colab’s T4 GPU, even though Colab has the same dataset I work with locally.

In any case, this no longer has anything to do with the loss function. Thank you very much for the assistance @ptrblck!
