CNN gets stuck in loop - probably caused by the if statement

I have had a lot of problems with this notebook, but hopefully this is the last one:

I now have:

  • All my inputs as tensors
  • Both the data and model (including fc) on the GPU
  • Resized all the images to the same size
  • Changed requires_grad = True for the fc

My model only does one forward pass before sitting idle, though. I think it is something to do with the if statement: when steps = 1 and print_every = 1, steps % print_every == 0, so it enters the if block and then seems to get stuck?

# Train the classifier
def train_classifier(model, optimizer, criterion, train_loader, valid_loader, epochs):

    steps = 0
    print_every = 1

    for e in range(epochs):

        model.train()

        running_loss = 0

        for images, labels in iter(train_loader):            
            images, labels = images.cuda(), labels.cuda()
    
            steps += 1
            print("Steps: " + str(steps))

            optimizer.zero_grad()
        
            output = model.forward(images)
            print("Output " + str(output))
            loss = criterion(output, labels)
            print("Loss: " + str(loss))
            loss.backward()
            optimizer.step()

            running_loss += loss.item()
            print("Running loss: " + str(running_loss))

            if steps % print_every == 0:

                model.eval()

                # Turn off gradients for validation, saves memory and computations
                with torch.no_grad():
                    validation_loss, accuracy = validation(model, valid_loader, criterion, device)

                print("Epoch: {}/{}.. ".format(e+1, epochs),
                      "Training Loss: {:.3f}.. ".format(running_loss/print_every),
                      "Validation Loss: {:.3f}.. ".format(validation_loss/len(validate_loader)),
                      "Validation Accuracy: {:.3f}".format(accuracy/len(validate_loader)))

                running_loss = 0
                model.train()
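
For context, the function is called along these lines (a sketch; epochs=5 matches the 1/5 in the output, and the model, optimizer, and criterion are defined earlier in the notebook):

train_classifier(model, optimizer, criterion, train_loader, valid_loader, epochs=5)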

Output:

Steps: 1
Output tensor([[-0.8534, -0.5550],
        [-0.8226, -0.5786],
        [-0.9021, -0.5204],
        [-0.5913, -0.8066],
        [-0.7069, -0.6796],
        [-0.6809, -0.7055],
        [-0.7779, -0.6150],
        [-0.9171, -0.5103],
        [-0.7158, -0.6710],
        [-0.6874, -0.6989]], device='cuda:0', grad_fn=<LogSoftmaxBackward>)
Loss: tensor(0.7274, device='cuda:0', grad_fn=<NllLossBackward>)
Running loss: 0.7273765206336975

As we can see, the loop works fine until it reaches the if statement.

If you are using multiple workers for the DataLoaders, try num_workers=0 and rerun the code.
If that works, try running the code as a script from your terminal instead of a notebook.
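
For example (a minimal sketch; train_dataset and valid_dataset stand in for whatever Dataset objects your notebook already defines):

from torch.utils.data import DataLoader

# Rebuild the loaders single-process to rule out worker/multiprocessing issues
train_loader = DataLoader(train_dataset, batch_size=10, shuffle=True, num_workers=0)
valid_loader = DataLoader(valid_dataset, batch_size=10, shuffle=False, num_workers=0)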

Let me know if that doesn't help.

OK, I changed num_workers from 2 to 0.

It did seem to run past step 1, as it then gave a variable name error inside the if statement (my fault: I had typed validate_loader instead of valid_loader). However, that one step took 40+ minutes. I will add a timer now to get an exact time, but it seems this script may take days to run. Am I right in thinking that the solution may be a Learning Rate Scheduler? Although would this have any effect after just one step?
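
Something like this around the training step should give an honest per-step time (a sketch; the torch.cuda.synchronize() calls are there because CUDA ops are asynchronous, so a plain timer can otherwise stop before the GPU has finished):

import time
import torch

for images, labels in train_loader:
    images, labels = images.cuda(), labels.cuda()

    torch.cuda.synchronize()    # wait for any pending GPU work before starting the clock
    start = time.perf_counter()

    optimizer.zero_grad()
    output = model(images)
    loss = criterion(output, labels)
    loss.backward()
    optimizer.step()

    torch.cuda.synchronize()    # wait for this step's GPU work before stopping the clock
    print("Step time: {:.2f}s".format(time.perf_counter() - start))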

I also can't run it from the terminal, as I am coding on a Mac and can't use CUDA, which is why I am running it through the normal Kaggle notebooks, which also host the data.

Safe to say something is likely wrong, as it has been committing for nearly 5 hours now and still has not finished.

Notebook: https://www.kaggle.com/blueturtle/siim-cnn-intro

Edit: It ran for 9 hours before timing out.

Update:

# Train the classifier
def train_classifier(model, optimizer, criterion, train_loader, valid_loader, epochs):

    steps = 0
    print_every = 1

    for e in range(epochs):

        model.train()

        running_loss = 0

        for images, labels in iter(train_loader):            
            images, labels = images.cuda(), labels.cuda()
    
            steps += 1
            print("Steps: " + str(steps))

            optimizer.zero_grad()
        
            output = model.forward(images)
            print("Output " + str(output))
            loss = criterion(output, labels)
            print("Loss: " + str(loss))
            loss.backward()
            optimizer.step()

            running_loss += loss.item()
            print("Running loss: " + str(running_loss))

            #model.eval()

            # Turn off gradients for validation, saves memory and computations
            #with torch.no_grad():
            #    validation_loss, accuracy = validation(model, valid_loader, criterion, device)

            #print("Epoch: {}/{}.. ".format(e+1, epochs),
             #     "Training Loss: {:.3f}.. ".format(running_loss/print_every),
              #    "Validation Loss: {:.3f}.. ".format(validation_loss/len(valid_loader)),
               #   "Validation Accuracy: {:.3f}".format(accuracy/len(valid_loader)))

            running_loss = 0
            model.train()

First I removed the if statement to see if print_every was the problem, and it still did not work.

Edit: The problem seems to be with:

            # Turn off gradients for validation, saves memory and computations
            with torch.no_grad():
                validation_loss, accuracy = validation(model, valid_loader, criterion, device)

The validation function is:

def validation(model, valid_loader, criterion, device):
    
    val_loss = 0
    accuracy = 0
    
    for images, labels in iter(valid_loader):

        images, labels = images.to(device), labels.to(device)

        output = model.forward(images)
        val_loss += criterion(output, labels).item()

        # The model outputs log-probabilities (LogSoftmax), so exponentiate to recover probabilities
        probabilities = torch.exp(output)

        # Predicted class = index of the highest probability; compare against the labels
        equality = (labels.data == probabilities.max(dim=1)[1])
        accuracy += equality.type(torch.FloatTensor).mean()
    
    return val_loss, accuracy

Are you sure the problem isn't just that the validation loop takes some time, which might look like the script is hanging? 🙂
I've added a print statement to the validation method and can see it's quite slow.
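
i.e. roughly like this (the same function as above, just with a progress print so a slow loop is distinguishable from a hung one):

import torch

def validation(model, valid_loader, criterion, device):

    val_loss = 0
    accuracy = 0

    for i, (images, labels) in enumerate(valid_loader):
        # Progress print: shows the loop is alive and how far through it is
        print("Validation batch {}/{}".format(i + 1, len(valid_loader)))

        images, labels = images.to(device), labels.to(device)

        output = model(images)
        val_loss += criterion(output, labels).item()

        probabilities = torch.exp(output)
        equality = (labels.data == probabilities.max(dim=1)[1])
        accuracy += equality.type(torch.FloatTensor).mean()

    return val_loss, accuracy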

Well, it took 9 hours yesterday, so I'd say something is wrong?

With the validation function, the model only does 1 step before hanging for the 9 hours.
Without the validation function, the model takes about 5 seconds per step and runs as normal.

Steps: 1
Output tensor([[-0.6720, -0.7148],
        [-0.6492, -0.7391],
        [-0.8532, -0.5552],
        [-0.5556, -0.8527],
        [-0.7415, -0.6470],
        [-0.5553, -0.8531],
        [-0.6171, -0.7755],
        [-0.7120, -0.6746],
        [-0.7319, -0.6558],
        [-0.8165, -0.5834]], device='cuda:0', grad_fn=<LogSoftmaxBackward>)
Loss: tensor(0.6867, device='cuda:0', grad_fn=<NllLossBackward>)
Running loss: 0.6866830587387085
Epoch: 1/5..  Training Loss: 0.687.. 
Steps: 2
Output tensor([[-0.6337, -0.7563],
        [-0.5863, -0.8128],
        [-0.6071, -0.7873],
        [-0.4860, -0.9547],
        [-0.5577, -0.8499],
        [-0.5574, -0.8503],
        [-0.4319, -1.0478],
        [-0.5193, -0.9036],
        [-0.4108, -1.0880],
        [-0.4097, -1.0902]], device='cuda:0', grad_fn=<LogSoftmaxBackward>)
Loss: tensor(0.5200, device='cuda:0', grad_fn=<NllLossBackward>)
Running loss: 0.5199903845787048
Epoch: 1/5..  Training Loss: 0.520.. 
Steps: 3
Output tensor([[-0.3847, -1.1416],
        [-0.5190, -0.9042],
        [-0.2826, -1.4018],
        [-0.4019, -1.1059],
        [-0.4460, -1.0222],
        [-0.2860, -1.3915],
        [-0.3044, -1.3378],
        [-0.3103, -1.3215],
        [-0.3084, -1.3265],
        [-0.4343, -1.0432]], device='cuda:0', grad_fn=<LogSoftmaxBackward>)
Loss: tensor(0.3677, device='cuda:0', grad_fn=<NllLossBackward>)
Running loss: 0.3677431643009186
Epoch: 1/5..  Training Loss: 0.368.. 
Steps: 4
Output tensor([[-0.1690, -1.8613],
        [-0.3174, -1.3020],
        [-0.2780, -1.4160],
        [-0.3510, -1.2173],
        [-0.3469, -1.2272],
        [-0.2468, -1.5200],
        [-0.3050, -1.3361],
        [-0.3087, -1.3258],
        [-0.2869, -1.3886],
        [-0.2056, -1.6830]], device='cuda:0', grad_fn=<LogSoftmaxBackward>)
Loss: tensor(0.2815, device='cuda:0', grad_fn=<NllLossBackward>)
Running loss: 0.2815292477607727
Epoch: 1/5..  Training Loss: 0.282.. 

Sample output without the validation function included.

Try adding a print statement to the validation loop, or profile each step.
The Kaggle notebook took approx. 1 second per validation step, so depending on the size of the validation dataset, this might take some time.
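
For the per-step profiling, something like this works (a sketch using CUDA events, which time the GPU work itself):

import torch

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

for i, (images, labels) in enumerate(valid_loader):
    images, labels = images.to(device), labels.to(device)

    start.record()
    output = model(images)
    loss = criterion(output, labels)
    end.record()

    torch.cuda.synchronize()    # elapsed_time is only valid after both events have completed
    print("Batch {}: {:.1f} ms".format(i, start.elapsed_time(end)))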

Doh. I had set print_every = 1, so it was running the whole validation function (on ~5,000 images) after every single step. Combined with a batch size of only 10, this meant that for every 10 training images (out of 21,200) it was running the whole validation function.

I am still surprised that it couldn't even run the validation function once in 9 hours, but I have increased the batch size and the print_every condition, so fingers crossed.
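
For scale, a rough back-of-the-envelope with the numbers from this thread (the ~1 second per validation step is the estimate from the reply above):

train_images = 21_200
val_images = 5_000
batch_size = 10
print_every = 1          # the original setting: validate after every training step
sec_per_val_batch = 1.0  # rough estimate from above

train_steps = train_images // batch_size   # 2,120 training steps per epoch
val_batches = val_images // batch_size     # 500 batches per validation pass
val_passes = train_steps // print_every    # 2,120 full validation passes per epoch

val_hours = val_passes * val_batches * sec_per_val_batch / 3600
print("~{:.0f} hours of validation per epoch".format(val_hours))  # ~294 hours

So with those settings, the 9-hour timeout was never going to get close to finishing even the first epoch.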