CNN Training Loop Not Working

ziad · February 17, 2021, 3:42pm

Hello!

I’ve built a CNN model which I’m now attempting to train using a conventional for-loop. For the past few days I haven’t been able to get the loop to work properly: it keeps printing the epochs continuously until I have to interrupt it.

This is the code I’m using:

epochs = 30
training_loss = []

for epoch in range(epochs):

    running_loss = 0

    #Train
    cnn.train()

    for images, labels in trainloader:
        if cuda:
            images, labels = images.cuda(), labels.cuda()

        optimiser.zero_grad()
        outputs = cnn(images)
        loss = criterion(outputs, labels)
        running_loss += loss.item() * images.size(0)
        loss.backward()
        optimiser.step()
    
        epochs_train_loss = running_loss / len(trainloader.dataset)
        print ("Epoch {}, Training Loss: {}".format(epoch, training_loss))

I’ve tried to place both

epochs_train_loss = running_loss / len(trainloader.dataset)
print ("Epoch {}, Training Loss: {}".format(epoch, training_loss))

outside the inner for-loop, but then nothing happens - the cell just runs indefinitely. I’m using a training set of 5000 images.
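For reference, here is a minimal sketch of the same loop with the epoch-level bookkeeping moved out of the inner loop. The tiny model and random tensors are placeholders (assumptions) standing in for the original `cnn` and `trainloader`, which are not shown in the post. Note that the print uses `epochs_train_loss`; the posted code formats the list `training_loss`, which is never filled.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Placeholder data and model (assumptions): 64 random 1x8x8 "images", 2 classes.
dataset = TensorDataset(torch.randn(64, 1, 8, 8), torch.randint(0, 2, (64,)))
trainloader = DataLoader(dataset, batch_size=16, shuffle=True)
cnn = nn.Sequential(
    nn.Conv2d(1, 4, kernel_size=3),  # 8x8 input -> 6x6 feature maps
    nn.ReLU(),
    nn.Flatten(),
    nn.Linear(4 * 6 * 6, 2),
)
criterion = nn.CrossEntropyLoss()
optimiser = torch.optim.SGD(cnn.parameters(), lr=0.01)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
cnn.to(device)

epochs = 3
training_loss = []

for epoch in range(epochs):
    running_loss = 0.0
    cnn.train()

    for images, labels in trainloader:
        images, labels = images.to(device), labels.to(device)

        optimiser.zero_grad()
        outputs = cnn(images)
        loss = criterion(outputs, labels)
        loss.backward()
        optimiser.step()

        # Weight the batch-mean loss by batch size so the epoch average is per sample.
        running_loss += loss.item() * images.size(0)

    # Runs once per epoch, after the inner loop has consumed every batch.
    epochs_train_loss = running_loss / len(trainloader.dataset)
    training_loss.append(epochs_train_loss)
    print("Epoch {}, Training Loss: {:.4f}".format(epoch, epochs_train_loss))
```

With this layout one line is printed per epoch. If the cell still never returns, the cause is likely outside the loop body itself.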

ptrblck · February 18, 2021, 5:06am

Could you check the length of the trainloader via print(len(trainloader))? This should return the number of batches, i.e. the number of iterations of the inner loop per epoch.
Assuming you are using the map-style Dataset, which implements the __len__ method, each epoch should only yield the specified number of samples.
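As a rough sketch of what that length means: for a map-style dataset with a fixed batch size, the number of batches is the dataset size divided by the batch size, rounded up (or rounded down if `drop_last=True`). The helper `num_batches` and the batch size of 32 are hypothetical and only illustrate the arithmetic; the actual batch size was not posted.

```python
import math

def num_batches(num_samples, batch_size, drop_last=False):
    """Batches per epoch for a map-style dataset with a fixed batch size."""
    if drop_last:
        # The final, smaller batch is discarded.
        return num_samples // batch_size
    # The final batch may be smaller than batch_size but is still yielded.
    return math.ceil(num_samples / batch_size)

# Hypothetical numbers: 5000 samples, batch size 32.
print(num_batches(5000, 32))                  # 157
print(num_batches(5000, 32, drop_last=True))  # 156
```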

ziad · February 19, 2021, 9:30am

Hey @ptrblck ,

thanks for the reply. print(len(trainloader)) returned 163 - how does this impact my loop?

ptrblck · February 19, 2021, 10:43am

The DataLoader loop should exit after 163 iterations and would be executed epochs times.
Add a print statement to the inner loop and check the current loop index.
Since this loop apparently never exits, it would help to debug this issue.
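A minimal sketch of such a debug print, using placeholder tensors (an assumption) in place of the real dataset. Note the unpacking: `enumerate` yields `(index, batch)`, and the batch itself unpacks to `(images, labels)`.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder data (assumption): 10 samples, batch size 4 -> 3 batches.
dataset = TensorDataset(torch.randn(10, 3), torch.arange(10))
trainloader = DataLoader(dataset, batch_size=4)

seen = []
for batch_idx, (images, labels) in enumerate(trainloader):
    # Print the position in the epoch and the size of the current batch.
    print("batch {}/{} size {}".format(batch_idx + 1, len(trainloader), images.size(0)))
    seen.append(batch_idx)
```

If the last printed index equals `len(trainloader)` and the outer loop still does not advance, the loader is probably not the part that is stuck.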

ziad · February 19, 2021, 12:45pm

Hi @ptrblck - thanks for replying again.

I’ve tried updating my code to the following:

num_epochs = 10
count = 0
loss_list = []
iteration_list = [] 
for epoch in range(num_epochs):
    for i, (images, labels) in enumerate(trainloader):
        
        # Clear gradients
        optimiser.zero_grad()
        
        # Forward propagation
        outputs = cnn(images)
        
        # Calculate loss
        loss = criterion(outputs, labels)
        
        # Calculating gradients
        loss.backward()
        
        # Update parameters
        optimiser.step()
        
        count += 1
            
        # Store loss and iteration
        loss_list.append(loss.item())
        iteration_list.append(count)
        # accuracy_list.append(accuracy)
        if count % 12 == 0:
            # Print loss
            print('Iteration: {}  Loss: {} %'.format(count, loss.item()))

The output is:

Iteration: 12  Loss: 0.3102647066116333 %
Iteration: 24  Loss: 0.5342930555343628 %
Iteration: 36  Loss: 0.13416573405265808 %
Iteration: 48  Loss: 0.05583428218960762 %
Iteration: 60  Loss: 0.3329312801361084 %
Iteration: 72  Loss: 0.1474495381116867 %

I made some changes so that my len(trainloader) is now 82. However, the cell keeps running and running after Iteration: 72 is reached. I’m not too sure why. Is there some stopping logic I could use perhaps?

Thanks again

ptrblck · February 19, 2021, 6:24pm

It seems the cell is not running anymore, but hangs instead.
Are you seeing the same issue in a terminal, if you execute the script directly without a notebook?

ziad · February 19, 2021, 9:32pm

Hi @ptrblck

You’re right in that it just hangs after it finishes executing.

I’m not too sure how to go about running this in a terminal, as I’ve only worked in notebooks. Interestingly enough, the same thing happens if I use Skorch’s NeuralNetClassifier.

Lastly, could this be related in any way to how I defined the CNN or even to how I’ve pre-processed the images?

ptrblck · February 19, 2021, 11:04pm

It sounds rather like a multiprocessing issue (or code parts are executed after the training, which are not shown in the posted code).
You should be able to export the notebook as a Python script file and run it in a terminal.
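One way to test the multiprocessing hypothesis, assuming the DataLoader was created with `num_workers` greater than 0 (the thread does not show this), is to set `num_workers=0` so that batches are loaded in the main process. The data below is a placeholder, and `main` is just an illustrative name. A notebook can be exported with `jupyter nbconvert --to script` followed by the notebook file name, and the resulting script run with `python`.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

def main():
    # Placeholder data (assumption): 32 samples, 2 classes.
    dataset = TensorDataset(torch.randn(32, 3), torch.randint(0, 2, (32,)))
    # num_workers=0 loads batches in the main process; no worker processes are started.
    trainloader = DataLoader(dataset, batch_size=8, num_workers=0)

    n_batches = 0
    for images, labels in trainloader:
        n_batches += 1

    print("finished after {} batches".format(n_batches))
    return n_batches

# Guarding the entry point matters once worker processes are used in a script,
# especially on platforms that spawn new processes (e.g. Windows).
if __name__ == "__main__":
    main()
```

If the loop finishes with `num_workers=0` but hangs with workers enabled, that points to the data loading workers rather than the model or the preprocessing.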