Why do my train and valid loss keep growing as the epochs go on?

With each epoch my train loss increases, and I don't know where the error in the training code is. Does anyone have any ideas?

for e in range(epochs):
    # keep track of training and validation loss
    train_loss = 0.0
    valid_loss = 0.0
    running_loss = 0.0
    running_corrects = 0.0

    ###################
    # TRAIN THE MODEL #
    ###################
    model.train()
    cont = 0
    for inputs, label in dataloaders['train']:
        # if GPU is available
        if train_on_gpu:
            inputs, label = inputs.cuda(), label.cuda()

        optimizer.zero_grad()

        with torch.set_grad_enabled(True):
            logps = model(inputs)
            _, preds = torch.max(logps, 1)  # new validation technique
            loss = criterion(logps, label)
            loss.backward()
            optimizer.step()

        running_loss += loss.item()
        print("running_loss = %f , iteration = %i " % (running_loss, cont))
        running_corrects += torch.sum(preds == label.data)
        RL_vector.append(running_loss)
        cont += 1

    ###################
    # VALID THE MODEL #
    ###################
    model.eval()
    for inputs, label in dataloaders['valid']:
        # if GPU is available
        if train_on_gpu:
            inputs, label = inputs.cuda(), label.cuda()

        with torch.no_grad():
            logps = model(inputs)
            _, preds = torch.max(logps, 1)
            loss = criterion(logps, label)

        # update running validation loss
        valid_loss += loss.item()
        VL_vector.append(valid_loss)

    # calculate average losses
    epoch_loss_train = running_loss / dataset_sizes['train']
    epoch_acc_train = running_corrects.double() / dataset_sizes['train']
    epoch_loss_valid = valid_loss / dataset_sizes['valid']

    print('{} Loss: {:.4f} \tAcc: {:.4f}'.format('train', epoch_loss_train, epoch_acc_train))
    print('{} \tLoss: {:.4f}'.format('valid', epoch_loss_valid))
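Note on the averaging (assuming criterion uses the default reduction='mean'): loss.item() is already the mean over the batch, so dividing the accumulated sum by dataset_sizes only yields a per-sample average if each batch is weighted by its size, e.g.:

# weight each batch by its size before dividing by the dataset size
running_loss += loss.item() * inputs.size(0)
valid_loss += loss.item() * inputs.size(0)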

[screenshot: training and validation loss curves]

Both losses are decreasing, which is generally fine.
Which criterion are you using, and what kind of use case are you currently working on? The range looks a bit different.
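For reference, the usual pairings in PyTorch are: nn.CrossEntropyLoss expects raw logits, while nn.NLLLoss expects log-probabilities (e.g. the output of nn.LogSoftmax, which the name logps suggests). A minimal sketch with dummy tensors:

import torch
import torch.nn as nn

logits = torch.randn(4, 10)            # dummy batch of 4 samples, 10 classes
labels = torch.randint(0, 10, (4,))

# Option 1: model returns raw logits -> CrossEntropyLoss
loss1 = nn.CrossEntropyLoss()(logits, labels)

# Option 2: model ends with LogSoftmax -> NLLLoss on log-probabilities
logps = torch.log_softmax(logits, dim=1)
loss2 = nn.NLLLoss()(logps, labels)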


I fixed the problem. I had set model.fc = classifier in my models.densenet121, and I think this was the source of the error. But I don't know why. What is the difference between model.fc and model.classifier?

model.fc and model.classifier are just the internal names for some submodules.
If you print the model, you’ll find the name of the last layer, which you could replace with a custom one for your use case:

from torchvision import models

model = models.densenet121()
print(model)

In this example, densenet121 uses the attribute name classifier for the last nn.Linear layer, so you should use this attribute name.
If you just assign a custom linear layer to model.fc, it won't be used or trained unless you also change the forward method.
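For example, a minimal sketch of swapping the head on a densenet121 (the number of classes is just a placeholder for your dataset):

import torch.nn as nn
from torchvision import models

num_classes = 10  # placeholder: set this to your number of classes

# DenseNet stores its last layer in the attribute `classifier`
model = models.densenet121()
model.classifier = nn.Linear(model.classifier.in_features, num_classes)

# For comparison, ResNet stores its last layer in the attribute `fc`:
# resnet = models.resnet18()
# resnet.fc = nn.Linear(resnet.fc.in_features, num_classes)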
