Behaviour in training and inference quite different

Code for the model:

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.nn import TransformerDecoderLayer

class ratingModel(nn.Module):
    def __init__(self, nhead, dim_model, dim_ff):
        super(ratingModel, self).__init__()
        self.num_head = nhead
        self.dim_model = dim_model
        self.dim_feedforward = dim_ff
        self.decoder_layer = TransformerDecoderLayer(self.dim_model, self.num_head, self.dim_feedforward)
        self.linear_layer = nn.Linear(self.dim_model, 2)

    def forward(self, query_em, item_em):
        # tgt and memory are expected as (seq_len, batch, dim_model) in PyTorch 1.9
        dec_out = self.decoder_layer(tgt=query_em, memory=item_em)
        dec_out = dec_out.squeeze(0)         # (1, batch, dim_model) -> (batch, dim_model)
        ll_out = self.linear_layer(dec_out)  # (batch, 2)
        x = F.log_softmax(ll_out, dim=1)     # explicit dim; the implicit default is deprecated
        return x
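
For reference, PyTorch 1.9's TransformerDecoderLayer defaults to the (seq_len, batch, dim_model) layout for tgt and memory, which is why the training loop below unsqueezes a leading dimension. A minimal shape check of that assumption (a standalone sketch with dummy values, using the constructor arguments that appear later in this thread):

model = ratingModel(nhead=8, dim_model=384, dim_ff=2048)
query_em = torch.randn(1, 4, 384)  # (seq_len=1, batch=4, dim_model=384)
item_em = torch.randn(1, 4, 384)
out = model(query_em, item_em)
print(out.shape)             # torch.Size([4, 2]) -- log-probabilities per sample
print(out.exp().sum(dim=1))  # each row sums to ~1, as expected for log_softmax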

The code for the training loop looks like:

EPOCHS = 1
train_step = len(train_dataloader)
val_step = len(val_dataloader)
for epoch in range(EPOCHS):
    epoch_loss = 0
    correct = 0
    total = 0
    predictions = []
    cnt = 0
    for query_em, item_em, label in train_dataloader:
        cnt += 1
        query_em, item_em, label = query_em.to(device), item_em.to(device), label.to(device)
        query_em = query_em.unsqueeze(0)  # (batch, dim) -> (1, batch, dim) to match the layer's (T, N, E) layout
        item_em = item_em.unsqueeze(0)
        output = model(query_em, item_em)
        loss = criterion(output, label)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        epoch_loss += loss.item()
        _, pred = torch.max(output, dim=1)
        predictions.append(pred)
        correct += torch.sum(pred == label).item()
        total += label.size(0)
        if cnt % 1000 == 0:
            print(f'training loss: {epoch_loss/cnt}, training_acc:{correct/total}')
    print(f'training loss: {epoch_loss/train_step}, training_acc:{correct/total}')
    model.eval()
    with torch.no_grad():
        correct = 0
        total = 0
        epoch_loss = 0
        for query_em, item_em, label in val_dataloader:
            query_em, item_em, label = query_em.to(device), item_em.to(device), label.to(device)
            query_em = query_em.unsqueeze(0)
            item_em = item_em.unsqueeze(0)
            output = model(query_em, item_em)
            loss = criterion(output, label)
            epoch_loss += loss.item()
            _, pred = torch.max(output, dim=1)
            correct += torch.sum(pred == label).item()
            total += label.size(0)
    print(f'validation loss: {epoch_loss/val_step}, validation acc:{correct/total}')
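
Incidentally, the validation block above and the inference code further down repeat the same logic; a shared helper (a sketch, name hypothetical) would make it easier to rule out copy/paste divergence between them:

def evaluate(model, dataloader, criterion, device):
    """One pass over dataloader in eval mode; returns (avg_loss, accuracy)."""
    model.eval()
    correct, total, total_loss = 0, 0, 0.0
    with torch.no_grad():
        for query_em, item_em, label in dataloader:
            query_em, item_em, label = query_em.to(device), item_em.to(device), label.to(device)
            output = model(query_em.unsqueeze(0), item_em.unsqueeze(0))
            total_loss += criterion(output, label).item()
            _, pred = torch.max(output, dim=1)
            correct += torch.sum(pred == label).item()
            total += label.size(0)
    return total_loss / len(dataloader), correct / total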

The output of the training loop is quite promising, as can be seen in the figure below:

However, when I run the inference code below on the same training data:

    predictions_later = []
    labels = []
    model.eval()
    with torch.no_grad():
        correct = 0
        total = 0
        epoch_loss = 0
        for query_em, item_em, label in train_dataloader:
            query_em, item_em, label = query_em.to(device), item_em.to(device), label.to(device)
            query_em = query_em.unsqueeze(0)
            item_em = item_em.unsqueeze(0)
            output = model(query_em, item_em)
            loss = criterion(output, label)
            epoch_loss += loss.item()
            _, pred = torch.max(output, dim=1)
            correct += torch.sum(pred == label).item()
            total += label.size(0)
            predictions_later.append(pred)
            labels.append(label)
    # note: the divisor below is still val_step although this loop runs over
    # train_dataloader, so the printed average loss is off by that ratio
    print(f'validation loss: {epoch_loss/val_step}, validation acc:{correct/total}')

The result is a much lower accuracy, and moreover the model is just predicting class 0 for every entry:
validation loss: 138.1207146283448, validation acc:0.20694013824686563

I am new to PyTorch, so I might be missing something obvious here. I am using PyTorch 1.9.0 with CUDA 11.1 and training the model on a GPU.

Based on the posted code snippets, the first training and validation loops are identical to the "inference" code, except that the inference code uses the train_dataloader. It also seems you are not calling model.train() in your training loop, but based on your model definition this might not be necessary if the used layers do not change their behavior between training and validation.

If my assumption is correct, I would try to reduce the use case a bit and check the predictions with a single sample to make sure the original training predictions match the inference code. My guess would be that the model loading might have failed in your inference code, so that you are using an untrained model.
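
To make that check concrete, a sketch along these lines (assuming the model, device, and train_dataloader from the post above) fixes one batch and verifies that two eval-mode forward passes agree:

# pull one fixed batch; note: with shuffle=True this differs between calls
query_em, item_em, label = next(iter(train_dataloader))
query_em = query_em.to(device).unsqueeze(0)
item_em = item_em.to(device).unsqueeze(0)

model.eval()
with torch.no_grad():
    out_a = model(query_em, item_em)  # stand-in for the "training loop" prediction
    out_b = model(query_em, item_em)  # stand-in for the "inference loop" prediction
print(torch.equal(out_a, out_b))      # should print True: same weights, mode, and input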

Thanks for your reply.

  1. As the TransformerDecoderLayer has dropout within it, I trained my model with model.train(), but the result was the same (see the quick train/eval check after the snippet below).
  2. Your suspicion is correct: it is giving different results on the same training data. To test exactly that, I am running inference with the train_dataloader.
  3. I don't understand why model loading would be failing. I am calling the training and inference code in the same Jupyter notebook. I also tried (after your comment) saving the model after training (in the same Jupyter cell in which the training happens) and then loading it later for inference, but even that isn't working:
torch.save(model.state_dict(), './dl_class.pth')
model = ratingModel(nhead=8, dim_model=384, dim_ff=2048)
model.load_state_dict(torch.load('./dl_class.pth'))
model.to(device)   # keep the reloaded model on the same device as the data
model.eval()       # and make sure dropout is disabled for inference
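
To see the dropout point from item 1 directly, a quick train/eval comparison with any fixed query_em / item_em batch on the right device (e.g. pulled as in the earlier sketch):

model.train()
with torch.no_grad():
    a = model(query_em, item_em)
    b = model(query_em, item_em)
print(torch.equal(a, b))  # typically False: dropout is stochastic in train mode

model.eval()
with torch.no_grad():
    c = model(query_em, item_em)
    d = model(query_em, item_em)
print(torch.equal(c, d))  # True: dropout is disabled in eval mode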

One more piece of update: I compared the weight matrices (only one of the layers, but that should be enough to test equality) of the model at the end of training and at the start of the inference code, and they are exactly equal.
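
For completeness, the full-model version of that check (a sketch) compares every entry of the state_dict rather than a single layer:

# snapshot taken right after training (clone to decouple from the live parameters)
trained_state = {k: v.clone() for k, v in model.state_dict().items()}

# ... later, after reloading the model for inference ...
for name, tensor in model.state_dict().items():
    if not torch.equal(trained_state[name].cpu(), tensor.cpu()):
        print(f'mismatch in {name}')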

I don't quite understand the issue, as the fact that the parameters are exactly equal doesn't explain what is creating the different outputs.
If I understand your use case correctly, you are comparing:

  • a training iteration using a fixed input in model.eval()
  • vs. the exact same setup, but this time in the "inference loop"?

Hi ptrblck,
Sorry if I wasn’t clear enough. But you have understood correctly. the only difference in “training iteration” and “inference loop” is I am not backpropagating losses and doing “optimizer.step()” in “inference loop”. Otherwise they are the same. First I am running training loop then inference loop. (loop just denotes looping over data once. same data (train_dataloader) is being used in both cases).