Converted Keras model to PyTorch: starts out well, then accuracy stops increasing and loss doesn't change much

Hi guys, PyTorch newbie here :smile:

I have translated one of my models from TF Keras to PyTorch, and the model matches exactly. I am using the same custom data augmentation that I have used with my Keras model for a number of years.

I have checked the shape of x after each layer in forward and they are correct; they match the original model. At first the model seems to do quite well: loss steadily decreases and accuracy slowly increases. Then it hits a roadblock: accuracy stops increasing, and eventually the loss barely changes.

Can anyone suggest what is happening? This model in Keras does extremely well on the augmented dataset.

Below is the network architecture and the output from training.

Net(
  (pad1): ZeroPad2d(padding=(2, 2, 2, 2), value=0.0)
  (conv1): Conv2d(3, 30, kernel_size=(5, 5), stride=(1, 1))
  (pad2): ZeroPad2d(padding=(2, 2, 2, 2), value=0.0)
  (conv2): Conv2d(30, 30, kernel_size=(5, 5), stride=(1, 1))
  (pool3): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  (flatten): Flatten(start_dim=1, end_dim=-1)
  (fc1): Linear(in_features=75000, out_features=128, bias=True)
  (relu): ReLU()
  (softmax): Softmax(dim=None)
)

Output:

Train Loss: 4.850 | Accuracy: 21.895
Train Loss: 4.847 | Accuracy: 29.684
Train Loss: 4.840 | Accuracy: 29.684
Train Loss: 4.830 | Accuracy: 29.895
Train Loss: 4.814 | Accuracy: 30.526
Train Loss: 4.787 | Accuracy: 30.947
Train Loss: 4.742 | Accuracy: 32.000
Train Loss: 4.677 | Accuracy: 34.737
Train Loss: 4.590 | Accuracy: 64.632
Train Loss: 4.494 | Accuracy: 70.316
Train Loss: 4.398 | Accuracy: 70.316
Train Loss: 4.301 | Accuracy: 70.316
Train Loss: 4.231 | Accuracy: 70.316
Train Loss: 4.199 | Accuracy: 70.316
Train Loss: 4.184 | Accuracy: 70.316
Train Loss: 4.178 | Accuracy: 70.316
Train Loss: 4.167 | Accuracy: 70.316
Train Loss: 4.166 | Accuracy: 70.316
Train Loss: 4.166 | Accuracy: 70.316
Train Loss: 4.161 | Accuracy: 70.316
Train Loss: 4.167 | Accuracy: 70.316
Train Loss: 4.161 | Accuracy: 70.316
Train Loss: 4.163 | Accuracy: 70.316
Train Loss: 4.164 | Accuracy: 70.316
Train Loss: 4.164 | Accuracy: 70.316
Train Loss: 4.161 | Accuracy: 70.316
Train Loss: 4.163 | Accuracy: 70.316
Train Loss: 4.165 | Accuracy: 70.316
Train Loss: 4.159 | Accuracy: 70.316
Train Loss: 4.165 | Accuracy: 70.316
Train Loss: 4.167 | Accuracy: 70.316
Train Loss: 4.168 | Accuracy: 70.316
Train Loss: 4.163 | Accuracy: 70.316
Train Loss: 4.170 | Accuracy: 70.316
Train Loss: 4.163 | Accuracy: 70.316
Train Loss: 4.160 | Accuracy: 70.316
Train Loss: 4.162 | Accuracy: 70.316
Train Loss: 4.162 | Accuracy: 70.316
Train Loss: 4.166 | Accuracy: 70.316
Train Loss: 4.161 | Accuracy: 70.316
Train Loss: 4.160 | Accuracy: 70.316
Train Loss: 4.165 | Accuracy: 70.316
Train Loss: 4.165 | Accuracy: 70.316
Train Loss: 4.168 | Accuracy: 70.316
Train Loss: 4.161 | Accuracy: 70.316
Train Loss: 4.165 | Accuracy: 70.316
Train Loss: 4.164 | Accuracy: 70.316
Train Loss: 4.161 | Accuracy: 70.316
Train Loss: 4.161 | Accuracy: 70.316
Train Loss: 4.165 | Accuracy: 70.316
Train Loss: 4.165 | Accuracy: 70.316
Train Loss: 4.164 | Accuracy: 70.316
Train Loss: 4.164 | Accuracy: 70.316
Train Loss: 4.160 | Accuracy: 70.316
Train Loss: 4.157 | Accuracy: 70.316
Train Loss: 4.158 | Accuracy: 70.316
Train Loss: 4.170 | Accuracy: 70.316
Train Loss: 4.159 | Accuracy: 70.316
Train Loss: 4.159 | Accuracy: 70.316
Train Loss: 4.165 | Accuracy: 70.316
Train Loss: 4.164 | Accuracy: 70.316

Thanks in advance.

In addition to the shapes and loss function, is there any difference in the learning rate schedule and other hyperparameters (e.g., batch size) between the setups?

Thank you for the quick response. I am actually just checking that now, please give me a few moments.

OK, it must be the batch size. In Keras I do not specify the batch size, it is handled dynamically; if I remove the batch size in PyTorch it is a lot worse:

Train Loss: 4.564 | Accuracy: 30.105
Train Loss: 4.564 | Accuracy: 30.105
Train Loss: 4.564 | Accuracy: 30.105
Train Loss: 4.564 | Accuracy: 30.105

So in TF, when no batch size is provided it defaults to 32. I am now testing with a batch size of 32 in PyTorch, with all other hyperparameters the same; let's see how it goes.

Not good:

Train Loss: 4.372 | Accuracy: 54.737
Train Loss: 4.167 | Accuracy: 69.895
Train Loss: 4.167 | Accuracy: 69.895
Train Loss: 4.166 | Accuracy: 69.895
Train Loss: 4.167 | Accuracy: 69.895
Train Loss: 4.168 | Accuracy: 69.895
Train Loss: 4.167 | Accuracy: 69.895
Train Loss: 4.167 | Accuracy: 69.895

Everything is exactly the same now and it has made it worse.

batch_size = 32
learning_rate = 1e-4
decay = 1e-6
seed = 2

nn.CrossEntropyLoss()
optim.Adam(net.parameters(), lr=learning_rate, weight_decay=decay)

Train Loss: 4.848 | Accuracy: 5.459
Train Loss: 4.806 | Accuracy: 31.017
Train Loss: 4.659 | Accuracy: 29.777
Train Loss: 4.401 | Accuracy: 70.471
Train Loss: 4.188 | Accuracy: 70.471
Train Loss: 4.149 | Accuracy: 70.471
Train Loss: 4.159 | Accuracy: 70.471
Train Loss: 4.164 | Accuracy: 70.471
Train Loss: 4.164 | Accuracy: 70.471

Very strange, I am getting different results every time. I can no longer replicate the first results I shared, and those were the best the model achieved. I have seeded, tested different batch sizes, different learning rates, and different decays, and the model is still doing really badly.
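
For reference, the seeding is along these lines; this is only a sketch of the idea, not the exact code, and assumes a single-GPU setup:

import random

import numpy as np
import torch


def seed_everything(seed=2):
    # Seed Python, NumPy and PyTorch RNGs so runs are comparable
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Make cuDNN deterministic (slower, but removes one source of run-to-run noise)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False


seed_everything(seed)  # seed = 2 from the hyperparameters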

It seems to always get to around 70%, where it levels off.

Train Loss: 4.847 | Accuracy: 26.551
Train Loss: 4.819 | Accuracy: 62.779
Train Loss: 4.692 | Accuracy: 70.471
Train Loss: 4.393 | Accuracy: 70.471
Train Loss: 4.197 | Accuracy: 70.471
Train Loss: 4.173 | Accuracy: 70.471

Hello there, would anyone have any suggestions on this issue?

You might want to check if the predictions of the model make sense at the end, or if they seem to be stuck in an unexpected range of values. (e.g., some basic sanity checks would be that all classes are represented, different inputs cause different predictions, etc.)
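
For example, something along these lines (just a rough sketch; num_classes and the loader name are placeholders for whatever your setup uses):

net.eval()
with torch.no_grad():
    counts = torch.zeros(num_classes, dtype=torch.long)
    for inputs, labels in validation_loader:
        preds = net(inputs).argmax(dim=1)  # predicted class per sample
        counts += torch.bincount(preds, minlength=num_classes)
print(counts)  # if a single class dominates, the model has likely collapsed to a constant prediction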

Thanks for the reply. So leave it to complete training despite the accuracy not increasing, and then test it on unseen data?

If you haven’t tested it on unseen data yet, you might also consider the sanity check of testing that it can overfit (reach 100% accuracy) on a very small training set (e.g., a single batch).
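
For instance, something like this, reusing the names from your snippets (only a sketch):

# Take a single batch and check the model can drive accuracy to ~100% on it.
inputs, labels = next(iter(train_loader))

for step in range(500):
    optimizer.zero_grad()
    outputs = net(inputs)
    loss = loss_fn(outputs, labels.long())
    loss.backward()
    optimizer.step()

    if step % 50 == 0:
        acc = (outputs.argmax(dim=1) == labels).float().mean().item()
        print(f"step {step}: loss {loss.item():.4f} | acc {acc * 100:.1f}%")

# If this cannot reach ~100% accuracy, the problem is in the model/loss/training
# code rather than in the hyperparameters.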

Thanks, I haven’t been able to resolve it. I have switched back to TensorFlow as this project is time sensitive and I can train this model on any device and architecture with TF. Thanks for the help.

I am refusing to admit defeat as I really want to use PyTorch for this project. I have tested this very same model using TensorFlow on 4 different machines and I get exactly the same results: it always performs well, with accuracy of around 93-98%, and on unseen data it generally only gets 1-3 false classifications.

Could you see if there is anything wrong with my code please, as no matter what I try I always get these issues.

import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        
        self.pad1 = nn.ZeroPad2d(padding=2)
        self.conv1 = nn.Conv2d(3, 30, kernel_size=5, stride=1)
        
        self.pad2 = nn.ZeroPad2d(padding=2)
        self.conv2 = nn.Conv2d(30, 30, kernel_size=5, stride=1)
        
        self.pool3 = nn.MaxPool2d(kernel_size=2, stride=2, padding=0)
        
        self.flatten = nn.Flatten()
        
        self.fc1 = nn.Linear(75000, 128)
        
        self.relu = nn.ReLU()
        self.softmax = nn.Softmax()

    def forward(self, x):
        x = self.pad1(x)
        x = self.relu(self.conv1(x))
        
        x = self.pad2(x)
        x = self.relu(self.conv2(x))
        
        x = self.pool3(x)
        x = self.flatten(x)
        x = self.fc1(x)
        
        x = self.softmax(x)
        return x


net = Net()
print(net)
import torch
from torch.utils.data import DataLoader

train_size = int(split * len(dataset))
test_size = len(dataset) - train_size

train_set, validation_set = torch.utils.data.random_split(dataset, [train_size, test_size])

print(len(train_set))
print(len(validation_set))

train_loader = DataLoader(dataset=train_set, shuffle=shuffle, batch_size=batch_size,
                          num_workers=num_workers, pin_memory=pin_memory)

validation_loader = DataLoader(dataset=validation_set, shuffle=shuffle, batch_size=batch_size,
                               num_workers=num_workers, pin_memory=pin_memory)

import torch.optim as optim

loss_fn = nn.CrossEntropyLoss()
optimizer = optim.Adam(net.parameters(), lr=learning_rate, weight_decay=decay)
train_accu = []
train_losses = []

for epoch in range(epochs): 

    net.train()
    
    running_loss=0
    correct=0
    total=0
    
    for i, data in enumerate(train_loader, 0):
        # get the inputs; data is a list of [inputs, labels]
        inputs, labels = data

        # zero the parameter gradients
        optimizer.zero_grad()

        # forward + backward + optimize
        outputs = net(inputs)
        loss = loss_fn(outputs, labels.long())
        loss.backward()
        optimizer.step()
 
        running_loss += loss.item()

        _, predicted = outputs.max(1)
        total += labels.size(0)
        correct += predicted.eq(labels).sum().item()
       
    train_loss=running_loss/len(train_loader)
    accu=100.*correct/total

    train_accu.append(accu)
    train_losses.append(train_loss)
    print('Train Loss: %.3f | Accuracy: %.3f'%(train_loss,accu))

print('Finished Training')

Results:

Train Loss: 4.847 | Accuracy: 26.551
Train Loss: 4.819 | Accuracy: 62.779
Train Loss: 4.692 | Accuracy: 70.471
Train Loss: 4.393 | Accuracy: 70.471
Train Loss: 4.197 | Accuracy: 70.471
Train Loss: 4.173 | Accuracy: 70.471
Train Loss: 4.161 | Accuracy: 70.471
Train Loss: 4.157 | Accuracy: 70.471
Train Loss: 4.159 | Accuracy: 70.471
Train Loss: 4.158 | Accuracy: 70.471
Train Loss: 4.156 | Accuracy: 70.471
Train Loss: 4.156 | Accuracy: 70.471
Train Loss: 4.164 | Accuracy: 70.471
Train Loss: 4.164 | Accuracy: 70.471
Train Loss: 4.157 | Accuracy: 70.471
Train Loss: 4.163 | Accuracy: 70.471
Train Loss: 4.157 | Accuracy: 70.471
Train Loss: 4.157 | Accuracy: 70.471
Train Loss: 4.157 | Accuracy: 70.471
Train Loss: 4.167 | Accuracy: 70.471
Train Loss: 4.157 | Accuracy: 70.471
Train Loss: 4.157 | Accuracy: 70.471
Train Loss: 4.157 | Accuracy: 70.471
Train Loss: 4.160 | Accuracy: 70.471
Train Loss: 4.160 | Accuracy: 70.471
Train Loss: 4.158 | Accuracy: 70.471
Train Loss: 4.160 | Accuracy: 70.471
Train Loss: 4.163 | Accuracy: 70.471
Train Loss: 4.158 | Accuracy: 70.471
Train Loss: 4.160 | Accuracy: 70.471
Train Loss: 4.161 | Accuracy: 70.471
Train Loss: 4.161 | Accuracy: 70.471
Train Loss: 4.165 | Accuracy: 70.471
Train Loss: 4.165 | Accuracy: 70.471
Train Loss: 4.170 | Accuracy: 70.471
Train Loss: 4.156 | Accuracy: 70.471
Train Loss: 4.156 | Accuracy: 70.471
import os

import pandas as pd
import torch
from PIL import Image
from torch.utils.data import Dataset


class AllIdbDataset(Dataset):
    """
    PyTorch dataset class.

    Args:
        data_dir (string): path to the training image folder
        csv_path (string): path to the CSV with image IDs and labels
        transform: PyTorch transforms
    """

    def __init__(self, data_dir, csv_path, transform=None):
        self.data_dir = data_dir
        self.data = pd.read_csv(csv_path)
        self.transform = transform

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        img_id = self.data.iloc[index, 0]
        img = Image.open(os.path.join(self.data_dir, img_id)).convert("RGB")
        label = torch.tensor(float(self.data.iloc[index, 1]))

        if self.transform is not None:
            img = self.transform(img)

        return (img, label)

It must be some noob mistake on my part; there is no way this classifier is doing this badly without a mistake I made myself.

I am not using transforms except for converting to tensor, but I am using custom data augmentation that, again, has been used in my projects for years with no issues.
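
For completeness, the dataset is created roughly like this (the paths here are just placeholders):

from torchvision import transforms

dataset = AllIdbDataset(data_dir="path/to/images",
                        csv_path="path/to/labels.csv",
                        transform=transforms.ToTensor())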

Maybe you can see something I am missing.

Sorry, forgot the hyperparameters:

batch_size = 32
decay = 1e-6
learning_rate = 0.00001
epochs = 100
num_workers = 1
pin_memory = True
rotations = 10
shuffle = True
split = 0.255

Have you tried the previous suggestion of starting with a very small training set and seeing if your model can perfectly overfit it? That should help narrow down the issue (e.g., if there is a correctness issue in the training pipeline it would prevent the model from training properly at this stage, vs. a hyperparameter issue which would likely still allow the model to overfit a small training set).

In addition to the other suggestions, remove the self.softmax operation, since nn.CrossEntropyLoss expects raw logits and applies F.log_softmax internally.
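
In other words, the forward pass would end at the linear layer; a sketch of that change, keeping the rest of the model as posted:

def forward(self, x):
    x = self.pad1(x)
    x = self.relu(self.conv1(x))

    x = self.pad2(x)
    x = self.relu(self.conv2(x))

    x = self.pool3(x)
    x = self.flatten(x)
    x = self.fc1(x)
    # Return raw logits: nn.CrossEntropyLoss applies log_softmax internally.
    # If probabilities are needed at inference time, apply F.softmax(x, dim=1)
    # outside of the loss computation.
    return x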