Different behaviour while training and evaluating a model on the same input (without Batch Norm layer)

I am trying to train a very simple network on the CIFAR10 dataset:

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

class Net(nn.Module):
    def __init__(self, filters):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(3, filters, 5, padding=2)
        self.fc1 = nn.Linear(filters * 16 * 16, 10)

    def forward(self, x):
        x = F.max_pool2d(F.relu(self.conv1(x)), (2, 2))  # 32x32 -> 16x16
        x = x.view(-1, self.num_flat_features(x))
        x = self.fc1(x)
        return x

    def num_flat_features(self, x):
        size = x.size()[1:]  # all dimensions except the batch dimension
        num_features = 1
        for s in size:
            num_features *= s
        return num_features
def train_random_model(filters, learning_rate, epochs, modelname, dataloader):
    net = Net(filters).cuda()
    net.train()
    criterion = nn.CrossEntropyLoss().cuda()
    optimizer = optim.SGD(net.parameters(), lr=learning_rate, momentum=0)
    for epoch in range(epochs):
        running_loss = 0.0
        correct = 0
        total = 0
        for key, value in dataloader.items():
            inputs1, labels1 = value
            inputs, labels = inputs1.cuda(), labels1.cuda()
            optimizer.zero_grad()
            outputs = net(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()
            running_loss += loss.item()
            # predictions here come from the weights *before* optimizer.step()
            _, predicted = torch.max(outputs.data, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()
        trainerror = 100.0 * (total - correct) / total
        print('Finished Training', epoch, "Train Error", trainerror)

        if trainerror <= 1.0:
            break
    print("Saving model at", modelname)
    torch.save(net, modelname)
    print("Sanity Check")
    correct = 0
    total = 0
    net.eval()
    with torch.no_grad():
        for key, value in dataloader.items():
            inputs1, labels1 = value
            inputs, labels = inputs1.cuda(), labels1.cuda()
            outputs = net(inputs)
            _, predicted = torch.max(outputs.data, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()
    trainerror = 100.0 * (total - correct) / total
    print("Train Error", trainerror, total, correct)

Unfortunately, the final train error printed during training is very different from the train error computed during evaluation.
Do you have any idea what might be causing this?
I understand Batch Normalization typically causes this kind of discrepancy, but I am not using that layer.
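(For reference, here is a minimal sketch of the BatchNorm behaviour being ruled out above: a BatchNorm2d layer normalises with batch statistics in train() mode and with accumulated running statistics in eval() mode, so its output differs between the two modes. The layer width and tensor shape below are arbitrary, chosen only for illustration.)

bn = nn.BatchNorm2d(3)
x = torch.randn(4, 3, 8, 8)

bn.train()
out_train = bn(x)   # normalised with the current batch statistics

bn.eval()
out_eval = bn(x)    # normalised with the running statistics

print(torch.allclose(out_train, out_eval))  # usually False

The model in the question has no such layer, so this effect cannot explain the gap.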

How many epochs did you train the network for?

I think this is just overfitting. Normally you should use only the validation error and never look at the training error. When your validation error stops improving, you stop training (that’s called early stopping).

The only time you look at the training error is at the very beginning, to make sure that your network is able to overfit your dataset (i.e. learn to represent the dataset completely).
If it does not overfit, change your network or preprocess your data, because your network isn’t able to learn anything.
If your network is able to overfit, great, now you can focus on having it learn a general representation. You discard that training error and only focus on the validation error, which corresponds to unseen data.
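As a rough sketch of what early stopping on the validation error looks like in practice (the train_one_epoch and evaluate helpers, max_epochs, and the patience value are placeholders, not code from this thread):

best_val_error = float('inf')
patience, bad_epochs = 5, 0

for epoch in range(max_epochs):
    train_one_epoch(net, train_loader, optimizer, criterion)
    val_error = evaluate(net, val_loader)        # error on unseen data
    if val_error < best_val_error:
        best_val_error, bad_epochs = val_error, 0
        torch.save(net.state_dict(), 'best_model.pt')   # keep the best model so far
    else:
        bad_epochs += 1
        if bad_epochs >= patience:               # no improvement for `patience` epochs
            break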

Yeah, I was able to identify the issue: the network’s training error was computed before each gradient update was done, whereas I was evaluating its performance after the updates.

I don’t quite understand what you mean by overfitting. I am feeding the network the training images and labels of the CIFAR10 dataset, and in the next section I am just trying to evaluate the train accuracy manually by doing a forward pass on the learned network.

I thought you were passing new, unseen data to the network. It is expected to have a (sometimes big) difference between training and validation error.

However, if you are passing the same data, you should get almost the same score.

Was this problem resolved? I ran into the same problem.

The model is constantly being updated at each SGD step, so the running train error accumulated during an epoch will not equal the train error of the final model at the end of that epoch.
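Here is a sketch of that bookkeeping difference (assuming batches is simply an iterable of (inputs, labels) pairs, and net, criterion, optimizer are as in the question): the running error averages predictions made by many intermediate versions of the weights, while the final check scores every batch with the same, fully trained weights.

# Running error: each batch is scored by the weights *before* that batch's update,
# so the numbers come from many different intermediate models.
running_correct, running_total = 0, 0
for inputs, labels in batches:
    outputs = net(inputs)                        # prediction with current (pre-update) weights
    running_correct += (outputs.argmax(1) == labels).sum().item()
    running_total += labels.size(0)
    loss = criterion(outputs, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                             # weights change here
running_error = 100.0 * (running_total - running_correct) / running_total

# Final error: every batch is scored by the same, final weights.
net.eval()
final_correct, final_total = 0, 0
with torch.no_grad():
    for inputs, labels in batches:
        outputs = net(inputs)
        final_correct += (outputs.argmax(1) == labels).sum().item()
        final_total += labels.size(0)
final_error = 100.0 * (final_total - final_correct) / final_total

# running_error and final_error generally differ, especially early in training,
# even though both are computed on the same data.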