Large difference in loss on the same dataset

I have been working on a project, but it just isn't working out. The situation is that I fit the model and then test it on the same data, yet the difference in loss is huge… I have no idea why this is happening. I will provide as much information as I can. Thanks in advance for your patience!
Here is the whole picture of my train.py:

import torch
import torchvision.transforms as transforms
from os.path import join

transform = transforms.Compose([transforms.ToTensor(), transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225))])

# In order to have the same dataset, I pass the same data path and list file for train and test.
# (Dataloader here is my custom Dataset class, not torch.utils.data.DataLoader.)
training_set = Dataloader(root=args.root, list_file=args.files, transform=transform, input_size=600)
training_set_loader = torch.utils.data.DataLoader(training_set, batch_size=args.batch_size, shuffle=False, collate_fn=training_set.collate_fn)

test_set = Dataloader(root=args.root, list_file=args.files, transform=transform, input_size=600)
test_set_loader = torch.utils.data.DataLoader(test_set, batch_size=args.batch_size, shuffle=False, collate_fn=test_set.collate_fn)

# some model setup
model = Retina_Net(args.num_class)
weights = torch.load(join("weights", "retinet.pth"))
model.load_state_dict(weights)
model.cuda()
def train(epoch, mode=True, beta=0.99):
    model.train(mode)
    global training_iterations
    global avg_loss
    for num_batch, (inputs, loc_targets, cls_targets) in enumerate(training_set_loader):
        inputs = inputs.cuda()
        loc_targets = loc_targets.cuda()
        cls_targets = cls_targets.cuda()
        optimizer.zero_grad()
        # learning-rate / momentum schedules are indexed by the global iteration count
        optimizer.param_groups[0]["lr"] = lr_distr[training_iterations]
        optimizer.param_groups[0]["betas"] = (mom_distr[training_iterations].item(), 0.999)
        loc_preds, cls_preds = model(inputs)
        loss = loss_function(loc_preds, loc_targets, cls_preds, cls_targets)
        # back prop
        loss.backward()
        # update parameters
        optimizer.step()
        # exponential moving average of the loss, with bias correction
        avg_loss = loss.item() * (1 - beta) + avg_loss * beta
        smooth_loss = avg_loss / (1 - beta ** (training_iterations + 1))
        training_iterations += 1
def test(epoch, mode=False, beta=0.99):
    model.eval()
    global avg_test_loss
    global testing_iterations
    with torch.no_grad():
        for num_batch, (inputs, loc_targets, cls_targets) in enumerate(test_set_loader):
            inputs = Variable(inputs.cuda())
            loc_targets = Variable(loc_targets.cuda())
            cls_targets = Variable(cls_targets.cuda())
            loc_preds, cls_preds = model(inputs)
            loss = loss_function(loc_preds, loc_targets, cls_preds, cls_targets)
            # same exponential moving average / bias correction as in train()
            avg_test_loss = loss.item() * (1 - beta) + avg_test_loss * beta
            smooth_test_loss = avg_test_loss / (1 - beta ** (1 + testing_iterations))
            testing_iterations += 1

I take care of the batch normalization layers really carefully: I freeze the pre-trained ResNet50's batch normalization layers. To do so, I override the train() function, where in this case fpn is the ResNet50 backbone.

def train(self, mode=True):
    super().train(mode)
    for m in self.fpn.modules():
        if isinstance(m, nn.BatchNorm2d):
            # freeze the affine parameters and keep the layer in eval mode,
            # so its running mean/var are not updated either
            m.weight.requires_grad = False
            m.bias.requires_grad = False
            m.eval()
        elif isinstance(m, nn.Conv2d):
            m.weight.requires_grad = False
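As a sanity check (a small snippet I added just for this post, not part of the training script), after calling model.train(True) the backbone BN layers should report that they are in eval mode:

model.train(True)
backbone_bn_frozen = all(not m.training for m in model.fpn.modules() if isinstance(m, nn.BatchNorm2d))
print("backbone BN layers in eval mode:", backbone_bn_frozen)   # expected: True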

I train for 10 epochs as a little test… However, the result is just unexpectedly terrible:

epochs 1/10 [728/726 (100%)] loss: 0.2716
testing [728/726 (100%)] loss: 70.851
epochs 2/10 [728/726 (100%)] loss: 0.2685
testing [728/726 (100%)] loss: 389.855
epochs 3/10 [728/726 (100%)] loss: 0.1624
testing [728/726 (100%)] loss: 148.729
epochs 4/10 [728/726 (100%)] loss: 0.0772
testing [728/726 (100%)] loss: 305.008
epochs 5/10 [728/726 (100%)] loss: 0.0387
testing [728/726 (100%)] loss: 211.030
epochs 6/10 [728/726 (100%)] loss: 0.0283
testing [728/726 (100%)] loss: 174.605
epochs 7/10 [728/726 (100%)] loss: 0.0249
testing [728/726 (100%)] loss: 157.776
epochs 8/10 [728/726 (100%)] loss: 0.0234
testing [728/726 (100%)] loss: 148.920
epochs 9/10 [728/726 (100%)] loss: 0.0224
testing [728/726 (100%)] loss: 145.561
epochs 10/10 [728/726 (100%)] loss: 0.0217
testing [728/726 (100%)] loss: 144.784

Because the training loss is decreasing, I assume the model and the algorithm are correct. I suspect that there are some issues with the batch normalization layers… If there is any other information I should provide to work this out, let me know. This has been bothering me for a long time… Thanks in advance!

First, this is a very good sanity check to run!

I would double-check that this isn't a dataset issue. Can you try model.train(mode=True) in your test method? I suspect you'll get sensible results, but it's good to make sure before proceeding. Then we'll know it's an issue with calling .eval().
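For example (a rough sketch of the check I mean, reusing your own test() loop):

def test(epoch, beta=0.99):
    model.train(True)   # instead of model.eval(), just for this sanity check
    with torch.no_grad():   # gradients stay disabled
        for num_batch, (inputs, loc_targets, cls_targets) in enumerate(test_set_loader):
            ...   # rest of your loop unchanged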

I admit, this at first glance is very mysterious. You don’t happen to have dropout layers? They are known to exert a ‘variance shift’ that can affect batchnorm layers.

A small note: the Variable() syntax is deprecated; it's no longer necessary.
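For example, the batches in your test loop can simply be moved to the GPU directly (these are just the same three lines from your test(), minus the wrapper):

inputs = inputs.cuda()
loc_targets = loc_targets.cuda()
cls_targets = cls_targets.cuda()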

Sorry I’m not of much help right now, but I’ll think about it.

Thanks for your reply. My model doesn't have any dropout layers. I used model.train() at test time, and the result is still terrible… I also printed the data for both train and test, and they are exactly the same. It's so frustrating…
@albanD, sorry for bothering you again… Could you please help us out here? Thanks!

I think one of the reasons here is the batch normalization layers. Their momentum parameter is 0.01, but this example only runs 1820 iterations, so I'm not sure whether the running means of the BN layers (apart from the frozen pre-trained part) are estimated correctly…
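For reference, this is roughly how a BatchNorm2d layer updates its running statistics in train mode (a toy illustration with made-up batch statistics, just to show what momentum = 0.01 means; it is not part of my training code):

import torch

momentum = 0.01
running_mean = torch.zeros(1)                  # starts from the initial / pre-trained value
for t in range(1820):                          # roughly the number of iterations in this 10-epoch run
    batch_mean = 1.0 + 0.1 * torch.randn(1)    # made-up per-batch statistic around 1.0
    # update rule used by PyTorch: running = (1 - momentum) * running + momentum * batch_stat
    running_mean = (1 - momentum) * running_mean + momentum * batch_mean
print(running_mean)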

It does look unexpected.
One comment I would make: you most likely have biases in your convolutions? Did you forget to freeze them as well?
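A quick way to check (a hypothetical snippet, assuming the backbone is reachable as model.fpn like in your posted code):

for name, m in model.fpn.named_modules():
    if isinstance(m, nn.Conv2d) and m.bias is not None:
        print(name, "conv bias, requires_grad =", m.bias.requires_grad)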

So is your problem that your model is very severely overfitting to the training set?

@albanD @aauker
Thanks for all your replies. My pre-trained model is ResNet50, which I think has no biases in its convolutions.

So is your problem that your model is very severely overfitting to the training set?

Actually, I just wanted to check what the model's predictions look like on the test set. However, the result was just terrible. Since my training loss decreased beautifully, I decided to predict on images from the training set instead of the test set. And I found two possible issues, to see if you guys agree:

1. Like I mentioned above, since I have BN layers running during training, the estimate of the running mean & var with momentum = 0.01 is not accurate, and at test time those running statistics are what get used. When I set the momentum to 0.15 or even 0.2, the test results are actually better, in line with Table 1 in this paper: Super-Convergence: Very Fast Training of Neural Networks Using Large Learning Rates (look at the BN MAF). In my case, 10 epochs of this test run amount to fewer than 2000 iterations. (A small sketch of this momentum tweak follows the list below.)

2. I printed out the cls_loss and loc_loss for object detection, and I realized that my cls_loss is not decreasing… So I think there are some other issues in my code.
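Regarding point 1, this is the kind of momentum tweak I mean (a rough sketch; it assumes model.train(True) has already been called, so the frozen backbone BN layers are in eval mode and get skipped):

for m in model.modules():
    if isinstance(m, nn.BatchNorm2d) and m.training:   # skip the frozen backbone BN layers
        m.momentum = 0.15                              # was 0.01; 0.15 or even 0.2 gave better test results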

I guess I'll wrap this up here, haha… The repo that I tried to dig into turns out to have some problems of its own… Maybe next time I'll make sure the code works (and gives good results) before I really dig into it.

I used model.train() at test time, and the result is still terrible…

This is definitely problematic. It means that either your “test” (but really train) data is wrong, or there is some internal issue with how the loss is calculated.

I printed out the cls_loss and loc_loss for object detection, and I realized that my cls_loss is not decreasing… So I think there are some other issues in my code.

This certainly seems like a key to the problem.

I think one of the reasons here is the batch normalization layers. Their momentum parameter is 0.01, but this example only runs 1820 iterations, so I'm not sure whether the running means of the BN layers (apart from the frozen pre-trained part) are estimated correctly…

If you run with batchnorm layers in .eval() mode (which your code suggests you are), the running mean and variance are not updated. You can update only the mean and variance by running in .train() mode and freezing the gradients, which means the affine parameters are not updated.
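To make that concrete, here is a minimal sketch of that variant of the train() override (my own adaptation of the code you posted, untested):

def train(self, mode=True):
    super().train(mode)
    for m in self.fpn.modules():
        if isinstance(m, nn.BatchNorm2d):
            # the affine parameters stay frozen...
            m.weight.requires_grad = False
            m.bias.requires_grad = False
            # ...but the layer stays in train mode (no m.eval()), so running_mean /
            # running_var keep being updated from the batch statistics
        elif isinstance(m, nn.Conv2d):
            m.weight.requires_grad = False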

I think my data is correct, and all the processing steps have been checked repeatedly… And my model does have some BN layers besides the ones in the pre-trained model (ResNet50).