Different training loss values when trying to combine two nn.Sequential() into one

I am using ResNet to extract image features. After the baseline network, I add some convolutional layers to fuse the features. I then tried to make my code neater, as shown below: Net1 is my original code and Net2 is the tidied-up version.

class Net1(nn.Module):
    def __init__(self, ):
        super(Net1, self).__init__()

        self.fusion = nn.Sequential(nn.Conv2d(512, 256, kernel_size=3, stride=1, padding=1, bias=False),
                                            nn.BatchNorm2d(256),
                                            nn.ReLU(inplace=True),
                                            )

        self.get_feat = nn.Sequential(nn.Conv2d(256, 256, kernel_size=3, stride=1, padding=1, bias=False),
                                            nn.BatchNorm2d(256),
                                            nn.ReLU(inplace=True),
                                            )

    def forward(self, x):
        feat = self.get_feat(self.fusion(x))
        return feat


class Net2(nn.Module):
    def __init__(self, ):
        super(Net2, self).__init__()
        self.feat_fusion = nn.Sequential(nn.Conv2d(512, 256, kernel_size=3, stride=1, padding=1, bias=False),
                                            nn.BatchNorm2d(256),
                                            nn.ReLU(inplace=True),
                                            nn.Conv2d(256, 256, kernel_size=3, stride=1, padding=1, bias=False),
                                            nn.BatchNorm2d(256),
                                            nn.ReLU(inplace=True),
                                            )


    def forward(self, x):
        feat = self.feat_fusion(x)
        return feat

Although I didn’t change any other configuration of my model, the loss of Net1 converged to about 0.8 and the loss of Net2 converged to about 0.2. Aren’t these two definitions equivalent? Why are the losses different?

Hi,

Although the two models are built from the same layers, they will almost never give identical results. You have only defined the layers; each layer has its own parameters (weights, biases) that are initialized randomly, and as we know about NNs, different initializations can lead to different local optima. So it would actually be strange if you got identical losses or results.

Optimizers and the random seeds used at different steps also affect the result. For a fair comparison, you need to fix every random generator seed to a constant and then feed the exact same input to both models, without even involving a loss or optimizer.

For issues related to seeding, please see the notes on randomness (reproducibility) in the PyTorch documentation.
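
As a rough illustration of "fix every seed", here is a minimal sketch of a helper that seeds the Python, NumPy and PyTorch generators in one place. The name `seed_everything` is made up for this example, and the two cuDNN flags only matter when running on a GPU; treat it as a starting point rather than a guarantee of bit-exact training runs.

```python
import random

import numpy as np
import torch


def seed_everything(seed: int = 0) -> None:
    """Hypothetical helper: seed the common random number generators."""
    random.seed(seed)        # Python's built-in RNG
    np.random.seed(seed)     # NumPy's global RNG
    torch.manual_seed(seed)  # PyTorch RNG (CPU, and CUDA devices if present)
    # Ask cuDNN for deterministic kernels; no effect on CPU-only runs.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False


# Re-seeding before each draw reproduces the same random tensor.
seed_everything(0)
a = torch.randn(3)
seed_everything(0)
b = torch.randn(3)
print(torch.equal(a, b))
```

Calling the helper right before each model is constructed means both models draw their initial weights from the same random stream.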

If you run the code below, you will get the same result from both models:

import numpy as np
import torch
import torch.nn as nn

torch.manual_seed(0)
np.random.seed(0)

class Net1(nn.Module):
    def __init__(self, ):
        super(Net1, self).__init__()

        self.fusion = nn.Sequential(nn.Conv2d(512, 256, kernel_size=3, stride=1, padding=1, bias=False),
                                            nn.BatchNorm2d(256),
                                            nn.ReLU(inplace=True),
                                            )

        self.get_feat = nn.Sequential(nn.Conv2d(256, 256, kernel_size=3, stride=1, padding=1, bias=False),
                                            nn.BatchNorm2d(256),
                                            nn.ReLU(inplace=True),
                                            )

    def forward(self, x):
        feat = self.get_feat(self.fusion(x))
        return feat

class Net2(nn.Module):
    def __init__(self, ):
        super(Net2, self).__init__()
        self.feat_fusion = nn.Sequential(nn.Conv2d(512, 256, kernel_size=3, stride=1, padding=1, bias=False),
                                            nn.BatchNorm2d(256),
                                            nn.ReLU(inplace=True),
                                            nn.Conv2d(256, 256, kernel_size=3, stride=1, padding=1, bias=False),
                                            nn.BatchNorm2d(256),
                                            nn.ReLU(inplace=True),
                                            )


    def forward(self, x):
        feat = self.feat_fusion(x)
        return feat

torch.manual_seed(0)
net1 = Net1()

torch.manual_seed(0)
net2 = Net2()

torch.manual_seed(0)
x = torch.randn(1, 512, 20, 20)

(net1(x) == net2(x)).all()  # true: two tensors are identical
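
An alternative that does not rely on seeds at all is to copy the weights from one layout into the other. The sketch below is only an illustration: it uses small channel counts (8 and 4 instead of 512 and 256) to run quickly, a made-up helper `conv_block`, and it assumes both layouts list their parameters and buffers in the same order, so the state dict entries can be matched by position even though the key names differ.

```python
import torch
import torch.nn as nn


def conv_block(c_in, c_out):
    """Hypothetical helper: the conv + BN + ReLU triple used in the thread."""
    return [
        nn.Conv2d(c_in, c_out, kernel_size=3, stride=1, padding=1, bias=False),
        nn.BatchNorm2d(c_out),
        nn.ReLU(inplace=True),
    ]


# Net1-style layout: two nn.Sequential blocks applied one after the other.
split = nn.Sequential(
    nn.Sequential(*conv_block(8, 4)),
    nn.Sequential(*conv_block(4, 4)),
)
# Net2-style layout: a single flat nn.Sequential with all six layers.
merged = nn.Sequential(*conv_block(8, 4), *conv_block(4, 4))

# Key names differ ("0.0.weight" vs "0.weight"), so pair entries by position.
src = split.state_dict()
dst_keys = list(merged.state_dict().keys())
assert len(src) == len(dst_keys)
merged.load_state_dict(dict(zip(dst_keys, src.values())))

split.eval()
merged.eval()
x = torch.randn(2, 8, 5, 5)
with torch.no_grad():
    same = torch.allclose(split(x), merged(x))
print(same)
```

With shared weights, any remaining difference in training loss has to come from somewhere other than how the layers are grouped.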

Best

Thanks for your response. I used torch.manual_seed() to set the random seeds to a constant number. It really works, and the training losses now converge to two close values.

But I found something weird. Without setting the random seeds, I converted the combined nn.Sequential() (like Net2) into a class based on nn.Module. The training losses of these two versions follow a similar trend and converge to close values, but they differ from the version that uses two consecutive nn.Sequential() blocks (like Net1). I don’t understand why.

Sorry I cannot follow; can you explain more about your experiment with Net1 and Net2?

I have fixed the bug. I forgot to apply the weight initialization to the output layer in the training code of Net1, which caused this weird training loss trend. When I use torch.manual_seed() together with the weight initialization, all three versions are equivalent.
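
For anyone who runs into the same thing, here is a minimal sketch of applying one initialization function to every layer with `Module.apply`, so that no layer is skipped by accident. The initialization scheme shown (Kaiming normal for convolutions, ones and zeros for BatchNorm) and the names `init_weights` and `conv_block` are assumptions for illustration only; the actual initialization in my training code is not shown in this thread. Channel counts are reduced to keep the example fast.

```python
import torch
import torch.nn as nn


def init_weights(m):
    """Hypothetical init scheme, applied to every submodule by Module.apply."""
    if isinstance(m, nn.Conv2d):
        nn.init.kaiming_normal_(m.weight, mode="fan_out", nonlinearity="relu")
        if m.bias is not None:
            nn.init.zeros_(m.bias)
    elif isinstance(m, nn.BatchNorm2d):
        nn.init.ones_(m.weight)
        nn.init.zeros_(m.bias)


def conv_block(c_in, c_out):
    """Hypothetical helper: conv + BN + ReLU, as in Net1 and Net2."""
    return [
        nn.Conv2d(c_in, c_out, kernel_size=3, stride=1, padding=1, bias=False),
        nn.BatchNorm2d(c_out),
        nn.ReLU(inplace=True),
    ]


# Same seed and same init function for both layouts.
torch.manual_seed(0)
split = nn.Sequential(
    nn.Sequential(*conv_block(8, 4)),
    nn.Sequential(*conv_block(4, 4)),
).apply(init_weights)

torch.manual_seed(0)
merged = nn.Sequential(*conv_block(8, 4), *conv_block(4, 4)).apply(init_weights)

# Both layouts create and initialize their layers in the same order,
# so the parameters should match pairwise.
params_match = all(
    torch.equal(p, q) for p, q in zip(split.parameters(), merged.parameters())
)
print(params_match)
```

Because `apply` walks every submodule, the initialization covers all layers regardless of how they are grouped into nn.Sequential blocks.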

Thank you for your help.
