Different outputs from the torch.nn.Sequential class and the conventional nn.Module subclass approach

I was using the following architecture for training my model:

import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvNet(nn.Module):
  def __init__(self):
    super(ConvNet, self).__init__()
    self.conv1 = nn.Conv2d(9, 27, kernel_size=3, stride=2, padding=0)
    self.conv2 = nn.Conv2d(27, 54, kernel_size=5, stride=1, padding=0)
    self.conv3 = nn.Conv2d(54, 64, kernel_size=6, stride=1, padding=1)
    self.pool1 = nn.MaxPool2d((2,2))
    self.conv4 = nn.Conv2d(64, 128, kernel_size=3, stride=1, padding=1)
    self.conv5 = nn.Conv2d(128, 128, kernel_size=5, stride=2, padding=0)
    self.pool2 = nn.MaxPool2d((2,2))
    self.conv6 = nn.Conv2d(128, 64, kernel_size=3, stride=1, padding=0)
    self.flat = nn.Flatten()
    self.fc1 = nn.Linear(8704, 512)   # 8704 = flattened feature size for my input resolution
    self.fc2 = nn.Linear(512, 4)

  def forward(self, x):
    x = F.relu(self.conv1(x))
    x = F.relu(self.conv2(x))
    x = F.relu(self.conv3(x))
    x = self.pool1(x)
    x = F.relu(self.conv4(x))
    x = F.relu(self.conv5(x))
    x = self.pool2(x)
    x = F.relu(self.conv6(x))
    x = self.flat(x)
    x = F.relu(self.fc1(x))
    x = self.fc2(x)
    return x

and was getting the following output on the test set after training:
(screenshot of the test-set predictions omitted)

These are not good predictions, since all the values are the same.

But as soon as I use the nn.Sequential class with the same architecture:

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = torch.nn.Sequential(
          torch.nn.Conv2d(9, 27, kernel_size=3, stride=2, padding=0),
          torch.nn.ReLU(),
          torch.nn.Conv2d(27, 54, kernel_size=5, stride=1, padding=0),
          torch.nn.ReLU(),
          torch.nn.MaxPool2d(2, 2),
          torch.nn.Conv2d(54, 64, kernel_size=6, stride=1, padding=1),
          torch.nn.ReLU(),
          torch.nn.Conv2d(64, 128, kernel_size=3, stride=1, padding=1),
          torch.nn.ReLU(),
          torch.nn.MaxPool2d(2, 2),
          torch.nn.Conv2d(128, 128, kernel_size=3, stride=2, padding=0),
          torch.nn.ReLU(),
          torch.nn.Conv2d(128, 128, kernel_size=3, stride=2, padding=0),
          torch.nn.ReLU(),
          torch.nn.MaxPool2d(2, 2),
          torch.nn.Flatten(),
          torch.nn.Linear(1024,512),
          torch.nn.ReLU(),
          torch.nn.Linear(512,4),
        ).to(device)

the output on the test set after training becomes:

[[3.2822053  1.3632872  2.3443744  0.00434288]]
[[ 2.5153766   0.00517752 -0.07172447 -0.04718025]]
[[ 7.12903     0.08480635  2.745857   -0.22266215]]
[[ 5.4862814   0.20487629  2.6381128  -0.03355677]]
[[0.766669  1.0363022 0.2687085 1.1925063]]
[[ 2.292058   0.2214803  2.2952085 -0.1321438]]
[[ 2.5587654   2.8598957   0.08964756 -0.16361943]]
[[ 1.3778323  -0.03345449  2.2093198  -0.11323967]]
[[ 2.2494226e+00  3.7341796e-02 -1.9486643e-02 -2.0913649e-03]]
[[-0.2568683  -0.16061799  0.03884238  2.7568922 ]]
[[ 2.6535883   0.06700002  0.12966754 -0.04277533]]
[[ 1.8448929   0.02760721  0.17743263 -0.07103067]]

I am not able to understand why this is happening, since there shouldn't be any difference between these two approaches. I have tried different architectures as well, but the nn.Sequential model always performed well on my training dataset while the conventional nn.Module version kept producing poor results.

The model architectures are not equal, so different behavior is expected.
Some differences after a quick check (model1 refers to the custom nn.Module and model2 to the nn.Sequential):

  • 2 pooling layers in model1, 3 in model2
  • additional torch.nn.Conv2d(128, 128, kernel_size=3, stride=2, padding=0) layer in model2
  • different linear layers (8704 vs. 1024 in_features)
  • 5179586 trainable parameters in model1, 1059074 in model2 (the snippet below reproduces these counts)
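
For reference, here is a minimal sketch to reproduce the parameter counts, assuming the ConvNet class and the Sequential model from the question have already been defined in the same session:

def count_trainable_params(m):
    # sum the number of elements over all parameters that require gradients
    return sum(p.numel() for p in m.parameters() if p.requires_grad)

model1 = ConvNet()   # custom nn.Module from the question
model2 = model       # nn.Sequential from the question

print(count_trainable_params(model1))  # 5179586
print(count_trainable_params(model2))  # 1059074

If the goal is an nn.Sequential that really matches the custom module, the layer order from ConvNet.forward can be translated directly; a sketch keeping the original hyperparameters:

model2_equivalent = torch.nn.Sequential(
    torch.nn.Conv2d(9, 27, kernel_size=3, stride=2, padding=0),
    torch.nn.ReLU(),
    torch.nn.Conv2d(27, 54, kernel_size=5, stride=1, padding=0),
    torch.nn.ReLU(),
    torch.nn.Conv2d(54, 64, kernel_size=6, stride=1, padding=1),
    torch.nn.ReLU(),
    torch.nn.MaxPool2d(2, 2),
    torch.nn.Conv2d(64, 128, kernel_size=3, stride=1, padding=1),
    torch.nn.ReLU(),
    torch.nn.Conv2d(128, 128, kernel_size=5, stride=2, padding=0),
    torch.nn.ReLU(),
    torch.nn.MaxPool2d(2, 2),
    torch.nn.Conv2d(128, 64, kernel_size=3, stride=1, padding=0),
    torch.nn.ReLU(),
    torch.nn.Flatten(),
    torch.nn.Linear(8704, 512),
    torch.nn.ReLU(),
    torch.nn.Linear(512, 4),
)

count_trainable_params(model2_equivalent) should then match model1 (5179586).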