Influence of Unused FFN on Model Accuracy in PyTorch

I am encountering a peculiar issue with my PyTorch model where the presence of an initialized but unused feed-forward network (FFN) affects the model's accuracy. Specifically, when the FFN is initialized in my CRS_A class but not used in the forward pass, my model's accuracy is higher than when I completely remove (or comment out) the FFN initialization. The FFN is defined as follows in my model's constructor:

import torch
import torch.nn as nn
import torch.nn.functional as F


class CRS_A(nn.Module):
    def __init__(self, modal_x, modal_y, hid_dim=128, d_ff=512, dropout_rate=0.1):
        super(CRS_A, self).__init__()

        self.cross_attention = CrossAttention(modal_y, modal_x, hid_dim)  # custom module defined elsewhere
        self.ffn = nn.Sequential(
            nn.Conv1d(modal_x, d_ff, kernel_size=1),
            nn.GELU(),
            nn.Dropout(dropout_rate),
            nn.Conv1d(d_ff, 128, kernel_size=1),
            nn.Dropout(dropout_rate),
        )
        self.norm = nn.LayerNorm(modal_x)

        self.linear1 = nn.Conv1d(1024, 512, kernel_size=1)
        self.linear2 = nn.Conv1d(512, 300, kernel_size=1)
        self.dropout1 = nn.Dropout(0.1)
        self.dropout2 = nn.Dropout(0.1)

    def forward(self, x, y, adj):
        x = x + self.cross_attention(y, x, adj)       # torch.Size([5, 67, 1024])
        x = self.norm(x).permute(0, 2, 1)
        x = self.dropout1(F.gelu(self.linear1(x)))    # torch.Size([5, 512, 67])
        x_e = self.dropout2(F.gelu(self.linear2(x)))  # torch.Size([5, 300, 67])

        return x_e, x

As you can see, self.ffn is never used in the forward pass. Despite this, removing or commenting out its initialization leads to a noticeable drop in accuracy.

Could this be due to some form of implicit regularization, or is there another explanation for this behavior? Has anyone encountered a similar situation, and how did you address it? Any insights or explanations would be greatly appreciated.

Initializing modules calls into the pseudorandom number generator and thus changes its state, which affects every random operation that follows (the parameter initialization of later layers, data shuffling, dropout masks, etc.) and therefore the overall training. You should be able to see a similar effect by simply changing the random seed and rerunning the training procedure.
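As a minimal sketch of what happens (using arbitrary layer sizes, not your model): constructing one extra module consumes draws from the global PRNG, so every layer created afterwards receives different initial weights.

import torch
import torch.nn as nn

torch.manual_seed(0)
_unused = nn.Conv1d(1024, 512, kernel_size=1)  # constructing it draws random numbers for its init
with_extra = nn.Linear(4, 4).weight.clone()

torch.manual_seed(0)
# same seed, but without the extra module this time
without_extra = nn.Linear(4, 4).weight.clone()

print(torch.equal(with_extra, without_extra))  # False: the later layer got different initial weights

Everything else that uses the global PRNG afterwards (data shuffling, dropout masks, etc.) diverges in the same way, which is why the gap you observe behaves like a seed change.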

Thank you for your response. Yes, you are right: if I change the seed or rerun the training, I can see the accuracy change even with the FFN initialization in place. What is the recommended solution here?

You could change some hyperparameters, such as the learning rate or the parameter initialization, and check whether this stabilizes the training.
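If you want the layers you actually train to receive the same initial weights regardless of whether the unused FFN is constructed, one option is to re-initialize each submodule with its own deterministic seed after building the model. A rough sketch: the weight init below mirrors PyTorch's default for Conv1d/Linear, while the zero bias, the seeding-by-name scheme, and the constructor arguments are simplifications/placeholders of my own.

import math
import zlib
import torch
import torch.nn as nn

def reinit_parameters(model, base_seed=0):
    # Seed the PRNG per submodule (keyed by the module's name) so each layer's
    # initial weights depend only on base_seed and its own name, not on which
    # other modules happen to be constructed in __init__.
    for name, module in model.named_modules():
        if isinstance(module, (nn.Conv1d, nn.Linear)):
            torch.manual_seed(base_seed + zlib.crc32(name.encode()))
            nn.init.kaiming_uniform_(module.weight, a=math.sqrt(5))  # PyTorch's default weight scheme
            if module.bias is not None:
                nn.init.zeros_(module.bias)  # simplification; the default draws the bias from a uniform range

model = CRS_A(modal_x=1024, modal_y=300)  # placeholder sizes, adjust to your data
reinit_parameters(model)
torch.manual_seed(0)  # also reset the global RNG before training (shuffling, dropout masks, etc.)

With something like this in place, adding or removing self.ffn should no longer change the initial weights of the layers you use, and any remaining accuracy difference between the two variants is run-to-run noise you can quantify by averaging over several seeds.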