I intend to build a neural network with 4 layers, each with 16 units.
My first approach is:
h_dim = 256
x_dim = 2

nets = lambda: nn.Sequential(nn.Linear(x_dim, h_dim),
                             nn.LeakyReLU(),
                             nn.Linear(h_dim, h_dim),
                             nn.LeakyReLU(),
                             nn.Linear(h_dim, x_dim),
                             nn.Tanh())
nett = lambda: nn.Sequential(nn.Linear(x_dim, h_dim),
                             nn.LeakyReLU(),
                             nn.Linear(h_dim, h_dim),
                             nn.LeakyReLU(),
                             nn.Linear(h_dim, x_dim))

self.t = torch.nn.ModuleList([nett() for _ in range(num_network)])
self.s = torch.nn.ModuleList([nets() for _ in range(num_network)])
My second approach is:

hidden_layers = [256, 256]
hs = [x_dim] + hidden_layers + [x_dim]

t_block = []
for h0, h1 in zip(hs, hs[1:]):
    t_block.extend([
        nn.Linear(h0, h1),
        nn.LeakyReLU(),
    ])
t_block.pop()  # drop the trailing LeakyReLU so the output layer has no activation
nett = nn.Sequential(*t_block)

s_block = []
for h0, h1 in zip(hs, hs[1:]):
    s_block.extend([
        nn.Linear(h0, h1),
        nn.LeakyReLU(),
    ])
s_block.pop()  # drop the trailing LeakyReLU
s_block.append(nn.Tanh())  # Tanh on the output of s
nets = nn.Sequential(*s_block)

self.t = torch.nn.ModuleList([nett for _ in range(num_network)])
self.s = torch.nn.ModuleList([nets for _ in range(num_network)])
These two approaches should be equivalent; however, across several experiments I found that the second approach gives much worse training and test results than the first one (even though I forced the initialization to be the same). The difference is non-trivial. Can anyone explain the reason behind this?
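For reference, here is a minimal standalone sketch (num_network = 4 is just a placeholder here; in my code it is defined elsewhere in the class) that builds the t networks both ways and compares the number of distinct trainable parameters in each ModuleList, as well as whether the list entries are the same object:

import torch.nn as nn

x_dim, h_dim = 2, 256
num_network = 4  # placeholder value, assumed for this sketch

# First approach: the factory is called once per entry of the ModuleList
nett = lambda: nn.Sequential(nn.Linear(x_dim, h_dim), nn.LeakyReLU(),
                             nn.Linear(h_dim, h_dim), nn.LeakyReLU(),
                             nn.Linear(h_dim, x_dim))
t_first = nn.ModuleList([nett() for _ in range(num_network)])

# Second approach: one prebuilt Sequential is put into the ModuleList num_network times
hs = [x_dim, 256, 256, x_dim]
t_block = []
for h0, h1 in zip(hs, hs[1:]):
    t_block.extend([nn.Linear(h0, h1), nn.LeakyReLU()])
t_block.pop()  # drop the trailing LeakyReLU
nett2 = nn.Sequential(*t_block)
t_second = nn.ModuleList([nett2 for _ in range(num_network)])

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(t_first), count(t_second))                        # distinct trainable parameters
print(t_first[0] is t_first[1], t_second[0] is t_second[1])   # object identity of the entries

If the two approaches really are equivalent, both the parameter counts and the identity checks should come out the same.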