I intend to build a neural network with 4 layers, each with 16 units.
My first approach is:
h_dim = 256
x_dim = 2

nets = lambda: nn.Sequential(nn.Linear(x_dim, h_dim),
                             nn.LeakyReLU(),
                             nn.Linear(h_dim, h_dim),
                             nn.LeakyReLU(),
                             nn.Linear(h_dim, x_dim),
                             nn.Tanh())
nett = lambda: nn.Sequential(nn.Linear(x_dim, h_dim),
                             nn.LeakyReLU(),
                             nn.Linear(h_dim, h_dim),
                             nn.LeakyReLU(),
                             nn.Linear(h_dim, x_dim))

self.t = torch.nn.ModuleList([nett() for _ in range(num_network)])
self.s = torch.nn.ModuleList([nets() for _ in range(num_network)])
My second approach is:

hidden_layers = [256, 256]
hs = [x_dim] + hidden_layers + [x_dim]

t_block = []
for h0, h1 in zip(hs, hs[1:]):
    t_block.extend([
        nn.Linear(h0, h1),
        nn.LeakyReLU(),
    ])
t_block.pop()  # drop the trailing LeakyReLU so the output layer has no activation
nett = nn.Sequential(*t_block)

s_block = []
for h0, h1 in zip(hs, hs[1:]):
    s_block.extend([
        nn.Linear(h0, h1),
        nn.LeakyReLU(),
    ])
s_block.pop()  # drop the trailing LeakyReLU
s_block.append(nn.Tanh())  # Tanh on the output of s
nets = nn.Sequential(*s_block)

self.t = torch.nn.ModuleList([nett for _ in range(num_network)])
self.s = torch.nn.ModuleList([nets for _ in range(num_network)])
These two approaches should be equivalent; however, across several experiments I found that the second approach gives much worse training and test results than the first one (even though I forced the initialization to be the same). The difference is non-trivial. Can anyone explain the reason behind this?
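For reference, here is a minimal standalone sketch (num_network = 4 is just a placeholder here; in my code it is defined elsewhere in the class) that builds the t networks both ways and compares the number of distinct trainable parameters in each ModuleList, as well as whether the list entries are the same object:

import torch.nn as nn

x_dim, h_dim = 2, 256
num_network = 4  # placeholder value, assumed for this sketch

# First approach: the factory is called once per entry of the ModuleList
nett = lambda: nn.Sequential(nn.Linear(x_dim, h_dim), nn.LeakyReLU(),
                             nn.Linear(h_dim, h_dim), nn.LeakyReLU(),
                             nn.Linear(h_dim, x_dim))
t_first = nn.ModuleList([nett() for _ in range(num_network)])

# Second approach: one prebuilt Sequential is put into the ModuleList num_network times
hs = [x_dim, 256, 256, x_dim]
t_block = []
for h0, h1 in zip(hs, hs[1:]):
    t_block.extend([nn.Linear(h0, h1), nn.LeakyReLU()])
t_block.pop()  # drop the trailing LeakyReLU
nett2 = nn.Sequential(*t_block)
t_second = nn.ModuleList([nett2 for _ in range(num_network)])

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(t_first), count(t_second))                        # distinct trainable parameters
print(t_first[0] is t_first[1], t_second[0] is t_second[1])   # object identity of the entries

If the two approaches really are equivalent, both the parameter counts and the identity checks should come out the same.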