I made a mistake: the extra non-linear layer should be added at the end, i.e. tanh(linear(rnn_output)), since the RNN output already has a tanh non-linearity applied to it.
Apparently PyTorch does initialization by default, so it will also work if you don’t initialize manually. However, I played with different initializations, and the results vary a lot. With an initialization close to zero (e.g., the one I gave), the results are quite bad, but they get significantly better and close to hidden_size=1 when I use a larger range, e.g. uniform(-2, 2). I also tried initializing the layer as weight = [[0, 1]], bias = 0, i.e. always taking the second element of the hidden state, and also got results as good as hidden_size=1.
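If you want to reproduce that last experiment, here is a minimal sketch of fixing the readout to the second hidden unit (for nn.Linear(2, 1) the weight has shape (1, 2), so the row [[0, 1]] picks out the second element; any equivalent way of writing the weights works):

import torch
import torch.nn as nn

linear = nn.Linear(2, 1)
with torch.no_grad():
    # ignore the first hidden unit, pass the second one straight through
    linear.weight.copy_(torch.tensor([[0.0, 1.0]]))
    linear.bias.zero_()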
In addition, the results vary a lot from run to run as well, indicating that the added layer made the network less stable. This is intuitive, as input_size and output_size are both only 1, so adding more parameters may unnecessarily complicate the error surface. It also seems that adding Linear(2, 1) made the network more sensitive to initialization, which can be a bad thing.
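If you want to compare different initializations more fairly across runs, one generic trick (not something from the code below, just a suggestion) is to fix the random seeds before building the model, so repeated runs start from the same weights and data:

import numpy as np
import torch

torch.manual_seed(0)   # makes PyTorch's default weight initialization repeatable
np.random.seed(0)      # only relevant if the data-generating sample() function uses NumPy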
Finally, I found that increasing num_layers can improve the results a bit.
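For example, that only means changing the num_layers argument when constructing the RNN (the 2 here is just an illustrative value, not a setting I tuned):

import torch.nn as nn

rnn = nn.RNN(input_size=1, hidden_size=2, num_layers=2, batch_first=True)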
Here’s what I changed in your code, if you want to give it a try:
import torch
import torch.nn as nn
from torch.autograd import Variable

input_dim = 1
hidden_size = 2
num_layers = 1
rnn = nn.RNN(input_size=input_dim, hidden_size=hidden_size, num_layers=num_layers, batch_first=True)
tanh = nn.Tanh()
linear = nn.Linear(hidden_size, 1)   # readout from the 2-d hidden state to a single output
linear.weight.data.uniform_(-3, 3)   # larger-range init; near-zero init gave much worse results
linear.bias.data.zero_()
# note: only the RNN's parameters are optimized here, so the linear readout
# keeps whatever initialization it was given above
optimizer = torch.optim.Adam(rnn.parameters(), lr=1e-2)
loss_func = nn.MSELoss()
for t in range(1000):
    inp, out = sample(100)  # sample() is the data-generating function from the original code
    inp = Variable(torch.Tensor(inp.reshape((1, -1, 1))), requires_grad=True)
    out = Variable(torch.Tensor(out.reshape((1, -1, 1))))
    pred, hidden = rnn(inp, None)
    # apply the linear readout per time step, then the tanh at the end
    pred = tanh(linear(pred.view(-1, hidden_size))).view(1, -1, 1)
    optimizer.zero_grad()
    loss = loss_func(pred, out)
    print(t, loss.item())
    loss.backward()
    optimizer.step()
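One caveat: sample(100) above is the data-generating function from the original code, which I don't reproduce here. If anyone else wants to run the snippet, a stand-in along these lines should work; it is purely a guess at the interface (two equal-length 1-D arrays, input and target), not the original function:

import numpy as np

def sample(n):
    # hypothetical stand-in: input is a randomly shifted sine wave,
    # target is the same wave advanced by one step (predict the next value)
    t = np.linspace(0, 4 * np.pi, n + 1) + np.random.uniform(0, 2 * np.pi)
    wave = np.sin(t)
    return wave[:-1], wave[1:]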