Problem with vanilla RNN

Nipun_Batra · January 12, 2018, 8:09pm

Hi, I am coming from the Keras world. I am trying to create a simple RNN experiment: denoising a sine wave. Given a sine wave with noise, the model should estimate a denoised version.

I have a few issues:

My code wouldn’t work without retain_graph=True. Why do I have to use retain_graph=True?
This code doesn’t work when I set the hidden_size to be more than 1.

Am I correctly using RNN? I’d really love to help with some more simple examples for RNNs/LSTMs

 import torch
 import torch.nn as nn
 from torch.nn import functional as F
 from torch.autograd import Variable
 from torch import optim
 import numpy as np
 import math, random
 import matplotlib.pyplot as plt

 # Borrowed from https://gist.github.com/spro/ef26915065225df65c1187562eca7ec4

 def sine_2(X, signal_freq=60.):
     return np.sin(2 * np.pi * (X) / signal_freq)

 def noisy(Y, noise_range=(-0.15, 0.15)):
     noise = np.random.uniform(noise_range[0], noise_range[1], size=Y.shape)
     return Y + noise

 def sample(sample_size):
     random_offset = random.randint(0, sample_size)
     X = np.arange(sample_size)
     out = sine_2(X + random_offset)
     inp = noisy(out)
     return inp, out


 input_dim = 1
 hidden_size = 1
 num_layers = 1
 rnn = nn.RNN(input_size=input_dim, hidden_size=hidden_size, num_layers=num_layers, batch_first=True)
 hidden = None

 optimizer = torch.optim.Adam(rnn.parameters(), lr=1e-2)   
 loss_func = nn.MSELoss()

 for t in range(10):
     inp, out = sample(100)
     inp = Variable(torch.Tensor(inp.reshape((1, -1, 1))), requires_grad=True)
     out = Variable(torch.Tensor(out.reshape((1, -1, 1))) )
     pred, hidden = rnn(inp, hidden)
     optimizer.zero_grad()
     loss = loss_func(pred, out)   
     loss.backward(retain_graph=True)
     optimizer.step()

SimonW · January 12, 2018, 8:18pm

Your hidden carries the computation graph from the last iteration, so the next backward() tries to go back through a graph whose buffers have already been freed. That’s why it errors without retain_graph.

Assuming each iteration is supposed to train on an independent sample, just always pass hidden=None as your input to rnn.
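
To make the point above concrete, here is a minimal sketch. It is not taken from the original code: the step helper and the random data are assumptions for illustration, and it uses plain tensors (PyTorch 0.4+) rather than Variable. It shows the failure when a hidden state is reused as-is, and two ways around it: passing None each time, or detaching the hidden state if you do want to carry its values across iterations.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
rnn = nn.RNN(input_size=1, hidden_size=1, num_layers=1, batch_first=True)
loss_func = nn.MSELoss()

def step(hidden):
    """One forward/backward pass on random data (hypothetical helper)."""
    x = torch.randn(1, 100, 1)
    target = torch.randn(1, 100, 1)
    pred, new_hidden = rnn(x, hidden)
    loss = loss_func(pred, target)
    rnn.zero_grad()
    loss.backward()
    return new_hidden

# Reusing the hidden state directly: the second backward() reaches into the
# first iteration's graph, whose buffers were freed by the first backward().
hidden = step(None)
try:
    step(hidden)
    reused_ok = True
except RuntimeError:
    reused_ok = False

# Option 1: independent samples, so start from a fresh hidden state each time.
for _ in range(3):
    step(None)

# Option 2: keep the hidden values but cut the graph with detach().
hidden = None
for _ in range(3):
    hidden = step(hidden).detach()
```

Option 1 matches the advice in this reply. Option 2 is only relevant when consecutive samples really are a continuation of the same sequence.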

Nipun_Batra · January 12, 2018, 8:21pm

Many thanks. Makes sense! As far as issue #2 is concerned, would that require a nn.Linear?

SimonW · January 12, 2018, 8:23pm

What’s the error you are seeing with hidden size > 1?

Nipun_Batra · January 12, 2018, 8:24pm

With hidden_size = 2, I get the following error:

RuntimeError: input and target have different number of elements: input[1 x 100 x 2] has 200 elements, while target[1 x 100 x 1] has 100 elements at /Users/soumith/miniconda2/conda-bld/pytorch_1503975723910/work/torch/lib/THNN/generic/MSECriterion.c:1234

SimonW · January 12, 2018, 8:34pm

Oh I see. The output size for RNN is [seq_len, batch, hidden_size * num_directions] (or [batch, seq_len, hidden_size * num_directions] with batch_first=True, as in your code), i.e., it directly outputs the hidden state at each time step. So yes, a linear layer nn.Linear(hidden_size, 1) should work.
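
A quick shape check may help here. This is a sketch under the same settings as the question (batch_first=True, hidden_size=2, one sequence of length 100); the random input stands in for the sine data.

```python
import torch
import torch.nn as nn

hidden_size = 2
rnn = nn.RNN(input_size=1, hidden_size=hidden_size, num_layers=1, batch_first=True)
linear = nn.Linear(hidden_size, 1)

x = torch.randn(1, 100, 1)   # [batch, seq_len, input_size]
out, h_n = rnn(x)            # hidden state defaults to zeros when omitted

print(out.shape)   # hidden state at every time step: [batch, seq_len, hidden_size]
print(h_n.shape)   # last hidden state: [num_layers, batch, hidden_size]

# nn.Linear acts on the last dimension, so it maps each time step's
# hidden state to a single value and the result matches the target shape.
pred = linear(out)
print(pred.shape)
```

Because nn.Linear operates on the last dimension, no reshaping is strictly needed before the projection.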

Doc is here for your reference: http://pytorch.org/docs/master/nn.html#torch.nn.RNN

Nipun_Batra · January 12, 2018, 8:45pm

Makes sense. However, changing to

rnn = nn.RNN(input_size=input_dim, hidden_size=hidden_size, num_layers=num_layers, batch_first=True)
linear = nn.Linear(hidden_size, 1)

and, in the loop:

pred, hidden = rnn(inp, hidden)
pred = linear(pred)

makes learning much worse. It is definitely slower and doesn’t converge well. Even for hidden_size=1, the performance is much worse with the addition of nn.Linear().

SimonW · January 12, 2018, 8:49pm

Did you initialize the linear’s weights properly? A very simple normal init would be

linear.weight.data.normal_(0, 0.02)
linear.bias.data.zero_()

Also, ideally there should be a non-linear activation between them, but I don’t think it matters too much in this case.

Nipun_Batra · January 12, 2018, 8:54pm

So, I made the following changes:

linear = nn.Linear(hidden_size, 1, bias=False)
linear.weight.data.normal_(0, 0.02)

However, this makes the model perform even worse!

SimonW · January 12, 2018, 8:55pm

My bad, see my reply below.

Nipun_Batra · January 12, 2018, 9:20pm

Thanks for your efforts. Before starting this example, I’d thought it should be trivial. Didn’t turn out so!

SimonW · January 12, 2018, 10:08pm

I made a mistake: the non-linear layer should be added at the end, since the RNN output already has a non-linearity applied.

Apparently PyTorch initializes layers by default, so it will also work if you don’t initialize manually. However, I played with different initializations, and the results vary a lot. With initialization close to zero (e.g., the one I gave), the results are quite bad. But they get significantly better, and close to hidden_size=1, when I use a larger range, e.g. uniform(-2, 2). I also tried initializing the layer as weight = [[0, 1]], bias = 0, i.e. always taking the second element, and also got results as good as hidden_size=1.
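
For reference, the initializations mentioned above could be written roughly as follows. This is a sketch using torch.nn.init rather than the .data calls from earlier in the thread, and the test vector h is made up. Note that nn.Linear(2, 1) stores its weight with shape [1, 2], so "always take the second element" corresponds to the row [0, 1].

```python
import torch
import torch.nn as nn

hidden_size = 2
linear = nn.Linear(hidden_size, 1)
print(linear.weight.shape)  # [out_features, in_features]

# Small normal init (the one that gave poor results in this experiment).
nn.init.normal_(linear.weight, mean=0.0, std=0.02)

# Wider uniform init.
nn.init.uniform_(linear.weight, -2.0, 2.0)

# Hand-set weights that simply pass through the second hidden unit.
with torch.no_grad():
    linear.weight.copy_(torch.tensor([[0.0, 1.0]]))
    linear.bias.zero_()

h = torch.tensor([[0.3, -0.7]])   # made-up hidden state
picked = linear(h)                # equals the second element of h
print(picked)
```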

In addition, the results vary a lot from run to run as well, indicating that the added layer made the network less stable. This is intuitive, as input_size and the output size are both only 1, so adding more parameters may unnecessarily complicate the error surface. It also seems that adding Linear(2, 1) made the network more sensitive to initialization, which can be a bad thing.

Finally, I found that increasing num_layers can improve the results a bit.

Here’s what I changed in your code if you want to give it a try:


input_dim = 1
hidden_size = 2
num_layers = 1
rnn = nn.RNN(input_size=input_dim, hidden_size=hidden_size,
             num_layers=num_layers, batch_first=True)
tanh = nn.Tanh()
linear = nn.Linear(hidden_size, 1)
linear.weight.data.uniform_(-3, 3)
linear.bias.data.zero_()

# Pass the linear layer's parameters too; otherwise only the RNN is trained.
params = list(rnn.parameters()) + list(linear.parameters())
optimizer = torch.optim.Adam(params, lr=1e-2)
loss_func = nn.MSELoss()

for t in range(1000):
    inp, out = sample(100)
    inp = Variable(torch.Tensor(inp.reshape((1, -1, 1))), requires_grad=True)
    out = Variable(torch.Tensor(out.reshape((1, -1, 1))))
    # Always start from a fresh hidden state, since samples are independent.
    pred, hidden = rnn(inp, None)
    # Project each hidden state to one value, then squash it into (-1, 1).
    pred = tanh(linear(pred.view(-1, hidden_size))).view(1, -1, 1)
    optimizer.zero_grad()
    loss = loss_func(pred, out)
    print(t, loss.item())  # use loss.data[0] on PyTorch < 0.4
    loss.backward()
    optimizer.step()
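
One thing worth checking in a setup with two modules is which parameters the optimizer was actually given, since anything not passed to it keeps its initial values. A small sketch, where num_optimized is a hypothetical helper and not part of PyTorch:

```python
import torch
import torch.nn as nn

rnn = nn.RNN(input_size=1, hidden_size=2, num_layers=1, batch_first=True)
linear = nn.Linear(2, 1)

def num_optimized(optimizer):
    """Count the scalar parameters an optimizer will update (hypothetical helper)."""
    return sum(p.numel() for group in optimizer.param_groups for p in group["params"])

# Only the RNN's weights and biases are updated here.
rnn_only = torch.optim.Adam(rnn.parameters(), lr=1e-2)

# Both modules are updated here.
both = torch.optim.Adam(
    list(rnn.parameters()) + list(linear.parameters()), lr=1e-2
)

n_rnn_only = num_optimized(rnn_only)
n_both = num_optimized(both)
print(n_rnn_only, n_both)
```

If the linear layer is left out of the optimizer, its weights never move, which would make the outcome depend heavily on how it was initialized.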

Nipun_Batra · January 12, 2018, 10:17pm

Thanks a ton. I am sure this took effort - fiddling with parameters, activations.

It’s hard for me to understand why we aren’t learning something very useful here! I calculated the MSE loss between (input, output) and between (pred, output). The MSE is lower for (input, output)! This means we aren’t denoising well enough; in fact, not modifying the input at all would have resulted in a lower MSE.
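
The (input, output) comparison has a known reference value. Assuming the noise is uniform on (-0.15, 0.15), as in noisy() above, the MSE of the untouched input against the clean signal is just the noise variance, (b - a)^2 / 12 = 0.0075. A sketch with NumPy that checks this numerically (the sample count and seed are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)

# Clean signal and noisy input, mirroring sine_2() and noisy() from the question.
X = np.arange(100_000)
clean = np.sin(2 * np.pi * X / 60.0)
noise = rng.uniform(-0.15, 0.15, size=clean.shape)
noisy_inp = clean + noise

# MSE of the "do nothing" baseline: return the input unchanged.
baseline_mse = np.mean((noisy_inp - clean) ** 2)

# Variance of a uniform distribution on (a, b) is (b - a)^2 / 12.
expected = (0.15 - (-0.15)) ** 2 / 12

print(baseline_mse, expected)
```

Any denoiser needs a loss below roughly 0.0075 on this data to do better than the identity function.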

Would increasing the number of training samples help? Or, should I make the problem tougher to ensure that the network can learn a better function than identity?

Should we really have two RNNs as is usually done in Seq2seq tasks - one being an encoder, the other being a decoder?

I also feel that this trivial example would work substantially better if we could incorporate the notion of attention.

Nipun_Batra · January 13, 2018, 9:20pm

Based on the input in this thread, I wrote a blog post on denoising using RNNs in PyTorch. I feel it could be a useful example to add to the documentation.
https://nipunbatra.github.io/blog/2018/denoising.html

Pinging @smth, @apaszke and team.

Happy to improve the post and get your feedback.