Problem with vanilla RNN

Hi, I am coming from the Keras world. I am trying to set up a simple RNN experiment: denoising a sine wave. Given a sine wave with noise, the model should estimate a denoised version.

I have a few issues:

  1. My code wouldn’t work without retain_graph=True. Why do I have to use retain_graph=True?

  2. This code doesn’t work when I set the hidden_size to be more than 1.

  3. Am I using RNN correctly? I’d really love some help in the form of simpler examples for RNNs/LSTMs.

     import torch
     import torch.nn as nn
     from torch.nn import functional as F
     from torch.autograd import Variable
     from torch import optim
     import numpy as np
     import math, random
     import matplotlib.pyplot as plt
    
     # Borrowed from https://gist.github.com/spro/ef26915065225df65c1187562eca7ec4
    
     def sine_2(X, signal_freq=60.):
         return np.sin(2 * np.pi * (X) / signal_freq)
    
     def noisy(Y, noise_range=(-0.15, 0.15)):
         noise = np.random.uniform(noise_range[0], noise_range[1], size=Y.shape)
         return Y + noise
    
     def sample(sample_size):
         random_offset = random.randint(0, sample_size)
         X = np.arange(sample_size)
         out = sine_2(X + random_offset)
         inp = noisy(out)
         return inp, out
    
    
     input_dim = 1
     hidden_size = 1
     num_layers = 1
     rnn = nn.RNN(input_size=input_dim, hidden_size=hidden_size, num_layers=num_layers, batch_first=True)
     hidden = None
    
     optimizer = torch.optim.Adam(rnn.parameters(), lr=1e-2)   
     loss_func = nn.MSELoss()
    
     for t in range(10):
         inp, out = sample(100)
         inp = Variable(torch.Tensor(inp.reshape((1, -1, 1))), requires_grad=True)
         out = Variable(torch.Tensor(out.reshape((1, -1, 1))) )
         pred, hidden = rnn(inp, hidden)
         optimizer.zero_grad()
         loss = loss_func(pred, out)   
         loss.backward(retain_graph=True)
         optimizer.step()

Your hidden carries the computation graph from the previous iteration. That’s why it errors without retain_graph.

Assuming each iteration is meant to train on an independent sample, just always pass hidden=None as the input to rnn.
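A minimal sketch of the two usual ways around this, assuming a recent PyTorch where plain tensors replace Variable. The random tensors are only stand-ins for the sine samples, and the two loops are illustrative rather than a drop-in replacement for the code above.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
rnn = nn.RNN(input_size=1, hidden_size=1, num_layers=1, batch_first=True)
optimizer = torch.optim.Adam(rnn.parameters(), lr=1e-2)
loss_func = nn.MSELoss()

# Option 1: samples are independent, so start from a fresh hidden state each time.
for step in range(3):
    inp = torch.randn(1, 100, 1)     # stand-in for a noisy sine sample
    target = torch.randn(1, 100, 1)  # stand-in for the clean sine sample
    pred, _ = rnn(inp, None)         # None means a zero initial hidden state
    optimizer.zero_grad()
    loss_func(pred, target).backward()  # no retain_graph needed
    optimizer.step()

# Option 2: carry the hidden state forward, but cut it off from the old graph.
hidden = None
for step in range(3):
    inp = torch.randn(1, 100, 1)
    target = torch.randn(1, 100, 1)
    pred, hidden = rnn(inp, hidden)
    hidden = hidden.detach()         # keep the values, drop the history
    optimizer.zero_grad()
    loss_func(pred, target).backward()
    optimizer.step()
```

Option 2 only makes sense when consecutive samples are pieces of one longer sequence; for the random-offset samples in this thread, option 1 is the natural choice.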


Many thanks. Makes sense! As far as issue #2 is concerned, would that require an nn.Linear layer?

What’s the error you are seeing with hidden size > 1?

With hidden_size = 2, I get the following error:

RuntimeError: input and target have different number of elements: input[1 x 100 x 2] has 200 elements, while target[1 x 100 x 1] has 100 elements at /Users/soumith/miniconda2/conda-bld/pytorch_1503975723910/work/torch/lib/THNN/generic/MSECriterion.c:1234

Oh I see. The output size for RNN is [seq_len, batch, hidden_size * num_directions], or [batch, seq_len, hidden_size * num_directions] with batch_first=True as in your code, i.e., it directly outputs the hidden state at each time step. So yes, a linear layer nn.Linear(hidden_size, 1) should work.

Doc is here for your reference: http://pytorch.org/docs/master/nn.html#torch.nn.RNN
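To make the shapes concrete, here is a small sketch, assuming a recent PyTorch in which nn.Linear can be applied directly to a 3-D tensor (it acts on the last dimension). The random input is just a placeholder for one noisy sample.

```python
import torch
import torch.nn as nn

hidden_size = 2
rnn = nn.RNN(input_size=1, hidden_size=hidden_size, num_layers=1, batch_first=True)
linear = nn.Linear(hidden_size, 1)

inp = torch.randn(1, 100, 1)    # [batch, seq_len, input_size]
out, h_n = rnn(inp)             # out holds the hidden state at every time step
out_shape = tuple(out.shape)    # [batch, seq_len, hidden_size]
pred = linear(out)              # project each hidden state down to one value
pred_shape = tuple(pred.shape)  # [batch, seq_len, 1], same as the target
print(out_shape, pred_shape)    # (1, 100, 2) (1, 100, 1)
```

After the projection, pred has the same number of elements as the target, so MSELoss no longer complains.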


Makes sense. However, changing to

rnn = nn.RNN(input_size=input_dim, hidden_size=hidden_size, num_layers=num_layers, batch_first=True)
linear = nn.Linear(hidden_size, 1)

and, in the loop:

pred, hidden = rnn(inp, hidden)
pred = linear(pred)

makes learning much worse. It is definitely slower and doesn’t converge well. Even for hidden_size=1, the performance is much worse with the addition of nn.Linear().

Did you initialize the linear’s weights properly? A very simple normal init would be

linear.weight.data.normal_(0, 0.02)
linear.bias.data.zero_()

Also, ideally there should be non-linear activations between them, but I don’t think it matters too much in this case.

So, I made the following changes:

linear = nn.Linear(hidden_size, 1, bias=False)
linear.weight.data.normal_(0, 0.02)

However, this makes the model perform even worse!

my bad, see reply below

Thanks for your efforts. Before starting this example, I’d thought it should be trivial. Didn’t turn out so!

I made a mistake: the non-linear layer should be added at the end, since the RNN output already has a non-linearity applied.

Apparently PyTorch does initialization by default :) So if you don’t manually initialize, it will also work. However, I played with different initializations, and the results vary a lot. With initialization close to zero (e.g., the one I gave), the results are quite bad. But they get significantly better, and close to hidden_size=1, when I use a larger range, e.g. uniform(-2, 2). I also tried initializing the layer with weight = [[0, 1]] (shape [1, 2]) and bias = 0, i.e. always taking the second element, and also got results as good as hidden_size=1.
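The initializations mentioned here can be sketched as follows. The claim that the default nn.Linear init draws weights within ±1/sqrt(in_features) is my understanding of current PyTorch rather than something stated in this thread, so treat it as an assumption; the input vector h is made up for illustration.

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)
linear = nn.Linear(2, 1)

# Default init (assumed): weights lie within +/- 1/sqrt(in_features).
bound = 1 / math.sqrt(linear.in_features)
default_max = linear.weight.abs().max().item()

# The wider range tried above.
with torch.no_grad():
    linear.weight.uniform_(-2, 2)
    linear.bias.zero_()

# "Always take the second hidden unit": weight has shape [out, in] = [1, 2].
with torch.no_grad():
    linear.weight.copy_(torch.tensor([[0.0, 1.0]]))
h = torch.tensor([[0.3, -0.7]])  # hypothetical hidden state
picked = linear(h)               # equals the second component of h
```

With the last initialization the linear layer starts out as a pass-through for one hidden unit, which is why it behaves like the hidden_size=1 model at the start of training.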

In addition, the results vary a lot from run to run, indicating that the added layer made the network less stable. This is intuitive, as the input_size and output_size are both only 1, so adding more parameters may unnecessarily complicate the error surface. And it seems that adding Linear(2, 1) made the network more sensitive to initialization, which can be a bad thing.

Finally, I found that increasing num_layers can improve the results a bit.

Here’s what I changed in your code, if you want to give it a try:


input_dim = 1
hidden_size = 2
num_layers = 1
rnn = nn.RNN(input_size=input_dim, hidden_size=hidden_size, num_layers=num_layers, batch_first=True)
tanh = nn.Tanh()
linear = nn.Linear(hidden_size, 1)
linear.weight.data.uniform_(-3, 3)
linear.bias.data.zero_()

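# Optimize both the RNN and the linear layer; passing only rnn.parameters() would leave linear untrained.
optimizer = torch.optim.Adam(list(rnn.parameters()) + list(linear.parameters()), lr=1e-2)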
loss_func = nn.MSELoss()

for t in range(1000):
    inp, out = sample(100)
    inp = Variable(torch.Tensor(inp.reshape((1, -1, 1))), requires_grad=True)
    out = Variable(torch.Tensor(out.reshape((1, -1, 1))) )
    pred, hidden = rnn(inp, None)
    pred = tanh(linear(pred.view(-1, hidden_size))).view(1, -1, 1)
    optimizer.zero_grad()
    loss = loss_func(pred, out)
    print(t, loss.item())  # on PyTorch < 0.4, use loss.data[0] instead
    loss.backward()
    optimizer.step()

Thanks a ton. I am sure this took effort - fiddling with parameters, activations.

It’s hard for me to understand why we aren’t learning something very useful here! I calculated the MSE loss between (input, output) and between (pred, output). The MSE is lower for (input, output)! This means we aren’t denoising well enough; in fact, leaving the input unmodified would have resulted in a lower MSE.
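For reference, the identity baseline can be worked out directly. With noise drawn from uniform(-0.15, 0.15) as in noisy() above, the expected MSE of returning the input unchanged is the noise variance, (b - a)^2 / 12 = 0.0075. A small NumPy sketch (the sample length and seed are arbitrary choices of mine):

```python
import numpy as np

rng = np.random.default_rng(0)

# Clean and noisy signals, mirroring sine_2() and noisy() from the first post.
clean = np.sin(2 * np.pi * np.arange(100_000) / 60.0)
noisy = clean + rng.uniform(-0.15, 0.15, size=clean.shape)

# MSE of the "do nothing" model that returns its input unchanged.
baseline_mse = np.mean((noisy - clean) ** 2)

# Variance of a uniform distribution on [a, b] is (b - a)^2 / 12.
theoretical = (0.15 - (-0.15)) ** 2 / 12

print(baseline_mse, theoretical)  # both close to 0.0075
```

A trained model is only denoising if its MSE against the clean signal falls below this value.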

Would increasing the number of training samples help? Or, should I make the problem tougher to ensure that the network can learn a better function than identity?

Should we really have two RNNs as is usually done in Seq2seq tasks - one being an encoder, the other being a decoder?

I also feel that this trivial example would work substantially better if we could incorporate the notion of attention.

Based on inputs in this thread, I wrote a blog post here for denoising using RNNs in PyTorch. I feel it could be a useful example to add to the documentation.
https://nipunbatra.github.io/blog/2018/denoising.html

Pinging @smth, @apaszke and team.

Happy to improve the post and get your feedback.