# Problem with vanilla RNN

Hi, I am coming from the Keras world. I am trying to create a simple experiment for RNN - denoise a sine wave. Thus, given a sine wave with noise, the model should estimate a denoised version.

I have a few issues:

1. My code wouldn’t work without `retain_graph=True`. Why do I have to use `retain_graph=True`?

2. This code doesn’t work when I set the `hidden_size` to be more than 1.

3. Am I using the RNN correctly? I'd really love to help with adding some more simple examples for RNNs/LSTMs.

``````
import torch
import torch.nn as nn
from torch.nn import functional as F
from torch import optim
from torch.autograd import Variable
import numpy as np
import math, random
import matplotlib.pyplot as plt

# Borrowed from https://gist.github.com/spro/ef26915065225df65c1187562eca7ec4

def sine_2(X, signal_freq=60.):
    return np.sin(2 * np.pi * (X) / signal_freq)

def noisy(Y, noise_range=(-0.15, 0.15)):
    noise = np.random.uniform(noise_range[0], noise_range[1], size=Y.shape)
    return Y + noise

def sample(sample_size):
    random_offset = random.randint(0, sample_size)
    X = np.arange(sample_size)
    out = sine_2(X + random_offset)
    inp = noisy(out)
    return inp, out

input_dim = 1
hidden_size = 1
num_layers = 1
rnn = nn.RNN(input_size=input_dim, hidden_size=hidden_size, num_layers=num_layers, batch_first=True)
hidden = None

loss_func = nn.MSELoss()
# an optimizer definition was missing above; Adam is just one reasonable choice
optimizer = optim.Adam(rnn.parameters(), lr=0.01)

for t in range(10):
    inp, out = sample(100)
    inp = Variable(torch.Tensor(inp.reshape((1, -1, 1))), requires_grad=True)
    out = Variable(torch.Tensor(out.reshape((1, -1, 1))))
    optimizer.zero_grad()
    pred, hidden = rnn(inp, hidden)
    loss = loss_func(pred, out)
    loss.backward(retain_graph=True)
    optimizer.step()
``````

Your `hidden` carries the computation graph from the last iteration; that's why it errors without `retain_graph=True`.

If each iteration is supposed to train on an independent sample, just always pass `hidden=None` as the input to `rnn`.
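
For concreteness, here is a minimal sketch of the training loop with the graph cut between iterations (it reuses `rnn`, `sample`, `loss_func` and an `optimizer` as in the snippet above); either pass `None` every time, or detach the returned hidden state:

``````
hidden = None
for t in range(10):
    inp, out = sample(100)
    inp = Variable(torch.Tensor(inp.reshape((1, -1, 1))))
    out = Variable(torch.Tensor(out.reshape((1, -1, 1))))

    optimizer.zero_grad()
    pred, hidden = rnn(inp, hidden)
    hidden = hidden.detach()   # cut the graph here (or simply pass None above)
    loss = loss_func(pred, out)
    loss.backward()            # no retain_graph=True needed any more
    optimizer.step()
``````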


Many thanks. Makes sense! As far as issue #2 is concerned, would that require a `nn.Linear`?

What’s the error you are seeing with hidden size > 1?

With `hidden_size = 2`, I get the following error:

RuntimeError: input and target have different number of elements: input[1 x 100 x 2] has 200 elements, while target[1 x 100 x 1] has 100 elements at /Users/soumith/miniconda2/conda-bld/pytorch_1503975723910/work/torch/lib/THNN/generic/MSECriterion.c:1234

Oh I see. With `batch_first=True`, the RNN output has shape `[batch, seq_len, hidden_size * num_directions]`, i.e., it directly outputs the hidden state at each time step. So yes, a linear layer `nn.Linear(hidden_size, 1)` should work.

Doc is here for your reference: http://pytorch.org/docs/master/nn.html#torch.nn.RNN
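
A quick shape check makes this concrete (a small sketch assuming the variables from the snippet above, with `seq_len = 100`):

``````
pred, hidden = rnn(inp, None)
print(pred.size())     # (1, 100, hidden_size)        -- one hidden state per time step (batch_first=True)
print(hidden.size())   # (num_layers, 1, hidden_size) -- only the final time step
``````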


Makes sense. However, changing to

``````
rnn = nn.RNN(input_size=input_dim, hidden_size=hidden_size, num_layers=num_layers, batch_first=True)
linear = nn.Linear(hidden_size, 1)
``````

and, in the loop:

``````
pred, hidden = rnn(inp, hidden)
pred = linear(pred)
``````

makes learning much worse. It is definitely slower and doesn't converge well. Even for `hidden_size=1`, the performance is much worse with the addition of `nn.Linear()`.

Did you initialize the linear’s weights properly? A very simple normal init would be

``````
fc.weight.data.normal_(0, 0.02)
fc.bias.data.zero_()
``````

Also, ideally there should be a non-linear activation between them, but I don't think it matters too much in this case.

So, I made the following changes:

``````
linear = nn.Linear(hidden_size, 1, bias=False)
linear.weight.data.normal_(0, 0.02)
``````

However, this makes the model perform even worse!

Thanks for your efforts. Before starting this example, I'd thought it would be trivial. It didn't turn out that way!

I made a mistake: the non-linear layer should be added at the end, since the RNN output already has a non-linearity applied.

Apparently PyTorch does initialization by default, so it will also work if you don't manually initialize. However, I played with different initializations, and the results vary a lot. With an initialization close to zero (e.g., the one I gave), the results are quite bad. But they get significantly better, and close to the `hidden_size=1` case, when I use a larger range, e.g. `uniform(-2, 2)`. I also tried initializing the layer as `weight = [0, 1]`, `bias = 0`, i.e. always taking the second element, and also got results as good as `hidden_size=1`.
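
For reference, a hand-initialization like that (assuming `hidden_size = 2` and a `linear = nn.Linear(hidden_size, 1)` layer as above) would look like:

``````
# pick out the second hidden unit only: output = 0 * h[0] + 1 * h[1] + 0
linear.weight.data = torch.Tensor([[0.0, 1.0]])  # weight shape is (out_features, in_features) = (1, 2)
linear.bias.data.zero_()
``````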

In addition, the results vary a lot from run to run as well, indicating that the added layer made the network less stable. This is intuitive, as the input_size and output_size are both only 1, so adding more parameters may unnecessarily complicate the error surface. It also seems that adding `Linear(2, 1)` made the network more sensitive to initialization, which can be a bad thing.

Finally, I found that increasing `num_layers` can improve the results a bit.

Here's what I changed in your code, if you want to give it a try:

``````
input_dim = 1
hidden_size = 2
num_layers = 1
rnn = nn.RNN(input_size=input_dim, hidden_size=hidden_size, num_layers=num_layers, batch_first=True)
tanh = nn.Tanh()
linear = nn.Linear(hidden_size, 1)
linear.weight.data.uniform_(-3, 3)
linear.bias.data.zero_()

loss_func = nn.MSELoss()
# the optimizer wasn't shown in the original snippet; any optimizer over both modules' parameters works
optimizer = optim.Adam(list(rnn.parameters()) + list(linear.parameters()), lr=0.01)

for t in range(1000):
    inp, out = sample(100)
    inp = Variable(torch.Tensor(inp.reshape((1, -1, 1))), requires_grad=True)
    out = Variable(torch.Tensor(out.reshape((1, -1, 1))))
    optimizer.zero_grad()
    pred, hidden = rnn(inp, None)
    pred = tanh(linear(pred.view(-1, hidden_size))).view(1, -1, 1)
    loss = loss_func(pred, out)
    print(t, loss.data)
    loss.backward()
    optimizer.step()
``````

Thanks a ton. I am sure this took effort - fiddling with parameters, activations.

It's hard for me to understand why we aren't learning something very useful here! I calculated the MSE loss between (input, output) and between (pred, output). The MSE is lower for (input, output)! Which means we aren't denoising well enough; in fact, not modifying the input at all would have resulted in a lower MSE.
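
For reference, this is roughly the comparison I did (a sketch reusing the names from the snippets above):

``````
inp, out = sample(100)
inp = Variable(torch.Tensor(inp.reshape((1, -1, 1))))
out = Variable(torch.Tensor(out.reshape((1, -1, 1))))

pred, _ = rnn(inp, None)
pred = tanh(linear(pred.view(-1, hidden_size))).view(1, -1, 1)

baseline_mse = loss_func(inp, out).data[0]   # "do nothing" baseline: noisy input vs clean target
model_mse = loss_func(pred, out).data[0]     # denoised prediction vs clean target
print(baseline_mse, model_mse)               # the model only helps if model_mse < baseline_mse
``````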

Would increasing the number of training samples help? Or, should I make the problem tougher to ensure that the network can learn a better function than identity?

Should we really have two RNNs as is usually done in Seq2seq tasks - one being an encoder, the other being a decoder?

I also feel that this trivial example would work substantially better if we could incorporate the notion of attention.

Based on the inputs in this thread, I wrote a blog post on denoising using RNNs in PyTorch. I feel it could be a useful example to add to the documentation.
https://nipunbatra.github.io/blog/2018/denoising.html

Pinging @smth, @apaszke and team.

Happy to improve the post and get your feedback.