Neural network with GRU for time series

Hi all,

I am pretty new to PyTorch 0.4. I am currently trying to write an algorithm that takes in a time series with 3 inputs (each time t has three corresponding input values x1, x2, x3) which represent noise channels, and predicts a final output y which is the sum of an unknown signal and a particular function of x1, x2 and x3. You can skip what follows and go directly to the question at the end if you want, but I will give some context.

I am currently using a custom RNN model made of an nn.GRU layer (with tanh activation) followed by an nn.Linear layer with a ReLU activation function.
I am using the Adam optimizer in the following way, and have tried different values of learning rate (1e-2 down to 1e-6) and weight decay without success:

optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate, betas=(0.9, 0.999), eps=1e-09, weight_decay=0.01)
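
For concreteness, the setup described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the class name GRUNet, hidden_size=16, n_layers=1, output_size=1 and lr=1e-3 are placeholders, and applying the ReLU after the linear layer is one reading of the description, not necessarily the exact model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GRUNet(nn.Module):  # hypothetical name, for illustration only
    def __init__(self, input_size=3, hidden_size=16, n_layers=1, output_size=1):
        super().__init__()
        # nn.GRU uses tanh internally for the candidate hidden state
        self.gru = nn.GRU(input_size, hidden_size, n_layers, batch_first=False)
        # the linear layer's input size must match the GRU hidden size
        self.linear = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        # x: (seq_len, batch, input_size) because batch_first=False
        out, _ = self.gru(x)               # (seq_len, batch, hidden_size)
        return F.relu(self.linear(out))    # (seq_len, batch, output_size)


model = GRUNet()
learning_rate = 1e-3  # placeholder; values from 1e-2 down to 1e-6 were tried
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate,
                             betas=(0.9, 0.999), eps=1e-09, weight_decay=0.01)

# Dummy input: 100 time steps, batch of 10 sequences, 3 channels
prediction = model(torch.randn(100, 10, 3))
```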

With inputs x1, x2, x3 normalized, my loss decreases to around 1 and then starts oscillating, but the network output is off by orders of magnitude.

My question regards the use of the module nn.GRU(input_size, hidden_size, n_layers, batch_first=False). I assume that in my case the input_size would be 3, dictated by the fact that for each time t I have x1(t), x2(t), x3(t). The hidden_size equals the nn.Linear input size, and n_layers is my choice.
What should the input data format be when I call my model(…)? I have a matrix X of dimensions [n_input_points, input_size], for example n_input_points=1000 and input_size=3, which I load along with the targets y in batches (say, 10 chunks of 100 time steps each):

shape(X) = [10,100,3]
shape(y) = [10,100,1]
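
One way such shapes could arise is sketched below. It assumes the [1000, 3] matrix is simply cut into consecutive, non-overlapping windows of 100 time steps; the names X_flat, y_flat and seq_len are made up for the example, and the random data only stands in for the real series.

```python
import torch

n_input_points, input_size = 1000, 3
X_flat = torch.randn(n_input_points, input_size)  # stand-in for the real inputs
y_flat = torch.randn(n_input_points, 1)           # stand-in for the real targets

seq_len = 100  # assumed window length
# Cut the series into consecutive windows: (n_windows, seq_len, features)
X = X_flat.view(-1, seq_len, input_size)
y = y_flat.view(-1, seq_len, 1)

print(X.shape, y.shape)
```

With this layout, X[i] is the i-th window, so the first axis indexes windows and the second axis indexes time.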

Am I correct to feed that torch.Tensor input X to the GRU (with batch_first=False)?
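
For reference, a small shape check may help frame the question. Per the nn.GRU documentation, batch_first=False means the input is read as (seq_len, batch, input_size), while batch_first=True means (batch, seq_len, input_size). The sketch below assumes that the first axis of X indexes the 10 chunks and the second axis indexes the 100 time steps; hidden_size=8 is an arbitrary placeholder.

```python
import torch
import torch.nn as nn

X = torch.randn(10, 100, 3)  # assumed layout: (chunks, time steps, channels)

# batch_first=False: X is read as seq_len=10, batch=100
gru_seq_first = nn.GRU(input_size=3, hidden_size=8, num_layers=1, batch_first=False)
out_sf, h_sf = gru_seq_first(X)
print(out_sf.shape, h_sf.shape)  # h_n has shape (num_layers, batch, hidden_size)

# batch_first=True: X is read as batch=10, seq_len=100
gru_batch_first = nn.GRU(input_size=3, hidden_size=8, num_layers=1, batch_first=True)
out_bf, h_bf = gru_batch_first(X)
print(out_bf.shape, h_bf.shape)

# Alternative: keep batch_first=False and swap the first two axes
X_seq_first = X.transpose(0, 1)  # (100, 10, 3) = (seq_len, batch, channels)
out_t, h_t = gru_seq_first(X_seq_first)
print(out_t.shape, h_t.shape)
```

The batch dimension of the returned hidden state shows how each call interpreted the axes: 100 in the first case, 10 in the other two.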