Seq2seq training - why is (seq_len) timesteps used with nn.GRU?

I’m going through the very useful tutorial by PyTorch on Seq2Seq training.

Given that nn.GRU takes input of shape (seq_len, batch, input_size), why does the encoder loop over N (seq_len) timesteps during training?

    for ei in range(input_length):
        encoder_output, encoder_hidden = encoder(
            input_tensor[ei], encoder_hidden)
        encoder_outputs[ei] = encoder_output[0, 0]

From the docs, GRU can take a batched input of the above shape, so shouldn’t it work without looping over the N timesteps?

    encoder_output, encoder_hidden = encoder(input_tensor)

Doesn’t calling GRU(input) take care of all the timesteps? If it were just one GRU cell (nn.GRUCell) then I could sort of understand why the above method is used, but it’s using a GRU layer instead.

Hm, I also do not think that the loop for the encoder here is necessary (and it probably wouldn’t work correctly if the GRU layer were bidirectional). I would almost argue that this loop wasn’t in the first version of the tutorial, since I used it as a baseline for my own code, and I give the whole sequence at once to the GRU layer.

Doing it with a loop obviously allows you to modify the hidden state after each timestep (if there are meaningful ways to do that), but here it’s not needed.
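
For what it’s worth, here is a minimal sketch (the sizes and dummy input are made up) of passing the whole sequence to nn.GRU in one call instead of looping over timesteps:

    import torch
    import torch.nn as nn

    vocab_size, hidden_size, seq_len = 100, 256, 10   # arbitrary sizes for illustration

    embedding = nn.Embedding(vocab_size, hidden_size)
    gru = nn.GRU(hidden_size, hidden_size)            # expects (seq_len, batch, input_size) by default

    input_tensor = torch.randint(0, vocab_size, (seq_len, 1))  # (seq_len, batch=1) of token ids
    embedded = embedding(input_tensor)                         # (seq_len, 1, hidden_size)

    # one call runs all timesteps internally
    encoder_outputs, encoder_hidden = gru(embedded)
    print(encoder_outputs.shape)  # torch.Size([10, 1, 256]) - one output per timestep
    print(encoder_hidden.shape)   # torch.Size([1, 1, 256])  - final hidden state

The per-timestep outputs are all in encoder_outputs anyway, so the explicit loop really only buys you the ability to tinker with the hidden state between steps.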

A question regarding the input length to the encoder used here:

The encoder consists of a certain number (hidden_dim) of GRU cells, which means the maximum length of the source sentence can’t be higher than hidden_dim, right? What happens if the source sentence is longer than that?

The encoder doesn’t consist of hidden_dim GRU cells. The input length is not related to hidden_dim.

hidden_dim is the size of the hidden state you want, and it results from matrix multiplications on your input. An input of shape (batch_size, seq_len, num_features) will give a GRU output of shape (batch_size, seq_len, hidden_dim).
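
A quick sketch of those shapes (the numbers are arbitrary, and batch_first=True just to match the shapes written above) - the same hidden_dim works for any seq_len, which is why the two are unrelated:

    import torch
    import torch.nn as nn

    batch_size, seq_len, num_features, hidden_dim = 4, 10, 32, 256  # arbitrary

    gru = nn.GRU(input_size=num_features, hidden_size=hidden_dim, batch_first=True)

    x = torch.randn(batch_size, seq_len, num_features)
    output, h_n = gru(x)

    print(output.shape)  # torch.Size([4, 10, 256]) -> (batch_size, seq_len, hidden_dim)
    print(h_n.shape)     # torch.Size([1, 4, 256])  -> (num_layers, batch_size, hidden_dim)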

Thanks for the reply. I’m not really sure I got the idea right then:

I thought each (embedded) word is read in by a single cell of the encoder? E.g. if there is a sentence of 10 words, then I’d need an encoder of at least 10 GRU cells? So every GRU cell holds the information for one word?

Or is the input of a single GRU cell (which I thought is the hidden_dim) all 10 words?

Same for the decoder: I thought the max length of a predicted sentence is the length of the decoder?

Yes, if your encoder input sentence is 10 words long, then you need at least 10 GRU cell steps. Each GRU cell step takes 2 inputs: the current word and a hidden state. This hidden state either comes from the previous timestep or is initialized (for the first timestep).
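
If it helps, here is a rough sketch of that per-timestep view using nn.GRUCell (the sizes are made up). Note that it is the same cell, i.e. the same weights, applied at every timestep; the “10 cells” are just the 10 applications of it when the sequence is unrolled:

    import torch
    import torch.nn as nn

    input_size, hidden_size, seq_len = 32, 256, 10   # arbitrary sizes

    cell = nn.GRUCell(input_size, hidden_size)       # one cell, reused (same weights) at every timestep
    words = torch.randn(seq_len, 1, input_size)      # 10 embedded words, batch of 1
    hidden = torch.zeros(1, hidden_size)             # initial hidden state for the first timestep

    for t in range(seq_len):
        hidden = cell(words[t], hidden)              # current word + previous hidden -> new hidden

    print(hidden.shape)  # torch.Size([1, 256]) - hidden state after reading all 10 words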
