Seq2seq training - why is (seq_len) timesteps used with nn.GRU?

I’m going through the very useful tutorial by PyTorch on Seq2Seq training.

Given that nn.GRU takes input of shape (seq_len, batch, input_size), why does the encoder loop over N (seq_len) timesteps during training?

    for ei in range(input_length):
        encoder_output, encoder_hidden = encoder(
            input_tensor[ei], encoder_hidden)
        encoder_outputs[ei] = encoder_output[0, 0]

From the docs, GRU can take a batched input of the above shape, so shouldn’t it work without looping over the N timesteps?

    encoder_output, encoder_hidden = encoder(input_tensor)

Doesn’t calling GRU(input) take care of all the timesteps? If it were just one GRU cell (nn.GRUCell) then I could sort of understand why the above method is used, but it’s using a GRU layer instead.

Hm, I also do not think that the loop for the encoder here is necessary (and it probably wouldn’t work correctly if the GRU layer were bidirectional). I would almost argue that this loop wasn’t in the first version of the tutorial, since I used it as a baseline for my own code, and I give the whole sequence at once to the GRU layer.

Doing it with a loop obviously allows you to modify the hidden state after each timestep (if there are meaningful ways to do that), but here it’s not needed.
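
For what it’s worth, here is a minimal sketch (the sizes and dummy input are made up) of passing the whole sequence to nn.GRU in one call instead of looping over timesteps:

    import torch
    import torch.nn as nn

    vocab_size, hidden_size, seq_len = 100, 256, 10   # arbitrary sizes for illustration

    embedding = nn.Embedding(vocab_size, hidden_size)
    gru = nn.GRU(hidden_size, hidden_size)            # expects (seq_len, batch, input_size) by default

    input_tensor = torch.randint(0, vocab_size, (seq_len, 1))  # (seq_len, batch=1) of token ids
    embedded = embedding(input_tensor)                         # (seq_len, 1, hidden_size)

    # one call runs all timesteps internally
    encoder_outputs, encoder_hidden = gru(embedded)
    print(encoder_outputs.shape)  # torch.Size([10, 1, 256]) - one output per timestep
    print(encoder_hidden.shape)   # torch.Size([1, 1, 256])  - final hidden state

The per-timestep outputs are all in encoder_outputs anyway, so the explicit loop really only buys you the ability to tinker with the hidden state between steps.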

A question regarding the input length to the encoder used here:

The encoder consists of a certain number (hidden_dim) of GRU cells, which means the maximum length of the source sentence can’t be higher than hidden_dim, right? What happens if the source sentence is longer than that?

The encoder doesn’t consist of hidden_dim GRU cells. The input length is not related to hidden_dim.

hidden_dim is the size of the hidden state you want, and it results from matrix multiplications on your input. An input of shape (batch_size, seq_len, num_features) will give a GRU output of shape (batch_size, seq_len, hidden_dim).
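
A quick sketch of those shapes (the numbers are arbitrary, and batch_first=True just to match the shapes written above) - the same hidden_dim works for any seq_len, which is why the two are unrelated:

    import torch
    import torch.nn as nn

    batch_size, seq_len, num_features, hidden_dim = 4, 10, 32, 256  # arbitrary

    gru = nn.GRU(input_size=num_features, hidden_size=hidden_dim, batch_first=True)

    x = torch.randn(batch_size, seq_len, num_features)
    output, h_n = gru(x)

    print(output.shape)  # torch.Size([4, 10, 256]) -> (batch_size, seq_len, hidden_dim)
    print(h_n.shape)     # torch.Size([1, 4, 256])  -> (num_layers, batch_size, hidden_dim)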

Thanks for the reply. I’m not really sure I got the idea right then:

I thought each (embedded) word is read in by a single cell of the encoder? E.g. if there is a sentence of 10 words, then I’d need an encoder of at least 10 GRU cells? So every GRU cell holds the information for one word?

Or is the input of a single GRU cell (which I thought is the hidden_dim) all 10 words?

Same for the decoder: I thought the max length of a predicted sentence is the length of the decoder?

Yes, if your encoder input sentence is 10 words long, then you need at least 10 GRU cell steps. Each GRU cell step takes 2 inputs: the current word and a hidden state. This hidden state either comes from the previous timestep or is initialized (for the first timestep).
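
If it helps, here is a rough sketch of that per-timestep view using nn.GRUCell (the sizes are made up). Note that it is the same cell, i.e. the same weights, applied at every timestep; the “10 cells” are just the 10 applications of it when the sequence is unrolled:

    import torch
    import torch.nn as nn

    input_size, hidden_size, seq_len = 32, 256, 10   # arbitrary sizes

    cell = nn.GRUCell(input_size, hidden_size)       # one cell, reused (same weights) at every timestep
    words = torch.randn(seq_len, 1, input_size)      # 10 embedded words, batch of 1
    hidden = torch.zeros(1, hidden_size)             # initial hidden state for the first timestep

    for t in range(seq_len):
        hidden = cell(words[t], hidden)              # current word + previous hidden -> new hidden

    print(hidden.shape)  # torch.Size([1, 256]) - hidden state after reading all 10 words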
