Regarding GRU training for seq-to-seq models (autoencoder)

I am learning how to build a sequence-to-sequence model from a variety of sources, and many of them feed the GRU unit one element of a sequence at a time (for instance, a series of one-hot encoded words in a sentence), alongside a hidden state that updates at each pass. Here is one of the tutorials I am referring to.

I am trying to replicate this in batch training but I am a bit confused by the documentation for GRUs. Given that we train RNNs by iterating through each token in a sequence, when would the sequence length ever be greater than one?

My data is arranged so that batches are fed into the encoder in chunks of shape [batch_size, maximum_token_length, num_features/one_hot_labels], with batch_first = True on the GRUs.

Given this, what would be the difference between these two blocks of code?

for i in range(max_input_length):
    latent, hidden = encoder(this_batch[:, i, :].reshape(batch_size, 1, feature_size), hidden)

# vs

latent, hidden = encoder(this_batch, hidden)

Does the GRU just iterate through the sequence internally, the same way I would with the for loop? Comparing the output values from both methods, this clearly seems to be the case. So why use one over the other?
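
Here is the kind of comparison I mean (a toy nn.GRU with made-up sizes, not my actual encoder; both ways give matching outputs):

import torch
import torch.nn as nn

torch.manual_seed(0)
batch_size, seq_len, feature_size, hidden_size = 4, 7, 10, 16
gru = nn.GRU(feature_size, hidden_size, batch_first=True)

x = torch.randn(batch_size, seq_len, feature_size)
h0 = torch.zeros(1, batch_size, hidden_size)   # (num_layers*num_directions, batch, hidden)

# Feed the whole sequence at once: the GRU runs the recurrence internally.
out_full, h_full = gru(x, h0)                  # out_full: (batch, seq_len, hidden)

# Feed one time step per call, carrying the hidden state manually.
h = h0
steps = []
for i in range(seq_len):
    step_in = x[:, i:i+1, :].contiguous()      # (batch, 1, feature)
    out_step, h = gru(step_in, h)
    steps.append(out_step)
out_loop = torch.cat(steps, dim=1)

print(torch.allclose(out_full, out_loop, atol=1e-6))   # True
print(torch.allclose(h_full, h, atol=1e-6))            # True
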

Here is my so-far unsuccessful training loop for an autoencoder, with some notes on my thought process. The loss decreases incredibly slowly despite running on a GPU with a batch size of 32.

Am I fundamentally misunderstanding something here, or am I just being impatient with the training time? I’m getting meaningless reconstructions of sequences. I’m not sure if passing the latent representation from the encoder into the decoder at each time step is correct, and I am presuming the blank initialized hidden state would represent the start-of-sequence token.

Any thoughts or help would be greatly appreciated! Thanks a lot.

Hi! I can offer at least a couple of thoughts:

  • I have a working implementation of an RNN-based autoencoder you can have a look at. It’s a bit verbose since I looked into supporting a Variational Autoencoder, which requires flattening the hidden state of the encoder. If your encoder and decoder have the same architecture (i.e., the hidden states have the same shape), then you can copy the last hidden state of the encoder over as the first hidden state of the decoder.

  • I’m a bit confused why you have a decoder.initHidden() call in your code. In the basic Seq2Seq setup, the hidden state of the decoder gets directly initialized with the last hidden state of the encoder. I’m also not really sure what latent as the output of your encoder is, particularly compared to hidden (in latent, hidden = encoder(...)).

  • You have 2 reshape() calls. Make sure that these calls do exactly what you intend them to do. Sure, they fix your dimensions, but they are likely to mess up your data; see the toy example after this list. I’ve written a more detailed forum post on this since it’s such a common occurrence. Without really checking it, my money is on your reshape() calls being wrong :).

  • From my experience, an Autoencoder for text can be difficult to train – a Variational Autoencoder even more so. I have seen so many epochs where the loss goes down only slowly until, at some point, something “snaps in” and the loss goes down much more significantly.
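
Not your code, of course, but here is a toy example of the kind of thing I mean with reshape(): both shapes “fit”, yet only one of them keeps the tokens intact.

import torch

# A batch_first tensor of shape (batch=2, seq_len=3, features=4)
x = torch.arange(2 * 3 * 4).view(2, 3, 4)

# "Fixing" the shape with reshape() only reinterprets the memory layout,
# silently mixing tokens from different sequences...
wrong = x.reshape(3, 2, 4)

# ...whereas permute() actually swaps the axes and keeps each token intact.
right = x.permute(1, 0, 2)

print(torch.equal(wrong, right))   # False -- reshape has scrambled the data
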

I hope that helps a bit.

EDIT: I just saw this sentence: “I’m not sure if passing the latent representation from the encoder into the decoder at each time step is correct, and I am presuming the blank initialized hidden state would represent the start-of-sequence token.”

Again, the basic setup is that your input runs through the encoder. This gives you the last hidden state, which you then use as the initial hidden state of the decoder. There is no passing of hidden states between the encoder and decoder at each time step. At least not in the basic encoder-decoder setup; I’m sure there are more advanced techniques that do fancier stuff. I’ve added below a couple of my lecture slides to visualize the idea.
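
In code, the hand-over boils down to something like the sketch below (hypothetical module names and sizes, not taken from your code): the encoder's last hidden state is used exactly once, as the decoder's initial hidden state.

import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.gru = nn.GRU(input_size, hidden_size, batch_first=True)

    def forward(self, x):
        outputs, hidden = self.gru(x)        # hidden: (1, batch, hidden_size)
        return outputs, hidden

class Decoder(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()
        self.gru = nn.GRU(input_size, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, output_size)

    def forward(self, x, hidden):
        output, hidden = self.gru(x, hidden)
        return self.out(output), hidden

encoder = Encoder(input_size=10, hidden_size=16)
decoder = Decoder(input_size=10, hidden_size=16, output_size=10)

src = torch.randn(4, 7, 10)                  # (batch, seq_len, features)
_, enc_hidden = encoder(src)
dec_input = torch.zeros(4, 1, 10)            # e.g. a start-of-sequence token
dec_out, dec_hidden = decoder(dec_input, enc_hidden)   # encoder hidden used once, here
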



Thanks so much for the reply!

I kept cracking at it and eventually figured out that the main issue was that I was using the decoder incorrectly. Instead, I added a “[BEGIN]” token to each sequence, used that as the initial input, and had the hidden state initialized from the latent state of the encoder (I have two outputs from my encoder: the hidden state and a compressed vector of each output of the GRU). I suppose I can use both the latent output vector of the encoder and the final hidden state, or just one of them, as the initial hidden state of the decoder.

Then, depending on whether there’s teacher forcing or not, either the ground-truth token or the decoder’s own output becomes the next input, along with the new hidden state.
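
Roughly, my decode loop now looks something like this (simplified, with made-up names, not my exact code):

import random
import torch

def decode(decoder, enc_hidden, begin_token, target_seq, teacher_forcing_ratio=0.5):
    # begin_token: (batch, 1, features)  the "[BEGIN]" input
    # target_seq:  (batch, seq_len, features)  ground-truth sequence
    # enc_hidden:  (1, batch, hidden)  initial hidden state taken from the encoder
    hidden = enc_hidden
    step_input = begin_token
    outputs = []
    for t in range(target_seq.size(1)):
        step_out, hidden = decoder(step_input, hidden)   # step_out: (batch, 1, features)
        outputs.append(step_out)
        if random.random() < teacher_forcing_ratio:
            step_input = target_seq[:, t:t+1, :]         # teacher forcing: feed the ground truth
        else:
            step_input = step_out.detach()               # feed the decoder's own output
    return torch.cat(outputs, dim=1)
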

Now the loss goes down quite quickly at first… then it just explodes.

I will have to look more closely at my reshape calls. Good catch on that. That might be the main issue now.

I’m not sure what you mean by latent, as “latent” basically means hidden :).

But again, this all depends on the architecture you implement for the decoder, and this can be anything. I was just referring to the most basic RNN-based encoder-decoder architecture (without attention or anything). As a rule of thumb, the encoder gives some kind of “summary” of the input to the decoder. For the basic architecture, that’s simply the last hidden state.

Essentially I have two outputs from the encoder:
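
(Roughly like the sketch below; simplified, not my exact code.)

import torch
import torch.nn as nn

class GRUEncoder(nn.Module):
    # Two outputs: a compressed "latent" per time step and the GRU hidden state.
    def __init__(self, input_size, hidden_size, latent_size):
        super().__init__()
        self.gru = nn.GRU(input_size, hidden_size, batch_first=True)
        self.to_latent = nn.Linear(hidden_size, latent_size)

    def forward(self, x, hidden=None):
        output, hidden = self.gru(x, hidden)   # output: (batch, seq_len, hidden_size)
        latent = self.to_latent(output)        # latent: (batch, seq_len, latent_size)
        return latent, hidden                  # hidden is re-fed on the next call
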


One is the GRU output and the other is the updated hidden vector that’s re-fed at each iteration. I guess at some point in my education I learned that the compressed bottleneck layer of an autoencoder is the latent space, with the hidden memory state of an RNN being its own separate concept.

That aside, my model is converging, but still quite slowly (loss from 110 to 80 over 10 hours on a Tesla V100, though that’s only about 20 epochs). You were totally right about my reshape calls being incorrect, and after checking all the data flows, everything is now functioning as intended. At this point I feel that the training loop is correct, and now I’m just trying to speed things up, since my goal is to have a working autoencoder, and eventually a VAE, in a tractable amount of time.

I noticed in your code you have options for bidirectionality and an embedding layer. I think I will add an embedding layer next, since that seems to help a lot, and then add an attention mechanism (and maybe eventually move toward a Transformer-style model). Is there more I can do to keep improving the model? Can I train on multiple GPUs? It’s hard to get a sense of whether I truly just have to train for days or whether I’m doing this inefficiently, but I can’t find good “time” benchmarks online anywhere. Should I make my batch size as large as my GPU can handle?

Sorry for the rapid-fire questions, I feel that I’m really close now! Thanks again!!!

OK, your latent is the output of the GRU layer pushed through a linear layer. Well, output contains all hidden states over the whole sequence. I’m not quite sure how you utilize it, since the shape of output is (batch_size, seq_len, num_directions*hidden_dim) (OK, num_directions=1 in your case).

So you give each decoder time step some information about the whole input sequence. This is some kind of attention. See the slides below; I think you’re doing something like this (not the same, though!):
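
In rough code terms, I mean something like the sketch below (made-up names, not your exact setup), where a fixed summary of the encoder outputs is concatenated to the decoder input at every single step.

import torch
import torch.nn as nn

batch_size, vocab_size, latent_size, hidden_size = 4, 10, 8, 16

decoder_gru = nn.GRU(vocab_size + latent_size, hidden_size, batch_first=True)
project = nn.Linear(hidden_size, vocab_size)

def decoder_step(token, summary, hidden):
    # token:   (batch, 1, vocab_size)   current decoder input
    # summary: (batch, 1, latent_size)  fixed summary of the whole input sequence
    # hidden:  (1, batch, hidden_size)  decoder hidden state
    step_in = torch.cat([token, summary], dim=-1)
    out, hidden = decoder_gru(step_in, hidden)
    return project(out), hidden

# Every decoder step sees the same summary of the full input sequence.
summary = torch.randn(batch_size, 1, latent_size)
token = torch.zeros(batch_size, 1, vocab_size)
hidden = torch.zeros(1, batch_size, hidden_size)
logits, hidden = decoder_step(token, summary, hidden)
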