Encoder Decoder Network for Videos

I am trying to implement below encoder-decoder model for videos (Paper: Unsupervised Learning of Video Representations using LSTMs ([1502.04681] Unsupervised Learning of Video Representations using LSTMs)),

LSTM

I have gone through Pytorch tutorial on Seq2Seq models for translations, and found in LSTM encoder-decoder models need SOS and EOS tokens.

I have some questions,

  1. In above network also do we need the two tokens, then how to include them.
  2. Are these tokens just a vector of zeroes and ones (which i can manually choose).

Thanks.