Seq2seq in Video summarization

I am a PyTorch beginner and I have read the PyTorch seq2seq tutorial for machine translation. I have a dataset of (video frames, summary) pairs used in video summarization. I want to adjust the encoder class in the tutorial so that it takes a sequence of video frames (each frame is represented as 2048 features extracted from a CNN) instead of a sequence of words as input. One input sequence is of size (NumOfFrames x 2048). How do I go about doing that?

Thanks in advance.

class EncoderRNN(nn.Module):
    def __init__(self, input_size, hidden_size):
        super(EncoderRNN, self).__init__()
        self.hidden_size = hidden_size

        self.embedding = nn.Embedding(input_size, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size)

    def forward(self, input, hidden):
        embedded = self.embedding(input).view(1, 1, -1)
        output = embedded
        output, hidden = self.gru(output, hidden)
        return output, hidden

    def initHidden(self):
        result = Variable(torch.zeros(1, 1, self.hidden_size))
        if use_cuda:
            return result.cuda()
        else:
            return result

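You can do something similar to how embedded is reshaped into a 3D tensor before it is fed to the GRU.

One solution is to flatten the whole sequence into a single time step:
input_sequence = input_sequence.view(1, 1, -1)
Note that this gives the GRU one input vector of size num_of_frames * 2048, so the GRU input size has to match that and every video needs the same number of frames.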

Another approach is to convert it into a 3D tensor without pushing all the features into a single dimension:

input_seq = input_seq.view(1, num_of_frames, -1)

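This layout is (batch, sequence, features), so the GRU has to be created with batch_first=True; otherwise it treats the frames as a batch of one-step sequences. In this case output holds one hidden vector per frame and can no longer be squeezed down to a single vector, so you will have to make some changes in the code structure to accommodate the attention mechanism.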
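To make the shapes concrete, here is a minimal sketch of the second approach. The sizes (5 frames, hidden size 256) and the random features are stand-ins I picked for illustration, and batch_first=True is my assumption about how the GRU would be configured for this layout.

```python
import torch
import torch.nn as nn

# Illustrative sizes only (assumptions, not from the tutorial).
num_of_frames, feature_size, hidden_size = 5, 2048, 256

# One video: (num_of_frames, 2048) CNN features, random stand-in data here.
input_seq = torch.randn(num_of_frames, feature_size)

# (1, num_of_frames, 2048) is a batch-first layout: batch, time, features.
input_seq = input_seq.view(1, num_of_frames, -1)

# batch_first=True tells the GRU that dim 0 is the batch, dim 1 is time.
gru = nn.GRU(feature_size, hidden_size, batch_first=True)
output, hidden = gru(input_seq)

print(output.shape)  # torch.Size([1, 5, 256]): one hidden vector per frame
print(hidden.shape)  # torch.Size([1, 1, 256]): final hidden state
```

The per-frame rows of output are what an attention mechanism would attend over.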

import torch.nn as nn

class EncoderRNN(nn.Module):
    def __init__(self, input_size, hidden_size):
        super(EncoderRNN, self).__init__()
        # input_size is the per-frame feature size (2048 here)
        self.gru = nn.GRU(input_size, hidden_size)

    def forward(self, input):
        # input should be (num_frames, batch_size, input_size)
        output, hidden = self.gru(input)
        return output, hidden

Should work.


  1. You shouldn’t have to manually initialize the hidden state in the encoder RNN: it processes the whole sequence into an encoded representation in one go, so you never have to pass the state back in. If no hidden state is given, nn.GRU starts from zeros.

  2. You may want to pass those 2048-dim features through a few linear layers first, as 2048 dimensions is pretty large for an RNN input.
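Both points can be sketched together as below. This is only one possible variant: the class name ProjectedEncoderRNN, the single Linear + ReLU layer, and the sizes 512 and 256 are my own assumptions for illustration, not something from the tutorial.

```python
import torch
import torch.nn as nn

class ProjectedEncoderRNN(nn.Module):
    """Hypothetical encoder: shrink frame features, then run a GRU."""

    def __init__(self, feature_size=2048, proj_size=512, hidden_size=256):
        super().__init__()
        # Point 2: reduce the 2048-dim CNN features before the RNN.
        self.project = nn.Sequential(
            nn.Linear(feature_size, proj_size),
            nn.ReLU(),
        )
        self.gru = nn.GRU(proj_size, hidden_size)

    def forward(self, frames):
        # frames: (num_frames, batch_size, feature_size)
        projected = self.project(frames)
        # Point 1: no initial hidden state is passed; nn.GRU uses zeros.
        output, hidden = self.gru(projected)
        return output, hidden

encoder = ProjectedEncoderRNN()
frames = torch.randn(30, 1, 2048)  # one video with 30 frames, batch of 1
output, hidden = encoder(frames)
print(output.shape)  # torch.Size([30, 1, 256])
print(hidden.shape)  # torch.Size([1, 1, 256])
```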

Each input is of size (num_frames, 2048). How can I make it of size (num_frames, batch_size, 2048)?

For a single sequence, add a batch dimension of size 1:
input = input.view(-1, 1, 2048)
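As a quick check, here is a small sketch with random stand-in features (30 frames is an arbitrary choice) showing that view(-1, 1, 2048) and unsqueeze(1) both add a batch dimension of size 1 and give the same tensor.

```python
import torch

# Random stand-in for one video's features: (num_frames, 2048).
frames = torch.randn(30, 2048)

a = frames.unsqueeze(1)        # insert a new dim at position 1
b = frames.view(-1, 1, 2048)   # reshape to (num_frames, 1, 2048)

print(a.shape)            # torch.Size([30, 1, 2048])
print(torch.equal(a, b))  # True
```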

If you want to use batches (assuming you’re using a data loader), you would write a collate function to pass to the loader. It takes a batch, pads all of the sequences within the batch to the same length, adds them to a list, then concatenates them on dim 1, giving a tensor of size (max_num_frames, batch_size, 2048).
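A collate function along those lines might look like the sketch below. The name pad_collate, the toy dataset, and the summary format (a placeholder tensor per video) are assumptions for illustration; pad_sequence does the zero-padding and the stacking on dim 1 in one call.

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

def pad_collate(batch):
    """Hypothetical collate_fn; batch is a list of (frames, summary) pairs."""
    frames = [item[0] for item in batch]     # each (num_frames_i, 2048)
    summaries = [item[1] for item in batch]  # left as a list, format unknown
    lengths = torch.tensor([f.size(0) for f in frames])
    # Zero-pad to the longest sequence and stack along dim 1:
    # result is (max_num_frames, batch_size, 2048).
    padded = pad_sequence(frames)
    return padded, lengths, summaries

# Toy data: four videos of different lengths with placeholder summaries.
dataset = [(torch.randn(n, 2048), torch.zeros(n)) for n in (12, 7, 20, 15)]
loader = DataLoader(dataset, batch_size=2, collate_fn=pad_collate)

padded, lengths, summaries = next(iter(loader))
print(padded.shape)  # torch.Size([12, 2, 2048])
print(lengths)       # tensor([12,  7])
```

Keeping the original lengths around is useful later, for example with torch.nn.utils.rnn.pack_padded_sequence, so the GRU can skip the padded frames.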

I’m not sure if the seq2seq tutorial was ever updated to work with batches, so you may have to make some other alterations. It’s quite old, but there are plenty of seq2seq examples floating around GitHub.