Implementing a Transformer Model in PyTorch for Video

Dear Altruist,

I am seeking help writing a Transformer model in PyTorch that meets the following specification:

Input shape: [Batch_size, 10, 512]
Here, ‘10’ represents the video length in frames, and ‘512’ signifies the feature dimension of each frame.

Output shape: [Batch_size, 10, 4]

To provide some context about my challenge, each video comprises 10 frames, with each frame’s features represented as a 512-dimensional vector, so a single video has shape [10, 512]. My objective is to use a Transformer model to process batches of videos, i.e. [Batch_size, 10, 512], and capture the sequential relationships among the 10 frames. The model should then predict a 4-dimensional vector for each of the subsequent 10 frames. Consequently, the desired output shape is [Batch_size, 10, 4].

I am using the following code, but the test loss is not decreasing:

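import math

import torch
import torch.nn as nn


class Transformer(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim, num_heads, num_layers, seq_len):
        super(Transformer, self).__init__()

        self.embedding = nn.Linear(input_dim, hidden_dim)
        self.positional_encoding = PositionalEncoding(hidden_dim)
        self.dim_model = hidden_dim  # 256

        encoder_layer = nn.TransformerEncoderLayer(
            d_model=hidden_dim,
            nhead=num_heads,  # 8
        )
        self.encoder = nn.TransformerEncoder(
            encoder_layer,
            num_layers=num_layers,  # 4
        )
        self.fc = nn.Linear(hidden_dim, output_dim)

    def forward(self, x):
        # x: [batch, seq_len, input_dim]
        x = self.embedding(x)    # [batch, seq_len, hidden_dim]
        x = x.permute(1, 0, 2)   # [seq_len, batch, hidden_dim]; the encoder is sequence-first by default
        # The positional encoding indexes dimension 0 by position, so it is applied after the permute.
        x = self.positional_encoding(x)

        x = self.encoder(x)
        x = x.permute(1, 0, 2)   # back to [batch, seq_len, hidden_dim]

        x = self.fc(x)           # [batch, seq_len, output_dim]

        return x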

class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len=5000):
        super(PositionalEncoding, self).__init__()
        self.dropout = nn.Dropout(p=0.1)

        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(0).transpose(0, 1)  # [max_len, 1, d_model]
        self.register_buffer('pe', pe)

    def forward(self, x):
        # x is expected to be sequence-first: [seq_len, batch, d_model]
        x = x + self.pe[:x.size(0), :]
        return self.dropout(x)
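
A note on the positional encoding above: because `pe` is stored with shape [max_len, 1, d_model] and sliced with `x.size(0)`, the tensor passed in has to be sequence-first. The following is a minimal sketch of what happens otherwise. It uses a stand-in table (position p is filled with the value p) instead of the real sinusoids, and the sizes are arbitrary, so treat it as an illustration only.

```python
import torch

d_model, max_len = 8, 50
batch, frames = 3, 10

# Stand-in for the sinusoidal table, shape [max_len, 1, d_model];
# row p holds the constant p so the added offset is easy to read.
pe = torch.arange(max_len, dtype=torch.float).view(max_len, 1, 1).expand(max_len, 1, d_model)

x_batch_first = torch.zeros(batch, frames, d_model)  # [batch, frames, d_model]
x_seq_first = x_batch_first.permute(1, 0, 2)         # [frames, batch, d_model]

# Batch-first input: the slice is indexed by batch, so every frame of a
# given video receives the same offset and frame order is not encoded.
wrong = x_batch_first + pe[: x_batch_first.size(0)]

# Sequence-first input: the slice is indexed by frame, as intended.
right = x_seq_first + pe[: x_seq_first.size(0)]

print(torch.equal(wrong[:, 0], wrong[:, -1]))  # first and last frame compared
print(torch.equal(right[0], right[-1]))        # first and last frame compared
```

Both additions broadcast without error, which is why this kind of mix-up does not show up as a shape mismatch.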

I would greatly appreciate your guidance and support in implementing this task.

Thank you.

What’s the input dim for your video? i.e. how big is each frame in terms of WxH, and what representation are you using as input to your model? It would be hard to determine what’s wrong without knowing more details about the specific use case and what you’re trying to predict.

Thank you for your prompt response. I have extracted ResNet50 features for each individual video frame and then transformed them into 256-dimensional feature vectors. I also have the object’s bounding-box coordinates in each of those frames, which I converted into a 256-dimensional feature space and concatenated with the frame’s feature vector, resulting in a 512-dimensional feature vector as input. These vectors are then fed into the Transformer for further processing.
I am trying to predict the bounding-box coordinates of that object for the next 10 frames, i.e. [batch, 10, 4].
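
For readers following along, the input construction described here could look roughly like the sketch below. The post does not say how the features are reduced to 256 dimensions, so the two `nn.Linear` projections, the class name `FrameInputBuilder`, and the 2048-dimensional pooled ResNet50 feature size are assumptions made for illustration.

```python
import torch
import torch.nn as nn


class FrameInputBuilder(nn.Module):
    """Hypothetical helper: projects frame features and boxes, then concatenates them."""

    def __init__(self, resnet_dim=2048, box_dim=4, proj_dim=256):
        super().__init__()
        self.frame_proj = nn.Linear(resnet_dim, proj_dim)  # assumed projection to 256
        self.box_proj = nn.Linear(box_dim, proj_dim)       # assumed projection to 256

    def forward(self, frame_feats, boxes):
        # frame_feats: [batch, frames, resnet_dim], boxes: [batch, frames, box_dim]
        return torch.cat([self.frame_proj(frame_feats), self.box_proj(boxes)], dim=-1)


builder = FrameInputBuilder()
frame_feats = torch.randn(2, 10, 2048)  # random stand-in for ResNet50 features
boxes = torch.rand(2, 10, 4)            # random stand-in for box coordinates
model_input = builder(frame_feats, boxes)
print(model_input.shape)
```

The result has the [Batch_size, 10, 512] layout the Transformer in the question expects.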

I’m assuming this means that the train loss is decreasing. If so, this looks like overfitting: your model architecture seems sound, since it’s able to learn, but you probably have very little training data or are training for too many epochs. Try getting more data, applying some data augmentation, etc.
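
If training for too many epochs is the issue, one common remedy is to stop once the validation loss stops improving. Below is a minimal sketch of such a check. `EarlyStopping` is a hypothetical helper written for this post, not a PyTorch class, and the loss values are made up purely to exercise it.

```python
class EarlyStopping:
    """Signals a stop once validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True when training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Made-up validation losses, only to show how the helper is called
val_losses = [1.00, 0.80, 0.79, 0.81, 0.82, 0.83, 0.85]
stopper = EarlyStopping(patience=3)
stopped_at = None
for epoch, loss in enumerate(val_losses):
    if stopper.step(loss):
        stopped_at = epoch
        break
print(stopped_at, stopper.best)
```

In a real training loop, `step` would be called once per epoch with the loss measured on held-out data, and the weights from the best epoch would typically be saved alongside it.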

This is just a best guess based on what you have described so far.