Is there a PyTorch equivalent of Keras' TimeDistributed?

Hi! I used to be a Keras user, and I want to port my functions to PyTorch. I have recently been working on a video classification problem that uses an architecture similar to LRCN, which applies a CNN to extract features from each frame and then uses an LSTM for classification. In Keras, there is a TimeDistributed wrapper that applies a layer to each temporal slice. Does PyTorch have a similar implementation, or how can I achieve the same behavior in this case? Is there an existing PyTorch example of it?

Thanks in advance for your patience and help!!



Off the top of my head, I think the model in Sean Naren’s deepspeech.pytorch does something very similar to what you want to achieve, with its SequenceWise class.

Best regards



Hi, Tom. Thanks for sharing! I’ll look into that!




I developed a PyTorch module that mimics the TimeDistributed wrapper of Keras a few days ago:

import torch.nn as nn

class TimeDistributed(nn.Module):
    def __init__(self, module, batch_first=False):
        super(TimeDistributed, self).__init__()
        self.module = module
        self.batch_first = batch_first

    def forward(self, x):

        if len(x.size()) <= 2:
            return self.module(x)

        # Squash samples and timesteps into a single axis
        x_reshape = x.contiguous().view(-1, x.size(-1))  # (samples * timesteps, input_size)

        y = self.module(x_reshape)

        # Restore the two leading axes of the output
        if self.batch_first:
            y = y.contiguous().view(x.size(0), -1, y.size(-1))  # (samples, timesteps, output_size)
        else:
            y = y.view(-1, x.size(1), y.size(-1))  # (timesteps, samples, output_size)

        return y

Wow, cool! That’s pretty awwwwwesome!!!

Could you give me an example of how to use this module to build a time-distributed CNN + LSTM?

Several images will be processed by the CNN and fed to the LSTM all together.

import torch.nn as nn
import torch.nn.functional as F

class CNN(nn.Module):
    def __init__(self):
        super(CNN, self).__init__()
        self.conv1 = nn.Conv2d(1, 10, kernel_size=5)
        self.conv2 = nn.Conv2d(10, 20, kernel_size=5)
        self.conv2_drop = nn.Dropout2d()
        self.fc1 = nn.Linear(320, 50)
        self.fc2 = nn.Linear(50, 10)

    def forward(self, x):
        x = F.relu(F.max_pool2d(self.conv1(x), 2))
        x = F.relu(F.max_pool2d(self.conv2_drop(self.conv2(x)), 2))
        x = x.view(-1, 320)
        #x = F.relu(self.fc1(x))
        #x = F.dropout(x,
        #x = self.fc2(x)
        #return F.log_softmax(x, dim=1)
        return x

class Combine(nn.Module):
    def __init__(self):
        super(Combine, self).__init__()
        self.cnn = CNN()
        self.rnn = nn.LSTM(320, 10, 2)

    def forward(self, x):
        x = self.cnn(x)
        x = self.rnn(x)
        return F.log_softmax(x, dim=1)
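
For anyone landing here with the same question, here is a minimal sketch of one way to wire a per-frame CNN into an LSTM. It is not the code of anyone in this thread: the class names FrameCNN and CNNLSTM, the 28x28 single-channel frames, the hidden size of 64, and the choice to classify from the last time step are all assumptions made for illustration. The idea is to fold (batch, timesteps) into one axis for the CNN, then unfold before the LSTM. Note that nn.LSTM returns a tuple (output, (h_n, c_n)), so the output has to be unpacked before it goes to log_softmax.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FrameCNN(nn.Module):
    """Per-frame feature extractor: (N, 1, 28, 28) -> (N, 320)."""

    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 10, kernel_size=5)
        self.conv2 = nn.Conv2d(10, 20, kernel_size=5)
        self.conv2_drop = nn.Dropout2d()

    def forward(self, x):
        x = F.relu(F.max_pool2d(self.conv1(x), 2))
        x = F.relu(F.max_pool2d(self.conv2_drop(self.conv2(x)), 2))
        # For 28x28 frames this is 20 channels * 4 * 4 = 320 features
        return x.flatten(start_dim=1)


class CNNLSTM(nn.Module):
    """Hypothetical CNN + LSTM classifier for clips of shape (B, T, C, H, W)."""

    def __init__(self, num_classes=10):
        super().__init__()
        self.cnn = FrameCNN()
        self.rnn = nn.LSTM(input_size=320, hidden_size=64,
                           num_layers=2, batch_first=True)
        self.fc = nn.Linear(64, num_classes)

    def forward(self, x):
        b, t, c, h, w = x.shape
        # Fold batch and time so the CNN sees ordinary 4-D image batches
        feats = self.cnn(x.reshape(b * t, c, h, w))  # (b*t, 320)
        feats = feats.reshape(b, t, -1)              # (b, t, 320)
        out, _ = self.rnn(feats)                     # (b, t, 64)
        logits = self.fc(out[:, -1])                 # use the last time step
        return F.log_softmax(logits, dim=1)


model = CNNLSTM().eval()
clip = torch.randn(4, 6, 1, 28, 28)  # 4 clips, 6 frames each
log_probs = model(clip)
print(log_probs.shape)
```

Classifying from the last time step is only one option; averaging over time steps or using the final hidden state would work with the same reshaping.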

Thanks for sharing. I was thinking of looping over the function; your implementation reminds me that we are in an OO environment. Thanks a lot!

For most cases, this module is no longer needed. The Linear layer (PyTorch's counterpart of Keras' Dense), for example, now supports 3-dimensional inputs.
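
A quick way to check this yourself: nn.Linear acts on the last axis only, so calling it on a (samples, timesteps, input_size) tensor should give the same result as squashing the first two axes, applying the layer, and reshaping back. The sizes below are arbitrary values picked for the sketch.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
linear = nn.Linear(8, 3)
x = torch.randn(4, 5, 8)  # (samples, timesteps, input_size)

# Direct call on the 3-D tensor
y_direct = linear(x)  # (4, 5, 3)

# Manual "time-distributed" version: squash, apply, un-squash
y_manual = linear(x.reshape(-1, 8)).reshape(4, 5, 3)

print(y_direct.shape)
print(torch.allclose(y_direct, y_manual, atol=1e-6))
```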


@miguelvr You are right, the Linear layer now supports 3-dimensional inputs. Thanks!

Is putting a Dense layer after an RNN the same as applying a Dense layer to each time step, though? In the first case, don’t the time steps connect and mix together?
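
As far as I can tell, the time steps do not mix as long as the Linear layer receives the full (samples, timesteps, features) tensor, because it only transforms the last axis. Mixing would happen only if the timesteps and features were flattened into one vector first. Here is a small sketch to check this, with made-up sizes: it perturbs the LSTM output at a single time step and looks at which time steps of the Linear output change.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
rnn = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)
head = nn.Linear(16, 3)

x = torch.randn(2, 5, 8)  # (samples, timesteps, input_size)
with torch.no_grad():
    h, _ = rnn(x)   # (2, 5, 16)
    y = head(h)     # (2, 5, 3)

    # Change the Linear layer's input at time step 2 only
    h_perturbed = h.clone()
    h_perturbed[:, 2] += 1.0
    y_perturbed = head(h_perturbed)

# Per time step: did any sample's output change?
close = torch.isclose(y, y_perturbed, atol=1e-6).all(dim=-1)  # (2, 5)
changed = ~close.all(dim=0)                                   # (5,)
print(changed)
```

Only time step 2 should be flagged, which is the same behaviour you would get from applying the layer to each time step separately.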


@miguelvr Isn’t this still useful for layers other than Linear, though? For example, if the input tensor has shape [sample, frame, image], like video, you may want to apply a convnet module to each time frame. Please correct me if I got this wrong.

Yes, definitely, it can still be useful in other cases.

thanks. I was looking for the timedistributed equivalent in pytorch and found your code…

Hi Miguelvr,

We have been using the TimeDistributed layer that you developed.
I declared the TimeDistributed layer as follows:
1. I declared a linear layer and then passed it to the TimeDistributed layer inside the module:
class CRNN(nn.Module):
    def __init__(self):
        super(CRNN, self).__init__()
        # 1D ConvNet for learning the spectral features
        self.conv1 = nn.Conv1d(in_channels=1, out_channels=128, kernel_size=(32,))
        self.bn1 = nn.BatchNorm1d(128)
        self.maxpool1 = nn.MaxPool1d(kernel_size=1, stride=97)
        self.dropout1 = nn.Dropout(0.3)
        # 1D LSTM for learning the temporal aggregation
        self.lstm = nn.LSTM(input_size=128, hidden_size=128, num_layers=2, dropout=0.3)
        # Fully connected layer
        #self.fc3 = nn.Linear(128, 128)
        #self.bn3 = nn.BatchNorm1d(128)
        # Get posterior probability for target event class
        self.fc4 = nn.Linear(128, 1)
        self.timedist = TimeDistributed(self.fc4)

But my doubt is this: when I print the weight parameters of the network, the TimeDistributed layer's parameters are printed twice, as follows:

fc4.weight torch.Size([1, 128])

fc4.bias torch.Size([1])

timedist.module.weight torch.Size([1, 128])

timedist.module.bias torch.Size([1])

Is this correct, or is there a mistake in the implementation?


Every nn.Linear object has a weight and a bias, so that’s correct.
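
To add a bit of detail on why the names show up twice: in the snippet above, the same nn.Linear object is registered under two attribute paths (fc4 and timedist.module), so listings that go by name, such as state_dict(), report it twice, while the underlying tensors are shared. The sketch below uses a hypothetical Wrapper class as a stand-in for TimeDistributed, since only the way the module is stored matters here. I am assuming the listing in the question came from something like state_dict(); in the PyTorch versions I know, parameters() skips duplicates by default.

```python
import torch.nn as nn


class Wrapper(nn.Module):
    """Stand-in for a TimeDistributed-style wrapper: it only stores a module."""

    def __init__(self, module):
        super().__init__()
        self.module = module

    def forward(self, x):
        return self.module(x)


class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc4 = nn.Linear(128, 1)
        # Same object, now reachable under a second name
        self.timedist = Wrapper(self.fc4)


net = Net()
state_keys = list(net.state_dict().keys())
num_params = sum(p.numel() for p in net.parameters())

print(net.timedist.module is net.fc4)  # one object, two names
print(state_keys)                      # both names are listed
print(num_params)                      # shared tensors are counted once
```

So there is only one set of weights to train. Dropping the separate self.fc4 attribute and writing TimeDistributed(nn.Linear(128, 1)) directly would remove the duplicate names, if they bother you.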

Thank you for your reply

Can you provide a small working example where this works? I have an input of shape (samples, timesteps, channels, width, height). Your code combines all the dimensions except the last one, which becomes the input size as per your x_reshape. It then doesn’t work with any of the layers and gives a size mismatch error.
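
One possible workaround, sketched under my own assumptions rather than taken from the original author: fold only the first two axes (samples and timesteps) and keep every remaining axis as it is, so that modules expecting image-shaped input still work. The name TimeDistributedND is made up for this sketch, and it assumes a batch-first layout.

```python
import torch
import torch.nn as nn


class TimeDistributedND(nn.Module):
    """Hypothetical variant that folds (samples, timesteps) and keeps all other axes."""

    def __init__(self, module):
        super().__init__()
        self.module = module

    def forward(self, x):
        # x: (samples, timesteps, *feature_dims), batch-first layout assumed
        b, t = x.shape[:2]
        y = self.module(x.reshape(b * t, *x.shape[2:]))
        # Restore the two leading axes; keep whatever shape the module produced
        return y.reshape(b, t, *y.shape[1:])


conv = TimeDistributedND(nn.Conv2d(3, 8, kernel_size=3, padding=1))
video = torch.randn(2, 5, 3, 16, 16)  # (samples, timesteps, channels, height, width)
out = conv(video)
print(out.shape)
```

Because the wrapper never looks at the trailing axes, the same class should also work for the 3-D Linear case from earlier in the thread.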

Thanks a lot for your nice explanation. I have a novice confusion: since the batch samples and timesteps are squashed together, won’t that cause a problem for the LSTM's sequential learning? That is, when the sequence is reshaped back to (samples, timesteps, output_size), will it retain the same ordering of timesteps for each sample as before squashing?
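
A small check of the ordering question, with arbitrary sizes: reshaping a contiguous tensor does not move any elements, so squashing (samples, timesteps) and reshaping back with the same sizes returns every time step to its original position. Also, the squashed tensor is only seen by the wrapped per-step module; the LSTM still receives the restored 3-D tensor.

```python
import torch

samples, timesteps, features = 2, 3, 4
x = torch.arange(samples * timesteps * features).reshape(samples, timesteps, features)

# Squash samples and timesteps, then restore them
flat = x.reshape(samples * timesteps, features)
restored = flat.reshape(samples, timesteps, features)

# The round trip is lossless and order-preserving
print(torch.equal(restored, x))

# Row (s * timesteps + t) of the squashed tensor is time step t of sample s
print(torch.equal(flat[1 * timesteps + 2], x[1, 2]))
```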

Did you work out the structure of your network in PyTorch? I am facing exactly the same problem, and I was wondering if you could share the code of the network. I have to develop a CNN+LSTM network for video sequence classification.

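A 2D CNN expects 4D (batched) inputs, so you can pass the 5D tensor (batch, timesteps, channels, height, width) as a 4D tensor by folding batch and timesteps into a single axis, i.e. (batch * timesteps, channels, height, width).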
