TimeDistributed CNN


I am implementing a paper’s architecture that applies a time-distributed CNN to the input. To clarify, the input has the form (batch_size, time_steps, channels, H, W):
let’s say the input is (32, 100, 1, 128, 128), and after applying a convolution with 16 kernels I get (32, 100, 16, 64, 64).

After reading through the forum, I used the trick of merging the batch_size and time_steps dimensions into a single dimension of size batch_size * time_steps, and then reshaping back once the convolution is done.
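For reference, the reshape trick described above can be sketched roughly as follows. This is only an illustration, not the paper's architecture: the 3x3 kernel, stride 2 and padding 1 are assumptions chosen so that the spatial size halves, and the tensor is much smaller than the (32, 100, 1, 128, 128) example to keep it cheap to run.

```python
import torch
import torch.nn as nn

# Assumed hyperparameters: a 3x3 kernel with stride 2 and padding 1 halves H and W.
conv = nn.Conv2d(in_channels=1, out_channels=16, kernel_size=3, stride=2, padding=1)

batch_size, time_steps, C, H, W = 4, 10, 1, 32, 32
x = torch.rand(batch_size, time_steps, C, H, W)

# Fold time into the batch dimension, convolve, then unfold again.
y = conv(x.reshape(batch_size * time_steps, C, H, W))
y = y.reshape(batch_size, time_steps, *y.shape[1:])

print(y.shape)  # torch.Size([4, 10, 16, 16, 16])
```

The same conv weights are applied to every (batch, time) slice, which is why the parameter count does not grow with time_steps.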

Okay, I tried this; the shapes are correct and the code runs fine. However, when I checked the number of parameters in my model, it was lower than the number reported in the paper. I believe the reason is that the same kernel weights are used across the whole batch and all time_steps.

This would mean that merging the dimensions isn’t actually performing a time-distributed convolution. Any insights on how to solve this?

Thank you

Great news! I think I finally got the result I needed using nn.ModuleList():

import torch
import torch.nn as nn

class TimeDistributed(nn.Module):
    def __init__(self, layer, time_steps, *args):
        super(TimeDistributed, self).__init__()
        # One independent copy of the layer per time step (weights are NOT shared).
        self.layers = nn.ModuleList([layer(*args) for i in range(time_steps)])

    def forward(self, x):
        batch_size, time_steps, C, H, W = x.size()
        output = torch.tensor([], device=x.device)
        for i in range(time_steps):
            output_t = self.layers[i](x[:, i, :, :, :])
            output_t = output_t.unsqueeze(1)  # originally posted as y.unsqueeze(1), a typo
            output = torch.cat((output, output_t), 1)
        return output

I checked this by counting the number of parameters after applying Conv2d over 100 time_steps:

x = torch.rand(20, 100, 1, 5, 9)

# Positional args: time_steps=100, then Conv2d(in=1, out=8, kernel=(3, 3), stride=2, padding=1); bias defaults to True
model = TimeDistributed(nn.Conv2d, 100, 1, 8, (3, 3), 2, 1)
output = model(x)

print(output.size())   ## (20, 100, 8, 3, 5)

print("number of parameters : ", sum([p.numel() for p in model.parameters()]))

## number of parameters :  8000, instead of 80

Or with batch normalization:

x = torch.rand(20, 100, 8, 3, 5)
model = TimeDistributed(nn.BatchNorm2d, 100, 8)

Hey @ilyes. Thanks for the implementation. I was wondering what y is in your forward function? Also, does it mean that the activation has to be time distributed, i.e., one for every conv2d in the ModuleList?

It’s a mistake, it should be output_t instead!
The activation has no parameters to learn, so I believe you can just apply it to the final output of your distributed layer.
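To illustrate the point about activations, here is a small check (a sketch, using nn.ReLU as the example activation) that applying an element-wise activation once to the whole 5-D output gives the same result as applying it separately at each time step:

```python
import torch
import torch.nn as nn

relu = nn.ReLU()
x = torch.randn(2, 5, 8, 3, 5)  # (batch, time, channels, H, W)

# Apply the activation once to the whole tensor...
whole = relu(x)
# ...or step by step along the time dimension.
per_step = torch.stack([relu(x[:, t]) for t in range(x.size(1))], dim=1)

print(torch.equal(whole, per_step))  # True
```

This holds because ReLU acts on each element independently and has no learnable parameters.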


Hey @ilyes, thanks for this solution. Is it possible to also “pack” these inside TimeDistributed:

TimeDistributed(nn.ReLU(inplace=True), time_steps = 20)
TimeDistributed(nn.Dropout(p=0.5), time_steps = 20)


This is great, but I have two questions:
1) Can you just create one layer, instead of a list of time_steps copies of the same layer, and loop the input through that single layer? That should be the same, right?
2) If you compare the number of parameters between Keras TimeDistributed and this implementation, you see that it increases quite a lot. It seems this PyTorch version creates time_steps separate layers.

  1. For an equivalent of Keras TimeDistributed you want indeed just a single module.
  2. Yes, as you noted, duplicating the module might not be the right thing.
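The difference in parameter count between the two approaches can be checked directly. The sketch below reuses the Conv2d configuration from earlier in the thread (1 input channel, 8 output channels, 3x3 kernel); `count_params` is just a helper name introduced for this example.

```python
import torch.nn as nn

def count_params(module):
    # Total number of elements across all parameters of the module.
    return sum(p.numel() for p in module.parameters())

time_steps = 100

# One module shared across all time steps (Keras TimeDistributed style).
shared = nn.Conv2d(1, 8, (3, 3), stride=2, padding=1)

# One independent module per time step.
duplicated = nn.ModuleList(
    [nn.Conv2d(1, 8, (3, 3), stride=2, padding=1) for _ in range(time_steps)]
)

print(count_params(shared))      # 80  (8*1*3*3 weights + 8 biases)
print(count_params(duplicated))  # 8000
```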

As an aside, there is an old issue requesting this feature: https://github.com/pytorch/pytorch/issues/1927

Best regards


Yes, using the implementation from that old issue saves on the number of parameters, but I am not sure it is the right way either: it basically reshapes the input so that the module can be applied, and then reshapes the result back to the original layout. So you might lose some temporal information this way (?), and with it the purpose of the time-distributed module.
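One way to probe this concern is to compare the reshape approach against an explicit loop over time steps that uses the same module. The sketch below assumes a plain Conv2d with arbitrary sizes; if the two outputs match, the reshape is not mixing information across time steps.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
conv = nn.Conv2d(2, 5, 3, padding=1)
x = torch.rand(3, 10, 2, 8, 8)  # (batch, time, channels, H, W)
bs, seq_len = x.shape[:2]

with torch.no_grad():
    # Reshape approach: fold time into the batch, apply, unfold.
    folded = conv(x.reshape(bs * seq_len, *x.shape[2:]))
    folded = folded.reshape(bs, seq_len, *folded.shape[1:])
    # Loop approach: apply the same module to each time slice.
    looped = torch.stack([conv(x[:, t]) for t in range(seq_len)], dim=1)

same = torch.allclose(folded, looped, atol=1e-5)
print(same)  # True
```

Note that this equivalence is for layers that treat each sample independently. A layer such as BatchNorm2d in training mode computes statistics over the batch it receives, so folding time into the batch and looping over time steps would normalize over different sets of samples.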

I will train two models using both versions and see which leads to better results or makes more sense

What is the main focus of this implementation, and how does it differ from nn.Conv3d?
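As a partial answer to the nn.Conv3d question, here is a sketch under assumed sizes: a Conv3d whose kernel has depth 1 along the time axis, with weights copied from a Conv2d, produces the same output as the weight-shared time-distributed Conv2d. A Conv3d with kernel depth greater than 1 would instead mix neighbouring time steps.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
conv2d = nn.Conv2d(2, 5, kernel_size=3, padding=1)
conv3d = nn.Conv3d(2, 5, kernel_size=(1, 3, 3), padding=(0, 1, 1))

# Copy the 2-D weights into the 3-D layer (kernel depth 1 along time).
with torch.no_grad():
    conv3d.weight.copy_(conv2d.weight.unsqueeze(2))
    conv3d.bias.copy_(conv2d.bias)

x = torch.rand(3, 10, 2, 8, 8)  # (batch, time, channels, H, W)
bs, seq_len = x.shape[:2]

with torch.no_grad():
    # Time-distributed Conv2d via the reshape approach.
    td = conv2d(x.reshape(bs * seq_len, *x.shape[2:]))
    td = td.reshape(bs, seq_len, *td.shape[1:])
    # Conv3d expects (batch, channels, time, H, W), so permute in and out.
    out3d = conv3d(x.permute(0, 2, 1, 3, 4)).permute(0, 2, 1, 3, 4)

match = torch.allclose(td, out3d, atol=1e-5)
print(match)  # True
```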


Hey @ilyes, I think your implementation is different from the Keras TimeDistributed layer. You duplicated the layers for each time step, which means different processing is going on at different time steps. However, when I tried the Keras TimeDistributed layer, the number of trainable parameters was different from your implementation:

Am I missing something?

I implemented a version myself that supports multiple arguments:

class TimeDistributed(nn.Module):
    "Applies a module over tdim identically for each step"
    def __init__(self, module, low_mem=False, tdim=1):
        super(TimeDistributed, self).__init__()
        self.module = module
        self.low_mem = low_mem
        self.tdim = tdim

    def forward(self, *args, **kwargs):
        "input x with shape:(bs,seq_len,channels,width,height)"
        if self.low_mem or self.tdim != 1:
            return self.low_mem_forward(*args, **kwargs)
        # the batched path only supports tdim=1
        inp_shape = args[0].shape
        bs, seq_len = inp_shape[0], inp_shape[1]
        # fold time into the batch dimension (view requires contiguous inputs)
        out = self.module(*[x.view(bs*seq_len, *x.shape[2:]) for x in args], **kwargs)
        out_shape = out.shape
        return out.view(bs, seq_len, *out_shape[1:])

    def low_mem_forward(self, *args, **kwargs):
        "input x with shape:(bs,seq_len,channels,width,height)"
        tlen = args[0].shape[self.tdim]
        args_split = [torch.unbind(x, dim=self.tdim) for x in args]
        out = []
        for i in range(tlen):
            # apply the shared module to the i-th time slice of every input
            out.append(self.module(*[a[i] for a in args_split], **kwargs))
        return torch.stack(out, dim=self.tdim)

    def __repr__(self):
        return f'TimeDistributed({self.module})'

You use it like this:

tdconv = TimeDistributed(nn.Conv2d(2, 5, 3, 1, 1), tdim=1)

and then feed it a tensor with dimensions (bs, seq_len, ch, h, w); tdim tells the module which dimension is the time dimension:

tdconv(torch.rand(3, 10, 2, 8, 8))

It has two forward paths: one that folds time into the batch dimension and runs the whole computation at once, and another that uses a for loop over the time steps.

  • I would not mind some help converting this to TorchScript (torch.jit.script)

Hi there, I’m not sure that using a for-loop will save memory… I think the two will have the same memory cost in training.

I wrote one for myself too…

class TimeDistributed(nn.Module):

    def __init__(self, module, batch_first=False):
        super(TimeDistributed, self).__init__()
        self.module = module
        self.batch_first = batch_first

    def forward(self, x):
        ''' x size: (batch_size, time_steps, in_channels, height, width) '''
        batch_size, time_steps, C, H, W = x.size()
        c_in = x.view(batch_size * time_steps, C, H, W)
        c_out = self.module(c_in)
        r_in = c_out.view(batch_size, time_steps, -1)
        if self.batch_first is False:
            r_in = r_in.permute(1, 0, 2)
        return r_in

They don’t.
Try passing a resnet50 with a large batch size. Also, the unbind-based forward can deal with non-contiguous tensors, which the simple view-based forward can’t.
I really want a native ConvLSTM/ConvGRU module in PyTorch…
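The point about non-contiguous tensors can be illustrated with a small sketch (shapes are arbitrary): a time-major tensor transposed to batch-major is non-contiguous, so merging its first two dimensions with view fails, while reshape (which copies when needed) and unbind both work.

```python
import torch

# A time-major tensor (seq_len, bs, C, H, W) transposed to batch-major
# shares memory with the original and is no longer contiguous.
x = torch.rand(10, 3, 2, 8, 8).transpose(0, 1)
print(x.is_contiguous())  # False

try:
    x.view(3 * 10, 2, 8, 8)
    view_ok = True
except RuntimeError:
    view_ok = False
print(view_ok)  # False

# reshape copies when needed; unbind never has to merge dimensions.
flat = x.reshape(3 * 10, 2, 8, 8)
steps = torch.unbind(x, dim=1)
print(flat.shape, len(steps), steps[0].shape)
```

Calling .contiguous() before view is another option, at the cost of the same copy that reshape would make.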

Thanks for your reply. Indeed, the unbind-based forward can deal with non-contiguous memory…

I haven’t got it yet. Is the module you created working correctly? Did the reshaping work as expected?

I think that, for Conv layers, @ilyes’s implementation is different from the one in Keras: the Keras documentation says that all time steps share the same weights, whereas that implementation creates a new instance for each time step. I would prefer the reshape one.