Timedistributed CNN

ilyes · July 26, 2019, 10:24am

Hello,

I am implementing a paper’s architecture that does Time distributed CNN over the input. For the sake of clarification and with the input in the form of (batch_size, time_steps, channels, H, W):
let’s say the input is (32, 100, 1, 128, 128) and after applying the convolution with 16 kernels I get (32, 100, 16, 64, 64).

After reading through the forum, I used the trick of merging the batch_size and time_steps dimensions into a single dimension, then reshaping back once the convolution is done.
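The reshape trick described above can be sketched as follows. This is only an illustration, not the paper's code: the 3x3 kernel, stride 2 and padding 1 are assumptions chosen so that 128x128 maps to 64x64, and the batch and sequence sizes are reduced to keep it quick.

```python
import torch
import torch.nn as nn

# Assumed hyperparameters: 3x3 kernel, stride 2, padding 1 (128 -> 64)
conv = nn.Conv2d(1, 16, kernel_size=3, stride=2, padding=1)

x = torch.rand(4, 10, 1, 128, 128)       # (batch, time, C, H, W)
b, t, c, h, w = x.shape

folded = x.reshape(b * t, c, h, w)       # merge batch and time into one dimension
y = conv(folded)                         # (b * t, 16, 64, 64)
y = y.reshape(b, t, *y.shape[1:])        # split batch and time again

print(y.shape)
```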

I tried this: the shapes are correct and the code runs. However, when I checked the number of parameters in my model, it was lower than in the paper. I believe the reason is that the same kernel weights are used across the whole batch and all time_steps.

Which means merging the dimensions isn’t actually doing a time-distributed convolution. Any insights on how to solve the problem?
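One way to check this concern is to compare the merged-dimension result with an explicit loop that applies the same layer to each time step separately. A small sketch under assumed shapes: if the two agree up to floating-point tolerance, the reshape behaves like a weight-sharing time-distributed layer, and the lower parameter count comes from the sharing itself.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
conv = nn.Conv2d(1, 8, kernel_size=3, stride=2, padding=1)  # assumed layer
x = torch.rand(4, 6, 1, 16, 16)                             # (batch, time, C, H, W)
b, t = x.shape[:2]

# Variant 1: merge batch and time, run the layer once, split again
merged = conv(x.reshape(b * t, *x.shape[2:]))
merged = merged.reshape(b, t, *merged.shape[1:])

# Variant 2: apply the same layer to every time step in a loop
looped = torch.stack([conv(x[:, i]) for i in range(t)], dim=1)

matches = torch.allclose(merged, looped, atol=1e-5)
print(matches)
```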

Thank you

ilyes · July 26, 2019, 1:18pm

Great news! I think I finally got the result that I needed using the nn.ModuleList() !

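import torch
import torch.nn as nn

class TimeDistributed(nn.Module):
    def __init__(self, layer, time_steps, *args):
        super(TimeDistributed, self).__init__()
        # one independent copy of the layer per time step (weights are not shared)
        self.layers = nn.ModuleList([layer(*args) for i in range(time_steps)])

    def forward(self, x):
        batch_size, time_steps, C, H, W = x.size()
        outputs = []
        for i in range(time_steps):
            # apply the i-th layer to the i-th time step
            output_t = self.layers[i](x[:, i, :, :, :])
            # (the original post had y.unsqueeze(1) here; y was a typo for output_t)
            outputs.append(output_t.unsqueeze(1))
        # concatenate along the time dimension
        return torch.cat(outputs, 1)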

I checked this by counting the number of parameters after using Conv2d on 100 time_steps:

x = torch.rand(20, 100, 1, 5, 9)

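# positional arguments only: a positional argument cannot follow a keyword argument
# Conv2d args: in_channels=1, out_channels=8, kernel_size=(3, 3), stride=2, padding=1 (bias defaults to True)
model = TimeDistributed(nn.Conv2d, 100, 1, 8, (3, 3), 2, 1)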
output = model(x)

print(output.size())   ## (20, 100, 8, 3, 5)

print("number of parameters : ", sum([p.numel() for p in model.parameters()]))

## number of parameters :  8000 (instead of 80)
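The two numbers can be reproduced by hand. A short sketch of the arithmetic, assuming the same Conv2d settings as above (1 input channel, 8 output channels, 3x3 kernel, with bias):

```python
import torch.nn as nn

in_ch, out_ch, k, time_steps = 1, 8, 3, 100

# weights (out_ch * in_ch * k * k) plus one bias per output channel
per_layer = out_ch * in_ch * k * k + out_ch

# cross-check against what PyTorch reports for a single layer
conv = nn.Conv2d(in_ch, out_ch, (k, k), 2, 1)
counted = sum(p.numel() for p in conv.parameters())

total_unshared = per_layer * time_steps   # one copy of the layer per time step
print(per_layer, counted, total_unshared)
```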

Or with batch normalization:

x = torch.rand(20, 100, 8, 3, 5)
model = TimeDistributed(nn.BatchNorm2d, 100, 8)  # time_steps=100, num_features=8

hash-ir · October 14, 2019, 2:29pm

Hey @ilyes. Thanks for the implementation. I was wondering, what is y in your forward function? Also, does it mean that the activation has to be time distributed, i.e. applied separately for every Conv2d in the ModuleList?

ilyes · October 14, 2019, 9:12pm

It’s a mistake, it should be output_t instead!
The activation has no parameters to learn, so I believe you can just apply it to the final output of your distributed layer.
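A quick sketch of why this works, with assumed shapes: nn.ReLU has no learnable parameters and acts elementwise, so applying it once to the whole 5-D output gives the same values as applying it step by step.

```python
import torch
import torch.nn as nn

relu = nn.ReLU()
n_params = sum(p.numel() for p in relu.parameters())   # nothing to learn

x = torch.randn(2, 5, 8, 3, 5)   # (batch, time, C, H, W)

whole = relu(x)                  # applied once to the full tensor
per_step = torch.stack([relu(x[:, i]) for i in range(x.shape[1])], dim=1)

same = torch.equal(whole, per_step)
print(n_params, same)
```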

Mauro_Carlin · January 28, 2020, 10:22am

Hey @ilyes, thanks for this solution. Is it possible to also “pack” the following inside TimeDistributed:

nn.ReLU()
nn.Dropout()

like:

TimeDistributed(nn.ReLU(inplace=True), time_steps = 20)
TimeDistributed(nn.Dropout(p=0.5), time_steps = 20)

Thanks.
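Regarding the question above: the TimeDistributed posted earlier calls layer(*args), so it expects a layer class rather than an instance such as nn.Dropout(p=0.5). A wrapper may not be needed at all, though, since nn.Dropout is elementwise and parameter-free. A minimal sketch with assumed shapes:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.5)
x = torch.ones(2, 20, 8, 3, 5)    # (batch, time, C, H, W)

drop.train()
y_train = drop(x)                 # each element becomes 0 or is scaled by 1 / (1 - p) = 2

drop.eval()
y_eval = drop(x)                  # dropout is a no-op in eval mode

print(y_train.shape, y_train.unique(), torch.equal(y_eval, x))
```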

Cristina_Segalin · March 3, 2020, 4:56pm

This is great, but I have two questions:
1) Can you create just one layer instead of a list of time_steps copies of the same layer, and loop the input through that single layer? It should be the same, right?
2) If you compare the number of parameters using Keras TimeDistributed and this implementation, you see that it increases quite a lot. It seems this PyTorch version creates a separate layer for every time step.

tom · March 3, 2020, 8:41pm

For an equivalent of Keras TimeDistributed you do indeed want just a single module.
Yes, as you noted, duplicating the module might not be the right thing.
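A sketch of the single-module variant, for comparison with the ModuleList version earlier in the thread. The class name SharedTimeDistributed is hypothetical; the Conv2d settings are the ones used above.

```python
import torch
import torch.nn as nn

class SharedTimeDistributed(nn.Module):   # hypothetical name
    """Applies one shared module to every time step."""
    def __init__(self, module):
        super().__init__()
        self.module = module

    def forward(self, x):
        # x: (batch, time, ...); the same module (same weights) sees every step
        return torch.stack([self.module(x[:, i]) for i in range(x.shape[1])], dim=1)

def count(m):
    return sum(p.numel() for p in m.parameters())

shared = SharedTimeDistributed(nn.Conv2d(1, 8, (3, 3), 2, 1))
separate = nn.ModuleList([nn.Conv2d(1, 8, (3, 3), 2, 1) for _ in range(100)])

out = shared(torch.rand(20, 100, 1, 5, 9))
print(out.shape, count(shared), count(separate))
```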

As an aside, there is an old issue requesting this feature: https://github.com/pytorch/pytorch/issues/1927

Best regards

Thomas

Cristina_Segalin · March 3, 2020, 9:06pm

Yes, using the implementation from that old issue saves on the number of parameters, but I am not sure it is the right way either. It basically reshapes the input so the module can be applied, then reshapes the result back to the original layout. So you might lose some temporal information this way (?), and with it the purpose of the TimeDistributed module.

I will train two models using both versions and see which leads to better results or makes more sense.
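On the reshaping question, a tiny sketch with made-up sizes suggests that the fold itself is lossless bookkeeping: step j of sample i lands at row i * T + j, and reshaping back restores the original order. Note that the wrapped module still sees each step in isolation, which holds for any time-distributed wrapper, so temporal modelling has to come from a later layer.

```python
import torch

b, t, f = 3, 4, 2
x = torch.arange(b * t * f).reshape(b, t, f)   # stand-in for (batch, time, features)

folded = x.reshape(b * t, f)

# step j of sample i is found at row i * t + j of the folded tensor
rows_ok = all(
    torch.equal(folded[i * t + j], x[i, j])
    for i in range(b) for j in range(t)
)

restored = folded.reshape(b, t, f)
round_trip = torch.equal(restored, x)
print(rows_ok, round_trip)
```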

simaiden · March 3, 2020, 9:34pm

What is the main focus of this implementation, and how does it differ from nn.Conv3d?
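One way to look at this question: a Conv3d kernel normally spans several time steps and mixes neighbouring frames, whereas a time-distributed Conv2d processes every frame independently. In the special case of a kernel depth of 1 the two coincide. A sketch with assumed shapes, copying the 2-D weights into the 3-D layer:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
conv2d = nn.Conv2d(2, 5, kernel_size=3, padding=1)
conv3d = nn.Conv3d(2, 5, kernel_size=(1, 3, 3), padding=(0, 1, 1))  # depth-1 kernel

with torch.no_grad():
    conv3d.weight.copy_(conv2d.weight.unsqueeze(2))   # (5, 2, 3, 3) -> (5, 2, 1, 3, 3)
    conv3d.bias.copy_(conv2d.bias)

x = torch.rand(3, 10, 2, 8, 8)                        # (B, T, C, H, W)
b, t, c, h, w = x.shape

# time-distributed Conv2d via the reshape trick
out2d = conv2d(x.reshape(b * t, c, h, w)).reshape(b, t, 5, h, w)

# Conv3d expects (B, C, T, H, W), so swap the time and channel dimensions
out3d = conv3d(x.permute(0, 2, 1, 3, 4)).permute(0, 2, 1, 3, 4)

same = torch.allclose(out2d, out3d, atol=1e-5)
print(same)
```

With a kernel depth greater than 1, the Conv3d output at each step would depend on adjacent frames and the two results would no longer match.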

Tethys_Sun · June 6, 2020, 9:21pm

Hey @ilyes, I think your implementation is different from the Keras TimeDistributed layer. You duplicated the layers for each time step, which means different processing is going on at different time steps. However, when I tried the Keras TimeDistributed layer, the number of trainable parameters is different from your implementation:

Am I missing something?

tcapelle · July 15, 2020, 12:38pm

I implemented a version myself that supports multiple arguments:

#export
class TimeDistributed(nn.Module):
    "Applies a module over tdim identically for each step"
    def __init__(self, module, low_mem=False, tdim=1):
        super(TimeDistributed, self).__init__()
        self.module = module
        self.low_mem = low_mem
        self.tdim = tdim

    def forward(self, *args, **kwargs):
        "input x with shape:(bs,seq_len,channels,width,height)"
        if self.low_mem or self.tdim != 1:
            # pass kwargs along as well (they were dropped in the first version)
            return self.low_mem_forward(*args, **kwargs)
        else:
            # only support tdim=1
            inp_shape = args[0].shape
            bs, seq_len = inp_shape[0], inp_shape[1]
            # fold time into the batch dimension, apply the module once, then unfold
            out = self.module(*[x.view(bs*seq_len, *x.shape[2:]) for x in args], **kwargs)
            out_shape = out.shape
            return out.view(bs, seq_len, *out_shape[1:])

    def low_mem_forward(self, *args, **kwargs):
        "input x with shape:(bs,seq_len,channels,width,height)"
        tlen = args[0].shape[self.tdim]
        # split every input into per-step slices along the time dimension
        args_split = [torch.unbind(x, dim=self.tdim) for x in args]
        out = []
        for i in range(tlen):
            # kwargs belong to the module call, not to list.append
            out.append(self.module(*[arg[i] for arg in args_split], **kwargs))
        return torch.stack(out, dim=self.tdim)

    def __repr__(self):
        return f'TimeDistributed({self.module})'

You use it like this:

tdconv = TimeDistributed(nn.Conv2d(2, 5, 3, 1, 1), tdim=1)

Then feed it a tensor with dimensions (bs, seq_len, ch, h, w). The tdim argument tells it which dimension holds the time steps:

tdconv(torch.rand(3, 10, 2, 8, 8))

It has two forward paths: one that folds the time steps into the batch dimension and runs the whole computation in a single call, and another that uses a for loop over the time steps.

I would not mind some help converting this to TorchScript.
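As a possible starting point for the TorchScript request, here is a deliberately restricted sketch, not a drop-in replacement for the class above: it assumes a single tensor input with time at dim 1 and a wrapped module that returns a 4-D tensor, which avoids the *args/**kwargs handling that scripting does not accept. The name ScriptableTimeDistributed is hypothetical.

```python
import torch
import torch.nn as nn

class ScriptableTimeDistributed(nn.Module):   # hypothetical name
    def __init__(self, module: nn.Module):
        super().__init__()
        self.module = module

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (bs, seq_len, C, H, W); fold time into the batch dimension
        bs, seq_len = x.size(0), x.size(1)
        out = self.module(x.flatten(0, 1))
        # assumes the wrapped module returns a 4-D tensor (N, C, H, W)
        return out.view(bs, seq_len, out.size(1), out.size(2), out.size(3))

eager = ScriptableTimeDistributed(nn.Conv2d(2, 5, 3, 1, 1))
scripted = torch.jit.script(eager)

x = torch.rand(3, 10, 2, 8, 8)
print(scripted(x).shape)
```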

Tethys_Sun · July 25, 2020, 8:07pm

Hi there, I’m not sure that using a for loop will save memory… I think the two will have the same memory cost in training.

I wrote one for myself too…

class TimeDistributed(nn.Module):

    def __init__(self, module, batch_first=False):
        super(TimeDistributed, self).__init__()
        self.module = module
        self.batch_first = batch_first

    def forward(self, x):
        ''' x size: (batch_size, time_steps, in_channels, height, width) '''
        batch_size, time_steps, C, H, W = x.size()
        c_in = x.view(batch_size * time_steps, C, H, W)
        c_out = self.module(c_in)
        r_in = c_out.view(batch_size, time_steps, -1)
        if self.batch_first is False:
            r_in = r_in.permute(1, 0, 2)
        return r_in

tcapelle · July 27, 2020, 7:11am

They don’t.
Try passing a resnet50 with a large batch size. Also, the unbind-based forward can deal with non-contiguous tensors, which the simple view-based forward can’t.
I really want a native ConvLSTM/ConvGRU module in PyTorch…
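The non-contiguous case can be reproduced in a few lines. In this sketch (shapes are made up) a transposed tensor cannot be folded with view, while unbind plus stack still works; reshape is shown as another option that copies when a view is impossible.

```python
import torch

# transposing makes the tensor non-contiguous: logical shape (3, 10, 2, 8, 8)
x = torch.rand(10, 3, 2, 8, 8).transpose(0, 1)
bs, seq_len = x.shape[:2]

try:
    x.view(bs * seq_len, *x.shape[2:])     # cannot merge dims 0 and 1 without a copy
    view_failed = False
except RuntimeError:
    view_failed = True

steps = torch.unbind(x, dim=1)             # per-step slices, works on any memory layout
restacked = torch.stack(steps, dim=1)

reshaped = x.reshape(bs * seq_len, *x.shape[2:])   # falls back to a copy when needed
print(x.is_contiguous(), view_failed, reshaped.shape)
```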

Tethys_Sun · July 27, 2020, 3:39pm

Thanks for your reply. Indeed, the unbind-based forward can deal with non-contiguous memory…

3nomis · July 1, 2021, 3:19pm

I haven’t got it yet. Is the module you created working correctly? Did the reshaping work as expected?

LANCELOT_ZHANG · March 22, 2022, 6:54pm

I think that for Conv layers, the @ilyes implementation is different from the one in Keras: the Keras documentation mentions that all time steps share the same weights, whereas that implementation creates a new layer instance for each time step. I would prefer the reshape version.