Best way to reduce dimensionality of Tensor (5D --> 4D)

Hey everyone,

I am currently working with convolutional recurrent units (ConvLSTM & ConvGRU).

Both expect an input tensor of shape:
[batch_size, timestep, num_channels, height, width].

For further processing I need the tensor to be of shape:
[batch_size, num_channels, height, width].

In my scenario I get the timesteps by saving the two previous results and stacking [t-2, t-1, t] along dimension 1. The output of the recurrent unit has the same shape.

What is the best way to get from
[ _, 3, _, _, _ ] to [ _, 1, _, _, _ ] such that I could do mytensor.squeeze(1)?

I was thinking about doing:

my_output_tensor.shape
# [8, 3, 2, 217, 512]
my_output_tensor = my_output_tensor[:,-1,:,:,:]
# [8, 2, 217, 512]

which gives me the output tensor for [t] but neglects the output of [t-2, t-1].

Alternatively, I was thinking about applying a 1x1 3D convolution along the timestep dimension to reduce the number of timestep features from 3 to 1.
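A minimal sketch of that 1x1 convolution idea, assuming the timestep axis (dim 1) is simply treated as the channel axis of nn.Conv3d. The name reduce_time is illustrative, and smaller spatial sizes are used to keep it light:

```python
import torch
import torch.nn as nn

# [batch_size, timesteps, num_channels, height, width]
my_output_tensor = torch.rand(8, 3, 2, 16, 32)

# nn.Conv3d treats dim 1 as its channel axis, so a 1x1x1 kernel learns a
# weighted sum of the 3 timesteps at every (channel, height, width) position.
reduce_time = nn.Conv3d(in_channels=3, out_channels=1, kernel_size=1)

reduced = reduce_time(my_output_tensor)  # [8, 1, 2, 16, 32]
reduced = reduced.squeeze(1)             # [8, 2, 16, 32]
print(reduced.shape)
```

Unlike taking only the last timestep, the learned weights let all three timesteps contribute to the result.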

I was wondering if there are any meaningful ways to reduce the dimensionality without losing too much information :slight_smile:

Thanks in Advance!

Cheers,
Sven

Hey Sven,

So the way I see it is that we can either do one of two things:

  • Collapse a dimension into another dimension
  • Some sort of operation to reduce a dimension

A straightforward approach would be to do something like this: x = x.view(batch_size, -1, height, width). This would collapse the timestep information into the channels. This might be problematic because future operations won’t strictly know about the temporal aspect, but there’s still some ordering to it all (timestep-1 features, timestep-2 features, etc.).
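A small sketch of collapsing the timestep axis into the channel axis, assuming example sizes chosen only for illustration. It uses reshape rather than view so that it also works on non-contiguous tensors:

```python
import torch

batch_size, timesteps, num_channels, height, width = 8, 3, 2, 16, 32
x = torch.rand(batch_size, timesteps, num_channels, height, width)

# Merge timesteps and channels into one axis of size timesteps * num_channels.
collapsed = x.reshape(batch_size, -1, height, width)
print(collapsed.shape)  # torch.Size([8, 6, 16, 32])

# The ordering is preserved: the first num_channels entries belong to
# timestep 0, the last num_channels entries to the final timestep.
assert torch.equal(collapsed[:, :num_channels], x[:, 0])
assert torch.equal(collapsed[:, -num_channels:], x[:, -1])
```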

I do like the idea of a 3D convolution; I think it can do a good job of embedding your temporal dimension. You may also consider swapping the time dimension and the num_channels dimension before the convolution. You can then stride over time and reduce it that way (while also increasing the number of feature maps). I would suggest the latter, since I think it can do a better job of embedding the temporal information.
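A minimal sketch of the swapped layout, assuming a kernel that spans all 3 timesteps with no padding along time. The layer name and the choice of 4 output channels are illustrative:

```python
import torch
import torch.nn as nn

x = torch.rand(8, 3, 2, 16, 32)   # [batch, timesteps, channels, H, W]
x = x.transpose(1, 2)             # [batch, channels, timesteps, H, W]

# The kernel covers all 3 timesteps (depth) but only 1x1 in space. With no
# padding along depth, the time axis shrinks from 3 to 1 while the number
# of channels grows from 2 to 4.
temporal_conv = nn.Conv3d(in_channels=2, out_channels=4, kernel_size=(3, 1, 1))

y = temporal_conv(x)              # [8, 4, 1, 16, 32]
y = y.squeeze(2)                  # [8, 4, 16, 32]
print(y.shape)
```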

Hey Alex, thanks for the answer! :blush:

What would be the benefit of swapping the time dimension and num_channels?

The 3D conv will be applied along dim=1 if the input is a 5D tensor, so if I swap it, the “time” dimension I want to reduce would not be affected? Or did I get you wrong?

Could you explain what you mean by “stride over time”? Do you mean the stride size?

So the 3D convolution operator expects our input to be N x C x D x H x W, where N is our batch size, C is our channels, D is our depth, H is our height, and W is our width. Say that our kernel size is 3, that means for any given value in our output, we utilized a 3x3x3 volume in our DxHxW dimensions across all channels.

For example, suppose that A is our input tensor of size 1 x 8 x 5 x 3 x 3, our kernel size is 3 with 0 padding, and out channels is 8. Let f denote our convolution operation. The following operations are performed:

  • f(A[:, :, 0:3, 0:3, 0:3])
  • f(A[:, :, 1:4, 0:3, 0:3])
  • f(A[:, :, 2:5, 0:3, 0:3])

The main thing to notice here is that each operation takes in all of our channels and only a subset of our D x H x W dimensions. When I say “stride over time”, I really mean “stride over the depth dimension”.
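The three windows above can be checked directly. This is a small sketch with a randomly initialised layer, so only the shapes and the window correspondence matter, not the values:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
A = torch.rand(1, 8, 5, 3, 3)     # N x C x D x H x W
f = nn.Conv3d(in_channels=8, out_channels=8, kernel_size=3, padding=0)

out = f(A)
print(out.shape)                  # torch.Size([1, 8, 3, 1, 1])

# Each output position along depth comes from one 3x3x3 window that
# spans all 8 input channels.
for d in range(3):
    window = A[:, :, d:d + 3, 0:3, 0:3]
    assert torch.allclose(f(window), out[:, :, d:d + 1], atol=1e-5)
```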

Going back to your problem; we probably want to leave height and width as is. But there’s the possibility of swapping your timestep dimension and your num_channels dimension. The difference boils down to, do we want to compute our convolution across all of timestep or all of num_channels? Another way to ask this question is, do we want to stride over the num_channels dimension or the timestep dimension?

Depending on how we shape our input, the convolution operation will act differently. It might not make a big difference, but I think it’s something worth considering.

I like the idea of stride over the depth dimension, never thought about something like that.
But I still don’t get how this helps to reduce dimensionality.

I built a small example:

import torch.nn as nn 
import torch

a = torch.rand((6, 3, 2, 217, 512)) # bs, timesteps, channels, H, W
a = a.transpose(1,2) # swap timesteps and channels
conv = nn.Conv3d(in_channels=2, out_channels=2, kernel_size=1, padding=0)  # <-- Out channels = 2
b = conv(a[:, :, 0:1, :, :])
c = conv(a[:, :, 1:2, :, :])
d = conv(a[:, :, 2:3, :, :])
for i in [a,b,c,d]:
  print(i.shape)
#torch.Size([6, 2, 3, 217, 512])
#torch.Size([6, 2, 1, 217, 512])
#torch.Size([6, 2, 1, 217, 512])
#torch.Size([6, 2, 1, 217, 512])

I end up with b, c, d, which are again 3 tensors that I would need to stack, and that takes me back to the original shape.
What I could do is something like:

a = torch.rand((6, 3, 2, 217, 512))
a = a.transpose(1,2)
conv = nn.Conv3d(in_channels=2, out_channels=1, kernel_size=1, padding=0)  # <-- Out channels = 1
b = conv(a[:, :, 0:1, :, :])
c = conv(a[:, :, 1:2, :, :])
d = conv(a[:, :, 2:3, :, :])
e = torch.cat((b, c, d), dim=2).transpose(2, 1)
for i in [a,b,c,d,e]:
  print(i.shape)
#torch.Size([6, 2, 3, 217, 512])
#torch.Size([6, 1, 1, 217, 512])
#torch.Size([6, 1, 1, 217, 512])
#torch.Size([6, 1, 1, 217, 512])
#torch.Size([6, 3, 1, 217, 512]) <--goal dimension
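Since the kernel size is 1, the three sliced calls should be equivalent to a single call on the whole tensor. A short sketch checking that, with smaller spatial sizes assumed to keep it quick:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
a = torch.rand(6, 3, 2, 16, 32).transpose(1, 2)   # [6, 2, 3, 16, 32]
conv = nn.Conv3d(in_channels=2, out_channels=1, kernel_size=1)

# Slice by slice, as in the example above
sliced = torch.cat([conv(a[:, :, t:t + 1]) for t in range(3)], dim=2)
# One call over all timesteps at once
whole = conv(a)

print(whole.shape)  # torch.Size([6, 1, 3, 16, 32])
assert torch.allclose(sliced, whole, atol=1e-5)
```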

Do you have some paper/report/blogpost that investigated this striding? Or was this suggestion based on your personal experience?

To clarify, the convolution operation in general can be used to reduce dimensionality. But since we want to ensure that the new representation contains the information we need, it's useful to discuss the ways we can stride over the data.

Here’s an example I made to show the two methods I’ve mentioned:


import torch
import torch.nn as nn

def stride_over_channels(x):
    # Since our timesteps is in our "channels" position, we slowly
    # reduce it down to 1.
    # This is striding over our channels
    c1 = nn.Conv3d(in_channels=8, out_channels=4, kernel_size=3, padding=1)
    c2 = nn.Conv3d(in_channels=4, out_channels=1, kernel_size=3, padding=1)

    out1 = c1(x)            # batch-size x 4 x channels x height x width
    out2 = c2(out1)         # batch-size x 1 x channels x height x width
    out3 = out2.squeeze(1)  # batch-size x channels x height x width

    return out3

def stride_over_time(x):
    # Swap the dimensions to be batch-size x channels x timesteps x height x width
    # This causes us to stride over time
    x = x.permute((0, 2, 1, 3, 4))

    # Slowly reduce our channels down to 1
    c1 = nn.Conv3d(in_channels=16, out_channels=8, kernel_size=3, padding=1)
    c2 = nn.Conv3d(in_channels=8, out_channels=1, kernel_size=3, padding=1)

    out1 = c1(x)            # batch-size x 8 x timesteps x height x width
    out2 = c2(out1)         # batch-size x 1 x timesteps x height x width
    out3 = out2.squeeze(1)  # batch-size x timesteps x height x width

    return out3

def stride_over_time_2(x):
    # Swap the dimensions to be batch-size x channels x timesteps x height x width
    x = x.permute((0, 2, 1, 3, 4))

    # We can be even fancier and build our channels up while
    # reducing our temporal dimension. This however requires more convolutions
    # and thus more parameters
    c1 = nn.Conv3d(in_channels=16, out_channels=32, kernel_size=3, padding=(0, 1, 1))
    c2 = nn.Conv3d(in_channels=32, out_channels=64, kernel_size=3, padding=(0, 1, 1))
    c3 = nn.Conv3d(in_channels=64, out_channels=128, kernel_size=3, padding=(0, 1, 1))

    out1 = c1(x)            # batch-size x 32  x 6 x height x width
    out2 = c2(out1)         # batch-size x 64  x 4 x height x width
    out3 = c3(out2)         # batch-size x 128 x 2 x height x width

    # 2 isn't a nice size to perform a convolution on,
    # so let's just collapse it into the channels at this point
    out4 = out3.view(-1, 256, 64, 64)

    return out4


# x is batch-size x timesteps x channels x height x width
x = torch.randn((4, 8, 16, 64, 64))

y1 = stride_over_channels(x)
y2 = stride_over_time(x)
y3 = stride_over_time_2(x)

print(y1.shape)  # 4 x 16  x 64 x 64
print(y2.shape)  # 4 x 8   x 64 x 64
print(y3.shape)  # 4 x 256 x 64 x 64

This was a suggestion based on my personal experience. I’ve worked with spatio-temporal data, so the question of “how do we perform convolutions on this data?” has come up. I hope my example code clears things up for you. Let me know if anything is still confusing.

Love the ideas you have! I will definitely try some out and see if there is a difference between applying the conv over the channels vs. over the timestep dimension :slight_smile:
Thanks for your help!