How to make the output of a CNN the input of an RNN layer?

Hi,
I am new to CNNs, RNNs, and deep learning.
I am trying to build an architecture that combines a CNN and an RNN.
input image size = [20, 3, 48, 48]
CNN output size = [20, 64, 48, 48]
Now I want the CNN output to be the RNN input,
but as far as I know the input of an RNN must be 3-dimensional, i.e. [seq_len, batch, input_size].
How can I turn the 4-dimensional [20, 64, 48, 48] tensor into a 3-dimensional one for the RNN input?

You would have to decide which dimension(s) should be the temporal dimension (seq_len) and which the features (input_size).
E.g. you could treat the output channels as the features and the spatial dimensions (height and width) as the temporal dimension.
To do so, you could first flatten the spatial dimensions via:

output = output.view(output.size(0), output.size(1), -1)

and then permute the dimensions via:

output = output.permute(2, 0, 1)
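
Putting both steps together, a minimal sketch with the shapes from the question (the hidden size of 64 here is an arbitrary choice, not from the thread):

import torch
import torch.nn as nn

output = torch.randn(20, 64, 48, 48)                      # CNN output: [batch, channels, H, W]
output = output.view(output.size(0), output.size(1), -1)  # [20, 64, 2304]
output = output.permute(2, 0, 1)                          # [2304, 20, 64] = [seq_len, batch, input_size]

rnn = nn.RNN(input_size=64, hidden_size=64)               # hidden_size=64 is an arbitrary choice
out, h = rnn(output)                                      # out: [2304, 20, 64]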

But what if we use batch_first=True in the RNN layer? @ptrblck

In that case, the permutation should be:

output = output.permute(0, 2, 1)

since the RNN expects the input as [batch_size, seq_len, features].
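
For reference, a minimal sketch of the batch_first variant (same assumed shapes and arbitrary hidden size as above):

output = torch.randn(20, 64, 48, 48)                      # CNN output: [batch, channels, H, W]
output = output.view(output.size(0), output.size(1), -1)  # [20, 64, 2304]
output = output.permute(0, 2, 1)                          # [20, 2304, 64] = [batch, seq_len, features]

rnn = nn.RNN(input_size=64, hidden_size=64, batch_first=True)
out, h = rnn(output)                                      # out: [20, 2304, 64]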


Many thanks @ptrblck
After your solution my input size changed to [2304, 20, 64], and after going through the RNN layer my output is the same size, so I try to reshape it back to [20, 64, 48, 48] because my network is CNN > RNN > CNN.
But somehow, after going through the RNN layer, some data in the tensor has changed.
While I'm trying to feed the output back into the CNN I get this error:

TypeError: conv2d(): argument 'input' (position 1) must be Tensor, not tuple

Do you have any idea about this problem?

nn.RNN returns a tuple (output, hidden_state), as described in the docs, so you might want to use only one of these outputs for further processing.
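
In code, that means unpacking the tuple before the next conv layer; a sketch with the shapes from this thread (the name next_conv is hypothetical, and the reshape back to [20, 64, 48, 48] is one way to undo the earlier flattening, not necessarily the poster's exact code):

out, hidden = rnn(x)                 # nn.RNN returns (output, h_n), not a single tensor
out = out.permute(1, 2, 0)           # [2304, 20, 64] -> [20, 64, 2304]
out = out.reshape(20, 64, 48, 48)    # use .reshape: the permuted tensor is non-contiguous
out = next_conv(out)                 # hypothetical second CNN stage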


Hi,
I am trying to feed the output features of my 3D CNN to a GRU.
The inputs to my 3D CNN are videos, processed and stored in numpy arrays of the shape [70, 1, 29, 88, 88], where the dimensions correspond to [batch_size, num_channels, num_frames, height, width].

Here is my 3D CNN:
import torch
import torch.nn as nn
import torch.nn.functional as F

class CNN3D(nn.Module):
    def __init__(self, t_dim=29, img_x=88, img_y=88, drop_p=0.2, fc_hidden1=256, fc_hidden2=128, num_classes=2):
        super(CNN3D, self).__init__()

        # set video dimensions
        self.t_dim = t_dim
        self.img_x = img_x
        self.img_y = img_y
        # fully connected layer hidden nodes
        self.fc_hidden1, self.fc_hidden2 = fc_hidden1, fc_hidden2
        self.drop_p = drop_p
        self.num_classes = num_classes
        self.ch1, self.ch2 = 32, 48
        self.k1, self.k2 = (5, 5, 5), (3, 3, 3)  # 3d kernel sizes
        self.s1, self.s2 = (2, 2, 2), (2, 2, 2)  # 3d strides
        self.pd1, self.pd2 = (0, 0, 0), (0, 0, 0)  # 3d padding

        # compute conv1 & conv2 output shapes (helper defined elsewhere; see the sketch below)
        self.conv1_outshape = conv3D_output_size((self.t_dim, self.img_x, self.img_y), self.pd1, self.k1, self.s1)
        self.conv2_outshape = conv3D_output_size(self.conv1_outshape, self.pd2, self.k2, self.s2)

        self.conv1 = nn.Conv3d(in_channels=1, out_channels=self.ch1, kernel_size=self.k1, stride=self.s1,
                               padding=self.pd1)
        self.bn1 = nn.BatchNorm3d(self.ch1)
        self.conv2 = nn.Conv3d(in_channels=self.ch1, out_channels=self.ch2, kernel_size=self.k2, stride=self.s2,
                               padding=self.pd2)
        self.bn2 = nn.BatchNorm3d(self.ch2)
        self.relu = nn.ReLU(inplace=True)
        self.drop = nn.Dropout3d(self.drop_p)
        self.pool = nn.MaxPool3d(2)
        self.fc1 = nn.Linear(self.ch2 * self.conv2_outshape[0] * self.conv2_outshape[1] * self.conv2_outshape[2],
                             self.fc_hidden1)  # fully connected hidden layer
        self.fc2 = nn.Linear(self.fc_hidden1, self.fc_hidden2)
        self.fc3 = nn.Linear(self.fc_hidden2, self.num_classes)  # fully connected layer, output = multi-classes

    def forward(self, x_3d):
        # print(x_3d.shape)  # [70, 1, 29, 88, 88]
        # Conv 1
        x = self.conv1(x_3d)
        x = self.bn1(x)
        x = self.relu(x)
        x = self.drop(x)
        # print(x.shape)  # [70, 32, 13, 42, 42]
        # Conv 2
        x = self.conv2(x)
        x = self.bn2(x)
        x = self.relu(x)
        x = self.drop(x)
        # print(x.shape)  # [70, 48, 6, 20, 20]
        # FC 1 and 2
        x = x.view(x.size(0), -1)
        # print(x.shape)  # [70, 115200]
        x = F.relu(self.fc1(x))
        # print(x.shape)  # [70, 256]
        x = F.relu(self.fc2(x))
        # print(x.shape)  # [70, 128]
        x = F.dropout(x, p=self.drop_p, training=self.training)
        x = self.fc3(x)
        # print(x.shape)  # [70, 2]

        return x

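The class above calls conv3D_output_size, which isn't shown in the post; a sketch of such a helper (an assumption on my part, but consistent with the shapes printed in forward) would be:

def conv3D_output_size(img_size, padding, kernel_size, stride):
    # assumed helper: standard Conv3d output-shape formula, applied per dimension (D, H, W)
    return tuple(
        (img_size[i] + 2 * padding[i] - kernel_size[i]) // stride[i] + 1
        for i in range(3)
    )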
I tried to add a GRU layer before the linear layers, but I couldn't, as the GRU expects an input of shape (batch_size, seq_len, input_size) and my 3D CNN reduces the number of frames from 29.

How can I feed the 3D CNN's extracted features to the GRU?
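
One possible approach, following the earlier advice in this thread (a sketch, not a confirmed answer): insert the GRU between conv2 and the fully connected layers, treat the remaining frame dimension of the conv2 output as seq_len, and flatten channels and spatial dimensions into the features; the hidden size of 256 is an arbitrary choice:

# x is the conv2 output: [70, 48, 6, 20, 20] = [batch, channels, frames, H, W]
x = x.permute(0, 2, 1, 3, 4)             # [70, 6, 48, 20, 20]
x = x.reshape(x.size(0), x.size(1), -1)  # [70, 6, 19200] = [batch, seq_len, features]

gru = nn.GRU(input_size=48 * 20 * 20, hidden_size=256, batch_first=True)
out, h = gru(x)                          # out: [70, 6, 256]
x = out[:, -1]                           # e.g. take the last time step -> [70, 256]
# ... then feed x to the fully connected layers

Whether to use the last time step, an average over time steps, or the full sequence depends on the task.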