Computing FC Layer Input Size for 1D Text CNN with Multiple CNN Layers

Hello,

I am translating some Keras code into PyTorch. Unfortunately, I'm finding it hard to compute the input size for the fully connected layer in my multi-layer CNN. I understand that PyTorch expects this size to be passed explicitly, and that the docs for nn.Conv1d and nn.MaxPool1d give formulas for each layer's output size, which can be chained to arrive at the FC input size. Would someone be willing to check that I've implemented these formulas correctly? Note that kernel_sizes is a list with one kernel size per nn.Conv1d layer. Thanks very much for any assistance.
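
For reference, the output-length formula from the nn.Conv1d and nn.MaxPool1d docs that I'm chaining through the layers is:

L_out = floor((L_in + 2 * padding - dilation * (kernel_size - 1) - 1) / stride + 1)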

import math

import torch
import torch.nn as nn
import torch.nn.functional as F


class DeepCNN(nn.Module):

    def __init__(
        self,
        n_classes,
        max_seq_len,
        embedding_matrix,
        kernel_sizes,
        conv_dropout_rates,
        n_fc_neurons_l0,
        n_fc_neurons_l1,
        fc_dropout_rate_l0,
        fc_dropout_rate_l1,
        pad_idx
    ):
        super(DeepCNN, self).__init__()
        
        layers, drops, pools = [], [], []
        embedding_dim = embedding_matrix.shape[1]
        
        # Embedding layer
        self.embedding = nn.Embedding.from_pretrained(
            embeddings=embedding_matrix,
            freeze=False,
            padding_idx=pad_idx,
            max_norm=None,
            norm_type=2,
            scale_grad_by_freq=False,
            sparse=False
        )
        
        # Conv layer 0
        layers.append(
            nn.Conv1d(
                in_channels=embedding_dim,
                out_channels=embedding_dim,
                kernel_size=kernel_sizes[0],
                stride=1,
                padding=0
            )
        )
        drops.append(
            nn.Dropout(p=conv_dropout_rates[0])
        )
        pools.append(
            nn.MaxPool1d(
                kernel_size=2,
                stride=None,
                padding=0
            )
        )
        
        # Conv layer 1
        layers.append(
            nn.Conv1d(
                in_channels=embedding_dim,
                out_channels=2 * embedding_dim,
                kernel_size=kernel_sizes[1],
                stride=1,
                padding=0
            )
        )
        drops.append(
            nn.Dropout(p=conv_dropout_rates[1])
        )
        pools.append(
            nn.MaxPool1d(
                kernel_size=2,
                stride=None,
                padding=0
            )
        )
        
        # Conv layer 2
        layers.append(
            nn.Conv1d(
                in_channels=2 * embedding_dim,
                out_channels=2 * embedding_dim,
                kernel_size=kernel_sizes[2],
                stride=1,
                padding=0
            )
        )
        drops.append(
            nn.Dropout(p=conv_dropout_rates[2])
        )
        pools.append(
            nn.MaxPool1d(
                kernel_size=2,
                stride=None,
                padding=0
            )
        )
        
        # Set conv layers with dropout and pooling
        # (nn.ModuleList, rather than plain lists, so the submodules are registered with the model)
        self.layers = nn.ModuleList(layers)
        self.drops = nn.ModuleList(drops)
        self.pools = nn.ModuleList(pools)
        
        # Set fc layers with dropout
        self.fc0 = nn.Linear(self.num_feat(embedding_dim, kernel_sizes) * 2 * embedding_dim, n_fc_neurons_l0)
        self.drop0 = nn.Dropout(p=fc_dropout_rate_l0)
        self.fc1 = nn.Linear(n_fc_neurons_l0, n_fc_neurons_l1)
        self.drop1 = nn.Dropout(p=fc_dropout_rate_l1)
        self.output = nn.Linear(n_fc_neurons_l1, n_classes)
        
    def num_feat(self, embedding_dim, kernel_sizes):
        
        padding = 0
        stride = 1
        dilation = 1
        max_pool_kernel_size = 2
        
        out_conv_0 = math.floor(((embedding_dim + 2 * padding - dilation * (kernel_sizes[0] - 1) - 1) / stride) + 1)
        out_pool_0 = math.floor(((out_conv_0 + 2 * padding - dilation * (max_pool_kernel_size - 1) - 1) / stride) + 1)
        
        out_conv_1 = math.floor(((embedding_dim + 2 * padding - dilation * (kernel_sizes[1] - 1) - 1) / stride) + 1)
        out_pool_1 = math.floor(((out_conv_1 + 2 * padding - dilation * (max_pool_kernel_size - 1) - 1) / stride) + 1)
        
        out_conv_2 = math.floor(((2 * embedding_dim + 2 * padding - dilation * (kernel_sizes[2] - 1) - 1) / stride) + 1)
        out_pool_2 = math.floor(((out_conv_2 + 2 * padding - dilation * (max_pool_kernel_size - 1) - 1) / stride) + 1)

        return out_pool_2

    def forward(self, input):
        
        # Get embeddings from input tokens
        # Output shape: (batch_size, max_seq_len, embedding_dim)
        x = self.embedding(input)
        
        # Permute embedding output to match input shape requirement of nn.Conv1d
        # Output shape: (batch_size, embedding_dim, max_seq_len)
        x = x.transpose(1, 2)
        
        # Run through CNN layers
        for layer, drop, pool in zip(self.layers, self.drops, self.pools):
            x = F.relu(layer(x))
            x = drop(x)
            x = pool(x)
            
        # Flatten from (batch_size, out_channels, seq_len) to (batch_size, out_channels * seq_len)
        x = x.view(x.size(0), -1)
        
        # Debugging: print the flattened size that num_feat() is supposed to reproduce
        print(x.shape[1])
        
        # Run through fully connected layers
        x = F.relu(self.fc0(x))
        x = self.drop0(x)
        x = F.relu(self.fc1(x))
        x = self.drop1(x)
        logits = self.output(x)
        
        # Convert to class probabilities
        probs = torch.sigmoid(logits)
        
        return probs

If it works and it’s in line with the formulas given, it should be alright.

I would recommend putting the calculations into their own methods, though. There's a lot of copy-and-paste going on, which makes it annoying to make changes, e.g., when adding more layers.

Maybe you can have a look at some older code of mine, particularly at the methods _calc_conv_output_size() and _calc_maxpool_output_size() and how/where they are used. I think it makes for more flexible and cleaner code.


Thanks for your reply @vdw. The code you shared is very helpful.

The main thing missing from my fully connected layer input size calculation above is that I interpreted L_in from the nn.Conv1d docs as the input channel size (in my case embedding_dim). I see your code correctly uses the max sequence length as L_in.
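
To make that concrete with made-up numbers: if max_seq_len = 5000, kernel_size = 3, stride = 1, padding = 0, and dilation = 1, the conv output length is floor((5000 - 1 * (3 - 1) - 1) / 1 + 1) = 4998, and the following max pool with kernel_size = 2 and stride = 2 gives floor((4998 - 1 * (2 - 1) - 1) / 2 + 1) = 2499. My old version plugged embedding_dim in where the 5000 belongs.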

I agree that my code above could be refactored to remove repeated lines. Here's an updated version, organized much like the code you shared. I also borrowed the idea of concatenating the outputs of each CNN layer, which I've seen in other code (follow-up question on this below if you have the chance and are interested).

import math

import torch
import torch.nn as nn
import torch.nn.functional as F


class DeepCNN(nn.Module):
    """
    Borrows from code here:
    https://github.com/chrisvdweth/ml-toolkit/blob/master/pytorch/models/text/classifier/cnn.py
    """

    def __init__(
        self,
        n_classes,
        max_seq_len,
        embedding_matrix,
        kernel_sizes,
        conv_dropout_rates,
        n_fc_neurons_l0,
        n_fc_neurons_l1,
        fc_dropout_rate_l0,
        fc_dropout_rate_l1,
        pad_idx
    ):
        super(DeepCNN, self).__init__()
        
        self.embedding_dim = embedding_matrix.shape[1]
        self.flatten_size = 0
        self.conv_padding = 0
        self.conv_stride = 1
        self.maxpool_kernel_size = 2
        self.maxpool_padding = 0
        self.maxpool_dilation = 1
        
        # Would like to be able to increase these at each layer
        self.in_channels = [
            self.embedding_dim,
            self.embedding_dim,
            self.embedding_dim
        ]
        
        self.out_channels = [
            self.embedding_dim,
            self.embedding_dim,
            self.embedding_dim
        ]
        
        # Embedding layer
        self.embedding = nn.Embedding.from_pretrained(
            embeddings=embedding_matrix,
            freeze=False,
            padding_idx=pad_idx,
            max_norm=None,
            norm_type=2,
            scale_grad_by_freq=False,
            sparse=False
        )
        
        # Iterate through kernels and channels to set layers and compute output sizes
        self.conv_layers = nn.ModuleDict()
        self.dropout_layers = nn.ModuleDict()
        self.maxpool_layers = nn.ModuleDict()
        for i, (k, ic, oc, p) in enumerate(zip(kernel_sizes, self.in_channels, self.out_channels, conv_dropout_rates)):
            
            # Set conv layers
            self.conv_layers[f'conv_{i}'] = nn.Conv1d(
                in_channels=ic,
                out_channels=oc,
                kernel_size=k,
                stride=self.conv_stride,
                padding=self.conv_padding
            )
            
            # Set dropout layers
            self.dropout_layers[f'dropout_{i}'] = nn.Dropout(p=p)
            
            # Set maxpool layers
            self.maxpool_layers[f'maxpool_{i}'] = nn.MaxPool1d(
                kernel_size=self.maxpool_kernel_size,
                stride=self.maxpool_kernel_size,
                padding=self.maxpool_padding
            )
            
            # Calculate conv output size
            conv_out_size = self._calc_conv_output_size(
                seq_len=max_seq_len,
                kernel_size=k,
                stride=self.conv_stride,
                padding=self.conv_padding
            )
            
            # Calculate maxpool output size
            maxpool_out_size = self._calc_maxpool_output_size(
                seq_len=conv_out_size,
                kernel_size=self.maxpool_kernel_size,
                stride=self.maxpool_kernel_size,
                padding=self.maxpool_padding,
                dilation=self.maxpool_dilation
            )
            
            # Add output sizes
            self.flatten_size += maxpool_out_size
            
        # Compute final size of flattened input to fully connect layers
        self.flatten_size *= self.out_channels[-1]
        
        # Set fc layers with dropout
        self.fc0 = nn.Linear(self.flatten_size, n_fc_neurons_l0)
        self.drop0 = nn.Dropout(p=fc_dropout_rate_l0)
        self.fc1 = nn.Linear(n_fc_neurons_l0, n_fc_neurons_l1)
        self.drop1 = nn.Dropout(p=fc_dropout_rate_l1)
        self.output = nn.Linear(n_fc_neurons_l1, n_classes)
        
    def _calc_conv_output_size(self, seq_len, kernel_size, stride, padding):
        # Output length of nn.Conv1d (with dilation = 1), per the formula in the docs
        return int(((seq_len - kernel_size + 2 * padding) / stride) + 1)

    def _calc_maxpool_output_size(self, seq_len, kernel_size, stride, padding, dilation):
        # Output length of nn.MaxPool1d, per the formula in the docs
        return int(math.floor(((seq_len + 2 * padding - dilation * (kernel_size - 1) - 1) / stride) + 1))

    def forward(self, input):
        
        # Get embeddings from input tokens
        # Output shape: (batch_size, max_seq_len, embedding_dim)
        x = self.embedding(input)
        
        # Permute embedding output to match input shape requirement of nn.Conv1d
        # Where embedding_dim = input channels
        # Output shape: (batch_size, embedding_dim, max_seq_len)
        x = x.transpose(1, 2)
        
        # Run through conv, dropout, and maxpool layers
        xs = []
        for i in range(len(self.conv_layers)):
            l_out = F.relu(self.conv_layers[f'conv_{i}'](x))
            l_out = self.dropout_layers[f'dropout_{i}'](l_out)
            l_out = self.maxpool_layers[f'maxpool_{i}'](l_out)
            l_out = l_out.view(l_out.size(0), -1)
            xs.append(l_out)
        
        # Concatenate conv output layers
        x = torch.cat(xs, 1)
        
        # Run through fully connected layers
        x = F.relu(self.fc0(x))
        x = self.drop0(x)
        x = F.relu(self.fc1(x))
        x = self.drop1(x)
        logits = self.output(x)
        
        # Convert to class probabilities
        probs = torch.sigmoid(logits)
        
        return probs
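
In case it helps anyone else, here's the quick sanity check I ran to confirm that self.flatten_size matches what fc0 actually receives (all hyperparameters and the random embedding matrix below are made up for illustration):

import torch

# Made-up stand-in for a pretrained matrix: vocab_size x embedding_dim
embedding_matrix = torch.randn(10000, 128)

model = DeepCNN(
    n_classes=2,
    max_seq_len=512,
    embedding_matrix=embedding_matrix,
    kernel_sizes=[3, 4, 5],
    conv_dropout_rates=[0.1, 0.1, 0.1],
    n_fc_neurons_l0=256,
    n_fc_neurons_l1=128,
    fc_dropout_rate_l0=0.5,
    fc_dropout_rate_l1=0.5,
    pad_idx=0
)

# Random token ids standing in for a padded batch of shape (batch_size, max_seq_len)
tokens = torch.randint(0, 10000, (32, 512))

print(model.flatten_size)   # size computed up front in __init__
print(model(tokens).shape)  # torch.Size([32, 2]); a successful forward pass means fc0's input size lined up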

Two related follow-up questions:

  1. In the code above, you'll see where I define self.in_channels and self.out_channels and note that I would like to increase the out_channels size at each layer. When I attempt this, I get errors like "Given groups=1, weight of size [256, 256, 6], expected input[32, 128, 5000] to have 256 channels, but got 128 channels instead". Is there a way to increase the size of out_channels?
  2. @vdw, I see that you concatenate the output of each CNN + Maxpool operation in your code and pass these features to your fully connected layers. Other PyTorch code I've seen runs the input through each layer in sequence and passes only the final output to the fully connected layers. I assume these approaches differ in that concatenation provides features from each layer, representing various levels of abstraction, whereas passing only the final output provides just the last layer's representation. Am I thinking about that right?

@jstremme Sorry, yes… I've implemented this approach of using CNNs for text classification, which uses multiple CNN layers in parallel (not sequentially) to handle different spans of words. Hence my concatenating the Maxpool outputs instead of using them as input to a subsequent CNN layer. Otherwise, I rarely use CNNs for my tasks, so I'm not overly familiar with all the tricks and traps.
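
To make the distinction concrete, the parallel pattern boils down to roughly this (a minimal sketch with made-up sizes, using a global max pool over time as in the classic text-CNN setup rather than the MaxPool1d(kernel_size=2) in your code):

import torch
import torch.nn as nn
import torch.nn.functional as F

embedding_dim, seq_len, batch_size = 128, 100, 4

# One conv branch per kernel size; every branch sees the full embedded sequence
branches = nn.ModuleList([
    nn.Conv1d(in_channels=embedding_dim, out_channels=64, kernel_size=k) for k in (3, 4, 5)
])

x = torch.randn(batch_size, embedding_dim, seq_len)  # stands in for the embedded batch
pooled = [F.relu(conv(x)).max(dim=2).values for conv in branches]  # global max pool over time
features = torch.cat(pooled, dim=1)  # (batch_size, 3 * 64), the input to the FC layers
print(features.shape)  # torch.Size([4, 192])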

Makes perfect sense.

The code in your initial answer helped solve my problem of computing the FC layer input size, so I will mark it as the solution.

For anyone else reading, note that my first and second code snippets differ in exactly the way described in this thread (sequential layers vs. parallel layers with concatenated outputs), and that the FC input size computation is wrong in the first snippet (also described above). I'll see about posting a follow-up if I end up using the sequential approach.

Thanks again, @vdw.