Padding applied even after giving padding=0 in nn.Conv1d()

I have created this model:

import torch.nn as nn

def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

a = nn.Module()
a.l1 = nn.Conv1d(90000, 300, 2, padding=0)
print(count_parameters(a))

It gives me output as 54000300. I think it should give only 53999700. The reason is that I'm not padding the sequence, and because the filter length is 2, the model should ignore the last input, as there is no padded input after the last position. So technically the number of parameters should be 89999 * 2 * 300, but instead it is 90000 * 2 * 300. It makes no difference whether I pad the sequence or not; even with padding=1 or padding=2 the answer stays 54000300. Also, the formula described in the documentation for calculating the output does not take padding into account; it always sums from 0 to C_in - 1.
The sum should take the filter size into account: if it is greater than 1, the corresponding number of columns at the end should be ignored. For example, with a filter size of 2 the last column cannot be used for convolution, since the sequence is not padded; similarly, with a filter size of 3 the last two columns should be ignored for the same reason.
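For what it's worth, a quick toy check (much smaller sizes than my real 90000-channel layer, just so it runs fast) reproduces what I am seeing: the parameter count never changes with padding, only the output length does.

import torch
import torch.nn as nn

x = torch.randn(1, 8, 10)  # (batch, channels, sequence length)
for pad in (0, 1, 2):
    conv = nn.Conv1d(8, 4, kernel_size=2, padding=pad)
    n_params = sum(p.numel() for p in conv.parameters() if p.requires_grad)
    print(pad, n_params, conv(x).shape)
# prints 68 parameters for every padding value; only the last output dimension changes (9, 11, 13)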

I didn't check your calculations, but maybe you can have a look at my code here. There are two methods, _calc_conv_output_size and _calc_maxpool_output_size, to calculate the output sizes of the conv layer and the max-pool layers, since I want to stay flexible with the sequence lengths, kernel sizes, strides, etc. A rough sketch of what they do is below.
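They essentially just apply the output-size formula from the nn.Conv1d and nn.MaxPool1d docs. A minimal sketch, assuming default dilation and a max-pool stride that defaults to the kernel size (the exact signatures in my actual code may differ):

import math

def _calc_conv_output_size(l_in, kernel_size, stride=1, padding=0, dilation=1):
    # L_out = floor((L_in + 2*padding - dilation*(kernel_size - 1) - 1) / stride + 1)
    return math.floor((l_in + 2 * padding - dilation * (kernel_size - 1) - 1) / stride + 1)

def _calc_maxpool_output_size(l_in, kernel_size, stride=None, padding=0, dilation=1):
    # nn.MaxPool1d defaults its stride to the kernel size; the formula is otherwise the same
    stride = kernel_size if stride is None else stride
    return math.floor((l_in + 2 * padding - dilation * (kernel_size - 1) - 1) / stride + 1)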

Apart from that, nn.Conv1d(90000, 300, 2, padding=0) looks a bit surprising to me. From my understanding, the setup is as follows:

nn.Conv1d(in_channels=embedding_dim,
          out_channels=out_channels,
          kernel_size=conv_kernel_size,
          stride=self.common_conv_stride,
          padding=self.common_conv_padding)

So your embedding_dim is 90000? The whole purpose of word embeddings is to get a lower-dimensional representation of words. Memory might also be a problem here.
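Just on the memory point, a rough back-of-the-envelope estimate (assuming float32 weights and ignoring gradients and optimizer state):

num_weights = 90000 * 300 * 2       # in_channels * out_channels * kernel_size = 54,000,000
print(num_weights * 4 / 1024**2)    # roughly 206 MiB for the weight tensor alone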

Uh, wait a minute. The number of parameters of a layer is not the same as the number of outputs, i.e., the output dimension. For example, there are bias parameters in the layer that are not reflected in the output dimension. You're comparing two different numbers here.
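If I remember the weight shape correctly ((out_channels, in_channels, kernel_size) for the default groups=1), the parameter count is in_channels * out_channels * kernel_size weights plus out_channels biases, and padding never enters it:

in_channels, out_channels, kernel_size = 90000, 300, 2
weights = in_channels * out_channels * kernel_size   # 54,000,000
biases = out_channels                                # 300
print(weights + biases)                              # 54,000,300, exactly the number you observed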