CNN + padding - what do you do about padding with a CNN for text classification?


I have a question about padding and its effect on a CNN text classification model. Say I have a sentence with 4 words, but because I want all tensors in a batch to be the same size, I pad it. The unpadded and padded sequences might look like [1,2,3,4] and [1,2,3,4,0,0,0]; each int represents a token (say, a word), and I’ve padded with 3 zeros so every tensor in the batch has length 7. Suppose the vocabulary has size 100 and each word vector has dimension 5, and I have 1 filter with a kernel of size 2, followed by pooling across the whole (sentence length) dimension. It seems that if I don’t remove the padded positions, there will be extra junk at the end of the conv output, and this might throw off max pooling or mean pooling. Should you remove the padded positions in such a model? Basically, m != m_padded in general in the code below. Is this something to be concerned about? I know that for RNNs there are pack/pad sequence utilities, so my question is the analogue for a CNN. Technically speaking, I am introducing padding without wanting to …

Thank you!

import torch
import torch.nn as nn

e = nn.Embedding(100, 5, padding_idx=0)

# 4 x 5 matrix
x = e(torch.tensor([1, 2, 3, 4]))

# 7 x 5 matrix
x_padded = e(torch.tensor([1, 2, 3, 4, 0, 0, 0]))

# The filter: in_channels=5, out_channels=1, kernel_size=2.
f = nn.Conv1d(5, 1, 2)

# 1 x 3 matrix (Conv1d expects channels first, hence the transpose)
z = f(x.t())

# 1 x 6 matrix
z_padded = f(x_padded.t())

# 1 x 1
m = nn.AvgPool1d(3)(z)

# 1 x 1
m_padded = nn.AvgPool1d(6)(z_padded)
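For what it’s worth, one way to neutralize the padded positions before pooling (a sketch, not from the thread; the variable names `true_len` and `valid` are my own): compute how many conv outputs come purely from real tokens, then mask by position, using -inf for max pooling and a length-aware slice for mean pooling.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
e = nn.Embedding(100, 5, padding_idx=0)
f = nn.Conv1d(5, 1, 2)

ids = torch.tensor([1, 2, 3, 4, 0, 0, 0])
with torch.no_grad():
    z_padded = f(e(ids).t())       # shape (1, 6)

# A kernel of size 2 over a true length of 4 yields 3 outputs
# that touch only real tokens; the rest involve padding.
true_len = int((ids != 0).sum())   # 4
valid = true_len - 2 + 1           # 3 valid conv outputs

# Masked max pooling: set pad-contaminated positions to -inf first.
masked = z_padded.clone()
masked[:, valid:] = float("-inf")
m_max = masked.max(dim=1).values

# Masked mean pooling: average only over the valid positions.
m_mean = z_padded[:, :valid].mean(dim=1)
```

Masking by position (rather than by value) matters because the conv bias makes padded-position outputs nonzero even with padding_idx=0.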

In my experience, it’s okay to pad when using a CNN. Remember to pad to the maximum sentence length in the batch, not a global maximum. Assuming you use a dense layer as the final layer of the model, things should work fine.
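Padding to the per-batch maximum can be done with torch.nn.utils.rnn.pad_sequence (despite living under rnn, it is just a batching utility; the token ids below are made up):

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Hypothetical token-id sequences of different lengths.
seqs = [torch.tensor([1, 2, 3, 4]),
        torch.tensor([5, 6]),
        torch.tensor([7, 8, 9])]

# Pads every sequence to the longest in this batch (length 4),
# using 0 - the same index passed as padding_idx to nn.Embedding.
batch = pad_sequence(seqs, batch_first=True, padding_value=0)
# batch.shape == (3, 4)
```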

Thank you! So you are saying that the performance is about the same because this unwanted effect gets washed out. But technically there is an effect, right? For example, if a batch has a few long sentences and many more short sentences, and you do average pooling at the end, the final values of the short sentences (arrived at by averaging) will get dragged down quite a bit by the 0’s, no? So shouldn’t we remove the 0’s?
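To make the drag-down concrete with toy numbers (the values are hypothetical; real padded conv outputs are only near zero when padding_idx zeroes the embeddings and the bias is small):

```python
import torch

# Toy conv outputs for a short sentence padded inside a longer batch:
# 3 real values followed by 3 zeros from padded positions.
z = torch.tensor([[2.0, 4.0, 3.0, 0.0, 0.0, 0.0]])

naive_mean = z.mean(dim=1)          # averages all 6 positions -> 1.5
masked_mean = z[:, :3].mean(dim=1)  # averages real positions only -> 3.0
```

The naive average is cut in half here, which is exactly the effect described above; max pooling is less sensitive as long as the padded outputs stay below the real maximum.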

You’re welcome. The training phase should take care of those minor variations. The performance with a bigger batch is going to be better than with a batch size of 1.