I am thinking about doing a 1D convolution over the token embeddings matrix (over the time dimension) and then just max pooling it. After max pooling, sequences of any length will become a vector of fixed length. So I want both my learning and inference steps to be able to take sequences of any length.
Is it possible to do this in PyTorch? I mean, I know it is possible, but I am trying to figure out the best way to do it. Has somebody tried it yet?
I’m unsure if you are asking about the pooling operation creating outputs of a fixed size, but if so, you could use
nn.AdaptiveMaxPool2d and define the desired output size.
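For the 1D-conv-over-time case described above, the 1D variant nn.AdaptiveMaxPool1d might be the more direct fit. A minimal sketch (the embedding and filter sizes are made up for illustration) showing that the output shape stays fixed regardless of sequence length:

```python
import torch
import torch.nn as nn

# Illustrative sizes, not from the original post
embed_dim, num_filters = 128, 64

conv = nn.Conv1d(embed_dim, num_filters, kernel_size=3, padding=1)
pool = nn.AdaptiveMaxPool1d(1)  # adaptive pooling to a fixed output length of 1

for seq_len in (10, 57):  # two batches with different sequence lengths
    x = torch.randn(4, embed_dim, seq_len)  # (batch, channels, time)
    out = pool(conv(x)).squeeze(-1)         # (batch, num_filters), same for both
    print(out.shape)
```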
Thank you, this will definitely help! Should I just use Python built-ins like lists/tuples to represent a batch of sentences of varied length? Will it slow the training down significantly?
Different sample sizes in a single batch won’t work out of the box, and the common approach would be to either pad them to the longest sequence in the batch or to use the experimental nested tensors.
Note that the latter is used in some internal transformer layers, but I don’t know how well other modules are supported. @vdw also provided an approach here where samples with the same length are sampled into a single batch to avoid padding in case that’s interesting for your use case.
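For the padding approach, a small sketch using torch.nn.utils.rnn.pad_sequence to pad a list of variable-length sequences to the longest one in the batch (the lengths and embedding size here are made up):

```python
import torch
from torch.nn.utils.rnn import pad_sequence

embed_dim = 8  # illustrative size

# Three "sentences" of different lengths, each of shape (seq_len, embed_dim)
seqs = [torch.randn(n, embed_dim) for n in (5, 3, 7)]

# Pad with zeros to the longest sequence -> (batch, max_len, embed_dim)
batch = pad_sequence(seqs, batch_first=True)
print(batch.shape)  # torch.Size([3, 7, 8])
```

This is typically done inside a DataLoader's collate_fn so each batch is padded only to its own longest sequence.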
As @ptrblck said, within a batch you need to ensure that all sequences have the same length, either through padding or through some smarter organization (cf. linked post).
This is not correct for standard max pooling, since max pooling basically also just slides a window over your vector and aggregates the values within that window. The longer your sequence, the more sliding, the more output values. See this slide:
You might be referring to 1-max pooling, where the window size is the same as the sequence length:
In this case, yes, different batches with different sequence lengths would result in outputs of the same size, and you wouldn’t need to worry about the dimensions of subsequent layers. For example, when you search on Google for text classification with CNNs, you will very likely stumble upon this architecture:
Note that this architecture indeed uses 1-max pooling, which is not the same as standard max pooling.
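To illustrate the difference in code, a small sketch (with made-up sizes) where 1-max pooling takes the max over the entire time axis, so the output size is independent of the sequence length:

```python
import torch
import torch.nn as nn

embed_dim, num_filters = 16, 32  # illustrative sizes
conv = nn.Conv1d(embed_dim, num_filters, kernel_size=3)

for seq_len in (20, 100):
    x = torch.randn(2, embed_dim, seq_len)  # (batch, channels, time)
    feats = conv(x)                         # (2, num_filters, seq_len - 2)
    pooled, _ = feats.max(dim=2)            # 1-max pool over the whole time axis
    print(pooled.shape)                     # (2, num_filters) for both lengths
```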
I hope this helps.
Thank you, my friend! Yes, I do 1-max pooling, but without standardizing the input sequence length it’s still a bit problematic to implement. The easy choice is to standardize lengths using padding, so I decided to stick with it for now instead of using something like NestedTensor.
Another approach you can try is to group your data by sequence length and pad accordingly, then draw batches from each subgroup. For example:
- sequence sizes 1 to 32 would be padded to 32 and grouped
- sequence sizes 33 to 64 padded to 64 and grouped
The above would allow you to still perform down-sampling or pooling operations before your final average pooling.
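A minimal sketch of this bucketing idea; the helper name and the bucket size of 32 are just for illustration:

```python
import torch
import torch.nn.functional as F

def bucket_and_pad(seqs, bucket_size=32):
    """Group (seq_len, embed_dim) tensors by length bucket and pad each
    sequence up to its bucket boundary (hypothetical helper)."""
    buckets = {}
    for s in seqs:
        # e.g. lengths 1-32 -> boundary 32, lengths 33-64 -> boundary 64
        bound = ((s.size(0) - 1) // bucket_size + 1) * bucket_size
        # zero-pad the time dimension up to the bucket boundary
        padded = F.pad(s, (0, 0, 0, bound - s.size(0)))
        buckets.setdefault(bound, []).append(padded)
    # stack each group into a (batch, bound, embed_dim) tensor
    return {b: torch.stack(group) for b, group in buckets.items()}

seqs = [torch.randn(n, 4) for n in (5, 30, 40)]
batches = bucket_and_pad(seqs)
print(batches[32].shape)  # torch.Size([2, 32, 4])
print(batches[64].shape)  # torch.Size([1, 64, 4])
```

Since every bucket boundary is a multiple of 32, intermediate pooling or strided down-sampling layers divide the length evenly before the final pooling step.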