I am thinking about doing a 1D convolution over the token embeddings matrix (over the time dimension) and then just max pooling it. After max pooling, sequences of any length will become a vector of fixed length. So I want both my learning and inference steps to be able to take sequences of any length.
Is it possible to do this in PyTorch? I mean, I know it is possible, but I am trying to figure out the best way to do it. Has somebody tried it yet?
I’m unsure if you are asking about the pooling operation creating outputs of a fixed size, but if so, you could use
nn.AdaptiveMaxPool2d and define the desired output size.
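For the 1D-conv-over-time case described above, the 1D variant nn.AdaptiveMaxPool1d might be the more direct fit. A minimal sketch (the embedding and filter sizes are made up for illustration) showing that the output shape stays fixed regardless of sequence length:

```python
import torch
import torch.nn as nn

# Illustrative sizes, not from the original post
embed_dim, num_filters = 128, 64

conv = nn.Conv1d(embed_dim, num_filters, kernel_size=3, padding=1)
pool = nn.AdaptiveMaxPool1d(1)  # adaptive pooling to a fixed output length of 1

for seq_len in (10, 57):  # two batches with different sequence lengths
    x = torch.randn(4, embed_dim, seq_len)  # (batch, channels, time)
    out = pool(conv(x)).squeeze(-1)         # (batch, num_filters), same for both
    print(out.shape)
```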
Thank you, this will definitely help! Should I just use Python built-ins like lists/tuples to represent a batch of sentences of varied length? Will it slow the training down significantly?
Different sample sizes in a single batch won’t work out of the box, and the common approach would be to either pad them to the longest sequence in the batch or to use the experimental nested tensors.
Note that the latter is used in some internal transformer layers, but I don’t know how well other modules are supported. @vdw also provided an approach here where samples with the same length are sampled into a single batch to avoid padding in case that’s interesting for your use case.
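For the padding approach, a small sketch using torch.nn.utils.rnn.pad_sequence to pad a list of variable-length sequences to the longest one in the batch (the lengths and embedding size here are made up):

```python
import torch
from torch.nn.utils.rnn import pad_sequence

embed_dim = 8  # illustrative size

# Three "sentences" of different lengths, each of shape (seq_len, embed_dim)
seqs = [torch.randn(n, embed_dim) for n in (5, 3, 7)]

# Pad with zeros to the longest sequence -> (batch, max_len, embed_dim)
batch = pad_sequence(seqs, batch_first=True)
print(batch.shape)  # torch.Size([3, 7, 8])
```

This is typically done inside a DataLoader's collate_fn so each batch is padded only to its own longest sequence.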
As @ptrblck said, within a batch you need to ensure that all sequences have the same length, either through padding or through some smarter organization (cf. linked post).
This is not correct for standard max pooling, since max pooling basically also just slides a window over your vector and aggregates the values within that window. The longer your sequence, the more sliding, the more output values. See this slide:
You might be referring to 1-max pooling, where the window size is the same as the sequence length:
In this case, yes, different batches with different sequence lengths would result in outputs of the same size, and you wouldn’t need to worry about the dimensions of subsequent layers. For example, when you search on Google for text classification with CNNs, you will very likely stumble upon this architecture:
Note that this architecture indeed uses 1-max pooling, which is not the same as standard max pooling.
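To illustrate the difference in code, a small sketch (with made-up sizes) where 1-max pooling takes the max over the entire time axis, so the output size is independent of the sequence length:

```python
import torch
import torch.nn as nn

embed_dim, num_filters = 16, 32  # illustrative sizes
conv = nn.Conv1d(embed_dim, num_filters, kernel_size=3)

for seq_len in (20, 100):
    x = torch.randn(2, embed_dim, seq_len)  # (batch, channels, time)
    feats = conv(x)                         # (2, num_filters, seq_len - 2)
    pooled, _ = feats.max(dim=2)            # 1-max pool over the whole time axis
    print(pooled.shape)                     # (2, num_filters) for both lengths
```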
I hope this helps.
Thank you, my friend! Yes, I do 1-max pooling, but without standardizing the input sequence length it’s still a bit problematic to implement. The easy choice is to standardize lengths using padding, so I decided to stick with it for now instead of using something like NestedTensor.
Another approach you can try is to group your data by sequence length and pad accordingly, then draw batches from each subgroup. For example:
- sequence sizes 1 to 32 would be padded to 32 and grouped
- sequence sizes 33 to 64 padded to 64 and grouped
The above would allow you to still perform down-sampling or pooling operations before your final average pooling.
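A minimal sketch of this bucketing idea; the helper name and the bucket size of 32 are just for illustration:

```python
import torch
import torch.nn.functional as F

def bucket_and_pad(seqs, bucket_size=32):
    """Group (seq_len, embed_dim) tensors by length bucket and pad each
    sequence up to its bucket boundary (hypothetical helper)."""
    buckets = {}
    for s in seqs:
        # e.g. lengths 1-32 -> boundary 32, lengths 33-64 -> boundary 64
        bound = ((s.size(0) - 1) // bucket_size + 1) * bucket_size
        # zero-pad the time dimension up to the bucket boundary
        padded = F.pad(s, (0, 0, 0, bound - s.size(0)))
        buckets.setdefault(bound, []).append(padded)
    # stack each group into a (batch, bound, embed_dim) tensor
    return {b: torch.stack(group) for b, group in buckets.items()}

seqs = [torch.randn(n, 4) for n in (5, 30, 40)]
batches = bucket_and_pad(seqs)
print(batches[32].shape)  # torch.Size([2, 32, 4])
print(batches[64].shape)  # torch.Size([1, 64, 4])
```

Since every bucket boundary is a multiple of 32, intermediate pooling or strided down-sampling layers divide the length evenly before the final pooling step.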