Handle variable sized sentences for text classification with CNN(+RNN)

Hi all,

I’m trying to implement a text classifier using Conv1d. My dilemma is: how do we construct the linear layer to consume the output from the final CNN/MaxPool layers?

I can pad all the sentences in a batch to the length of the longest sentence in that batch and size the linear layer’s input accordingly, but if the next batch has a different seq_len, I’ll still get an error.

E.g.:

batch_size = 2

Batch 1:
[1, 2, 3, 4, 5]
[1, 2, 3, 0, 0]

Batch 2:
[1, 2, 3, 4, 5, 6, 7, 8]
[1, 2, 3, 4, 5, 0, 0, 0]

# The CNN/MaxPool output for both these batches would be different.
# As a result, when I flatten this out to feed into the linear layer,
# the number of neurons in the linear layer would have to be different.

The only idea I have right now is to pad all the sentences in my dataset to the same length (e.g. the mean/median of the sentence lengths), padding or truncating each sentence to meet that length, as in the sketch below.
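For illustration, this is roughly what I mean (pad_or_truncate and max_len = 4 are made up):

import torch

# made-up helper: force every sentence to a fixed max_len (e.g. the median length)
def pad_or_truncate(token_ids, max_len, pad_id=0):
    if len(token_ids) >= max_len:
        return token_ids[:max_len]                             # drop extra words
    return token_ids + [pad_id] * (max_len - len(token_ids))  # pad with zeros

sentences = [[1, 2, 3, 4, 5], [1, 2, 3]]
batch = torch.tensor([pad_or_truncate(s, max_len=4) for s in sentences])
# tensor([[1, 2, 3, 4],
#         [1, 2, 3, 0]])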

Is there any other way to get around this?

Edit:
How would we go about implementing:

  • CNN -> Linear
  • CNN -> RNN -> Linear

The typical thing is to do a reduction over the time axis, like max or mean.
fast.ai (with the reference being Howard and Ruder’s ULMFiT paper) advocates using the max, the mean and the last element for RNNs. For CNNs the last element is hard to define / not that meaningful, so you would just concatenate max and mean.
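For example, a minimal sketch (the sizes here are made up) of that concat pooling on a Conv1d output:

import torch

o = torch.randn(2, 100, 7)                             # Conv1d output: (batch, channels, seq_len)
feats = torch.cat([o.max(-1).values, o.mean(-1)], -1)  # (2, 200), independent of seq_len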

Best regards

Thomas

Please correct me if I’ve understood this wrong.

Say the penultimate layer, after multiple Conv1d and MaxPool operations, is of size (10, 100, 5), where:

batch_size = 10
filters(out_channel) = 100
seq_len = 5

CNN -> Linear
So, I will take the max (or mean) over the last dimension, i.e. the one with 5 elements, so the shape would now be (10, 100). This (10, 100) tensor would then be flattened into a (1, 1000) tensor and passed into the linear layer to obtain the output, i.e. a single class.

CNN -> RNN -> Linear
How would I go about doing something similar for CNN->RNN->Linear?

CNN -> Linear:

No. The batch dimension should never be flattened.
You keep the (10, 100) and feed it into a (100, 1) linear (or whatever) to get a (10, 1) - one prediction per batch item.
You could concatenate the (10, 100) max with the mean (over the seq_len = 5) to get a (10, 200), as in the sketch below.
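A minimal sketch of the whole CNN -> Linear path, assuming an embedding layer in front; vocab_size, emb_dim and kernel_size are made-up illustration values, not anything prescribed above:

import torch
import torch.nn as nn

class CNNClassifier(nn.Module):
    def __init__(self, vocab_size=1000, emb_dim=50, channels=100):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.conv = nn.Conv1d(emb_dim, channels, kernel_size=3, padding=1)
        self.fc = nn.Linear(2 * channels, 1)    # max + mean concatenated

    def forward(self, x):                       # x: (batch, seq_len) token ids
        e = self.emb(x).permute(0, 2, 1)        # (batch, emb_dim, seq_len)
        o = torch.relu(self.conv(e))            # (batch, channels, seq_len)
        pooled = torch.cat([o.max(-1).values, o.mean(-1)], -1)  # (batch, 2 * channels)
        return self.fc(pooled)                  # (batch, 1), one prediction per item

model = CNNClassifier()
model(torch.randint(1, 1000, (10, 5))).shape    # torch.Size([10, 1])
model(torch.randint(1, 1000, (10, 8))).shape    # torch.Size([10, 1]) - seq_len-independent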

CNN -> RNN -> Linear

Here you would permute the axes and feed the (5, 10, 100) tensor into the RNN (say 100 -> 100) to get an output o of the same shape.
Then you could take torch.cat([o.max(0).values, o.mean(0), o[-1]], -1), of shape (10, 300), and feed it to a linear layer as above.
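A minimal sketch of that step, assuming a GRU (the RNN type and the sizes are my choices for illustration):

import torch
import torch.nn as nn

o_cnn = torch.randn(10, 100, 5)                # CNN output: (batch, channels, seq_len)
rnn = nn.GRU(input_size=100, hidden_size=100)  # expects (seq_len, batch, features)
o, _ = rnn(o_cnn.permute(2, 0, 1))             # o: (5, 10, 100)
feats = torch.cat([o.max(0).values, o.mean(0), o[-1]], -1)  # (10, 300)
out = nn.Linear(300, 1)(feats)                 # (10, 1), one prediction per item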

Best regards

Thomas
