Best practice for masking variable-sized inputs to convolutions

I’ve got a sequence-to-sequence model that includes a convolutional stack in front of some recurrent layers. In my experimental code, I’m carrying around a lengths tensor in addition to the data, downsampling the lengths after each layer before building and applying the mask, and packing/unpacking the sequences whenever they are handed to a recurrent layer. This is awfully inconvenient, and while it’s still somewhat faster than processing single samples, it’s not as fast as I was hoping.
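Concretely, here’s a minimal sketch of the pattern I mean (the module, layer sizes, and helper names are made up for illustration; the length formula is the standard `Conv1d` output-size formula):

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

def conv_out_lengths(lengths, kernel_size, stride=1, padding=0, dilation=1):
    # Standard Conv1d output length: floor((L + 2p - d*(k-1) - 1) / s) + 1
    return torch.div(lengths + 2 * padding - dilation * (kernel_size - 1) - 1,
                     stride, rounding_mode="floor") + 1

class MaskedConvStack(nn.Module):
    # Hypothetical module illustrating the lengths-tracking pattern.
    def __init__(self, channels=64, hidden=128):
        super().__init__()
        self.conv1 = nn.Conv1d(channels, channels, kernel_size=3, stride=2, padding=1)
        self.conv2 = nn.Conv1d(channels, channels, kernel_size=3, stride=2, padding=1)
        self.rnn = nn.GRU(channels, hidden, batch_first=True)

    def forward(self, x, lengths):
        # x: (batch, channels, time); lengths: (batch,) integer tensor
        for conv in (self.conv1, self.conv2):
            x = conv(x)
            # Downsample the lengths to match the conv's output resolution
            lengths = conv_out_lengths(lengths, conv.kernel_size[0],
                                       conv.stride[0], conv.padding[0],
                                       conv.dilation[0])
            # Zero out positions past each sequence's (downsampled) length
            mask = (torch.arange(x.size(2), device=x.device)[None, :]
                    < lengths[:, None].to(x.device))
            x = x * mask.unsqueeze(1)
        # Pack before the recurrent layer, unpack afterwards
        packed = pack_padded_sequence(x.transpose(1, 2), lengths.cpu(),
                                      batch_first=True, enforce_sorted=False)
        out, _ = self.rnn(packed)
        out, lengths = pad_packed_sequence(out, batch_first=True)
        return out, lengths
```

So every conv layer drags the lengths tensor along with it, and every boundary to an RNN needs a pack/unpack round trip.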

Are there more elegant ways to deal with variable-sized inputs in CNNs?