Variable size input for 2D to 1D output

I am trying to implement a network that takes an L x L input, where L is variable, and outputs a 1D sequence of length L. I know that fully convolutional networks can handle variable-size input. In my network, the L x L input goes through several convolutional layers and is then fed to an attention layer, which should output the length-L sequence. However, I am not sure how an attention mechanism can handle a variable-size input. One solution I considered is padding all inputs to one fixed size, but that seems problematic because L varies a lot, which might introduce noise and encourage overfitting.
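To make the shapes concrete, here is a toy numpy sketch of plain dot-product self-attention (hypothetical dimensions, not my actual network). As far as I can tell, the learned parameters are all d x d, so L only appears in the activations, and the same weights apply for any L. Is this the right way to think about it?

```python
import numpy as np

def self_attention(x, Wq, Wk, Wv):
    # x: (L, d) -- L can vary; the learned matrices depend only on d
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(k.shape[-1])           # (L, L)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ v                                # (L, d): one row per position

rng = np.random.default_rng(0)
d = 8
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
for L in (5, 12, 30):
    out = self_attention(rng.standard_normal((L, d)), Wq, Wk, Wv)
    print(out.shape)
```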

Another solution I considered is forming mini-batches and padding each batch to its own maximum length; however, the network will still see inputs with varying L across different batches.
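Concretely, by per-batch padding I mean something like the following sketch (hypothetical helper, numpy), where each mini-batch is padded to its own max L and a boolean mask marks the real positions:

```python
import numpy as np

def pad_batch(inputs, pad_value=0.0):
    # inputs: list of (L_i, L_i) arrays with different L_i
    L_max = max(x.shape[0] for x in inputs)
    batch = np.full((len(inputs), L_max, L_max), pad_value)
    mask = np.zeros((len(inputs), L_max), dtype=bool)
    for i, x in enumerate(inputs):
        L = x.shape[0]
        batch[i, :L, :L] = x
        mask[i, :L] = True  # True marks real (non-padded) positions
    return batch, mask

rng = np.random.default_rng(1)
batch, mask = pad_batch([rng.standard_normal((L, L)) for L in (4, 7, 6)])
print(batch.shape)        # batch padded to the largest L in this batch
print(mask.sum(axis=1))   # original lengths recoverable from the mask
```

My worry is that even with such a mask, L_max still differs from one batch to the next.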

What is the most elegant way to handle variable-size inputs in this case?