Working with variable input sizes and increasing the batch_size, best approach?

I'm working with a transformer network at the moment, so my inputs have varying sizes. So far I have just been training with a batch_size of 1, but I want to increase it.

I see two possible ways of doing this, and I was wondering what the pros/cons of each method are, which one is best, and whether there are any alternatives I have missed.

Method 1)
I increase the batch_size in the DataLoader and then apply some kind of padding transform so that all inputs in a batch have the same size.
This will likely be relatively fast, since multiple inputs can go through the network at once, but I will probably have to include a padding mask in my transformer so it doesn't learn anything from the padding, and I'm not sure how well that will work (roughly what I mean is sketched below).
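For reference, here is a minimal sketch of what I have in mind: a custom collate_fn that pads each batch to its longest sequence and builds a boolean padding mask. The dataset layout, tensor shapes, and the use of src_key_padding_mask with nn.TransformerEncoder are my assumptions, not something settled.

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

def pad_collate(batch):
    # Assumes each dataset item is a (seq_len, feature_dim) tensor with varying seq_len.
    lengths = torch.tensor([x.size(0) for x in batch])
    # Pad every sequence up to the longest one in the batch -> (B, T, F).
    padded = pad_sequence(batch, batch_first=True, padding_value=0.0)
    # Boolean mask of shape (B, T); True marks padded positions, which is what
    # nn.TransformerEncoder expects for src_key_padding_mask (masked positions are ignored).
    pad_mask = torch.arange(padded.size(1))[None, :] >= lengths[:, None]
    return padded, pad_mask

# loader = DataLoader(my_dataset, batch_size=8, collate_fn=pad_collate)
# for padded, pad_mask in loader:
#     out = encoder(padded, src_key_padding_mask=pad_mask)
```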

Method 2)
I keep the batch_size at 1, but I accumulate the loss (i.e. gradient accumulation) from several inputs before doing backprop and an optimizer step. That way I won't have to do any padding/masking. However, I suspect it will be slower, and there might be other pitfalls to this approach that I haven't considered.
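In code I imagine it would look roughly like the loop below. The names (model, criterion, optimizer, loader) are placeholders, and accum_steps is just the "effective" batch size I would be simulating.

```python
import torch

# Assumed to exist already: model, criterion, optimizer, and a DataLoader
# `loader` with batch_size=1 (all placeholder names, not from an actual setup).
accum_steps = 8  # hypothetical effective batch size

optimizer.zero_grad()
for step, (x, y) in enumerate(loader):
    # Scale the loss so the accumulated gradient is an average over accum_steps samples.
    loss = criterion(model(x), y) / accum_steps
    loss.backward()  # gradients accumulate in param.grad across iterations
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```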

Does anyone have any experience with this, or good suggestions?