Different sample sizes in a single batch won't work out of the box. The common approach would be to either pad the samples to the longest sequence in the batch or to use the experimental NestedTensor support.
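For the padding route, a minimal sketch could use a custom `collate_fn` with `torch.nn.utils.rnn.pad_sequence` (the toy `samples` list and `pad_collate` name are just assumptions for illustration):

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

# Toy dataset of 1-D tensors with varying lengths (stand-in for real data).
samples = [torch.randn(n) for n in (5, 8, 3, 8, 5, 2)]

def pad_collate(batch):
    # Pad every sample to the longest sequence in this batch and keep the
    # original lengths so the padded positions can be masked out later.
    lengths = torch.tensor([x.size(0) for x in batch])
    padded = pad_sequence(batch, batch_first=True, padding_value=0.0)
    return padded, lengths

loader = DataLoader(samples, batch_size=3, collate_fn=pad_collate)
for padded, lengths in loader:
    print(padded.shape, lengths)

# The NestedTensor alternative keeps the samples unpadded (experimental API,
# subject to change):
nt = torch.nested.nested_tensor(samples[:3])
print(nt.is_nested)  # True
```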
Note that the latter is used in some internal transformer layers, but I don't know how well other modules support it. @vdw also provided an approach here where samples of the same length are grouped into a single batch to avoid padding, in case that's interesting for your use case.
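I don't know the details of that implementation, but the general idea could be sketched like this (`length_bucketed_batches` and the `lengths` list are hypothetical names, not @vdw's actual code):

```python
import random
from collections import defaultdict
import torch
from torch.utils.data import DataLoader

# Toy dataset of 1-D tensors with varying lengths.
samples = [torch.randn(n) for n in (5, 8, 3, 8, 5, 5, 3, 8)]
lengths = [x.size(0) for x in samples]

def length_bucketed_batches(lengths, batch_size, shuffle=True):
    # Group sample indices by their length, then chunk each group into
    # batches, so every batch contains only equal-length samples.
    buckets = defaultdict(list)
    for idx, length in enumerate(lengths):
        buckets[length].append(idx)
    batches = []
    for idxs in buckets.values():
        if shuffle:
            random.shuffle(idxs)
        batches.extend(idxs[i:i + batch_size]
                       for i in range(0, len(idxs), batch_size))
    if shuffle:
        random.shuffle(batches)  # shuffle batch order across buckets
    return batches

loader = DataLoader(samples,
                    batch_sampler=length_bucketed_batches(lengths, batch_size=2))
for batch in loader:
    # The default collate_fn can stack the samples since they share a length.
    print(batch.shape)
```

Since every batch holds equal-length samples, no padding or masking is needed downstream; the trade-off is that the last batch per bucket may be smaller and batch composition is constrained by the length distribution.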