Dataset of uneven lengths

vdw · May 14, 2025, 1:21am

At the end of the day, your tensors must be “full”, i.e., rectangular: a single tensor cannot hold arrays of different lengths. Even if you write your own Dataset class that returns a list of variable-length tensors, you still need to convert them into a single full tensor before passing them to a network.
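As a rough sketch of that conversion step: one option is a custom `collate_fn` that zero-pads each batch to its longest sequence using `torch.nn.utils.rnn.pad_sequence`. The `ToySequences` dataset and the `pad_collate` name are made up for illustration; your own Dataset would take their place.

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader, Dataset


class ToySequences(Dataset):
    """Hypothetical dataset whose items are 1-D tensors of different lengths."""

    def __init__(self):
        self.data = [
            torch.tensor([1, 2, 3]),
            torch.tensor([4, 5]),
            torch.tensor([6]),
        ]

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx]


def pad_collate(batch):
    """Turn a list of variable-length tensors into one rectangular tensor."""
    # Keep the true lengths so the padding can be ignored later.
    lengths = torch.tensor([len(seq) for seq in batch])
    # Zero-pad every sequence up to the longest one in this batch.
    padded = pad_sequence(batch, batch_first=True, padding_value=0)
    return padded, lengths


loader = DataLoader(ToySequences(), batch_size=3, collate_fn=pad_collate)
padded, lengths = next(iter(loader))
print(padded.shape)  # (batch, longest sequence in the batch)
print(lengths)
```

Returning the lengths alongside the padded tensor is a design choice, not a requirement: it is what lets you mask or pack the padding afterwards.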

Padding with zeros is common practice, and you can wrap the padded batch in a PackedSequence (e.g., via torch.nn.utils.rnn.pack_padded_sequence) so that a recurrent network ignores the padding. Alternatively, you can write your own Sampler to organize your dataset such that each batch only contains sequences of the same length; see here. I also have a more elaborate Jupyter notebook that goes through this, which may be useful.
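Both options can be sketched roughly as follows. The first part packs a zero-padded batch with `pack_padded_sequence` and runs it through an `nn.LSTM`; the toy values and layer sizes are arbitrary. The second part is a simplified stand-in for a custom Sampler: `same_length_batches` is a hypothetical helper (not a PyTorch API) that just returns lists of indices grouped by sequence length, which could be passed to `DataLoader` as `batch_sampler`.

```python
import random
from collections import defaultdict

import torch
from torch import nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

# Option 1: pad with zeros, then pack so the LSTM skips the padded steps.
padded = torch.tensor([[1.0, 2.0, 3.0],
                       [4.0, 5.0, 0.0],
                       [6.0, 0.0, 0.0]]).unsqueeze(-1)  # (batch, time, features=1)
lengths = torch.tensor([3, 2, 1])  # true lengths, kept on the CPU

packed = pack_padded_sequence(padded, lengths, batch_first=True,
                              enforce_sorted=False)
lstm = nn.LSTM(input_size=1, hidden_size=4, batch_first=True)
packed_out, (h_n, c_n) = lstm(packed)

# Unpack to a regular padded tensor; positions past each length are zeros.
out, out_lengths = pad_packed_sequence(packed_out, batch_first=True)


# Option 2: only batch together sequences that share the same length.
def same_length_batches(seq_lengths, batch_size, seed=0):
    """Hypothetical helper: group dataset indices into same-length batches."""
    buckets = defaultdict(list)
    for idx, n in enumerate(seq_lengths):
        buckets[n].append(idx)

    rng = random.Random(seed)
    batches = []
    for indices in buckets.values():
        rng.shuffle(indices)  # shuffle within a length bucket
        for i in range(0, len(indices), batch_size):
            batches.append(indices[i:i + batch_size])
    rng.shuffle(batches)  # shuffle the order of the batches
    return batches


batches = same_length_batches([3, 2, 3, 2, 3], batch_size=2)
# Assumed usage: DataLoader(dataset, batch_sampler=batches)
```

With same-length batches no padding is needed at all, at the cost of some batches being smaller than `batch_size`. Since the helper returns a fixed list, you would call it again with a different seed to reshuffle each epoch.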