Dataset of uneven lengths

I am trying to create a model for a chemistry application; however, the number of atoms in each sample will eventually differ.
Assuming N datapoints with A atoms each and F features per atom, my input so far is an (N, A) tensor of atom species, which is to be encoded with a randomly initialised, trainable encoding, resulting in an (N, A, F) tensor. I believe I can handle varying lengths by combining the first two dimensions: I create a list of trainable tensors (one per expected element) and build the result as a tensor containing those trainable tensors in the order given by the atoms. However, this essentially mixes the dimension that separates samples with the dimension within one sample.
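
For a fixed A this is straightforward; roughly what I have in mind is something like the following sketch (names such as `species` and `encoder` are just placeholders):

```python
import torch
import torch.nn as nn

N, A, F = 32, 10, 16       # samples, atoms per sample, features per atom
num_species = 100          # number of distinct elements I expect

species = torch.randint(0, num_species, (N, A))   # (N, A) tensor of atom species
encoder = nn.Embedding(num_species, F)            # randomly initialised, trainable encoding
features = encoder(species)                       # (N, A, F)
```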

Is there a better way of doing this, without creating a bunch of zero tensors to pad the samples to a common size? (as proposed in How to create a dataloader with variable-size input - #2 by smth and many other posts)

At the end of the day, your tensors must be “full”, i.e., they cannot contain arrays of different lengths. Even if you create your own Dataset class that returns a list of tensors like that, you would still need to convert them into a single full tensor before feeding it to a network.
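
For instance, assuming your Dataset returns per-sample pairs of a species tensor of shape (A_i,) and a target, a `collate_fn` along these lines (just a sketch, the names are made up) pads everything in a batch into one full tensor:

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

def collate(batch):
    # batch: list of (species_i, target_i) pairs, species_i of shape (A_i,)
    species, targets = zip(*batch)
    lengths = torch.tensor([s.shape[0] for s in species])
    # pad to (N, A_max); here index 0 is (ab)used as the padding species
    padded = pad_sequence(list(species), batch_first=True, padding_value=0)
    return padded, lengths, torch.stack(targets)

# loader = DataLoader(my_dataset, batch_size=32, collate_fn=collate)
```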

Padding with zeros is a common best practice, and you can use PackedSequence to make the network ignore the padding. Alternatively, you can write your own Sampler to organize your dataset such that each batch only contains sequences of the same length; see here. I also have a more elaborate Jupyter notebook that goes through this. Maybe useful.
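
As a rough sketch of the second option (the details depend on your data, so take it with a grain of salt):

```python
import random
from collections import defaultdict
from torch.utils.data import DataLoader

class SameLengthBatchSampler:
    """Yields index batches in which every sample has the same number of atoms."""
    def __init__(self, lengths, batch_size):
        buckets = defaultdict(list)            # length -> indices of samples with that length
        for idx, length in enumerate(lengths):
            buckets[length].append(idx)
        self.batches = [bucket[i:i + batch_size]
                        for bucket in buckets.values()
                        for i in range(0, len(bucket), batch_size)]

    def __iter__(self):
        random.shuffle(self.batches)           # reshuffle the batches every epoch
        return iter(self.batches)

    def __len__(self):
        return len(self.batches)

# lengths = [sample[0].shape[0] for sample in my_dataset]   # atoms per sample
# loader = DataLoader(my_dataset, batch_sampler=SameLengthBatchSampler(lengths, 32))
```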


Padding with zeros is certainly the easiest approach; however, due to the non-linearities and the bias added in each linear layer, the padded entries end up making non-zero contributions to the “real” data. I guess one could create custom masks that make sure these terms stay zero?
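
Something like this is what I had in mind (just a sketch):

```python
import torch
import torch.nn as nn

N, A, F = 4, 10, 16
x = torch.randn(N, A, F)                                      # padded per-atom features
n_atoms = torch.tensor([10, 7, 3, 5])                         # real number of atoms per sample
mask = torch.arange(A).expand(N, A) < n_atoms.unsqueeze(1)    # True for real atoms

linear = nn.Linear(F, F)
h = torch.relu(linear(x))       # bias + non-linearity make the padded slots non-zero
h = h * mask.unsqueeze(-1)      # force them back to zero before the next layer
```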

Sure, padding should pretty much always come with loss masking; I actually just created a notebook for that.
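
In a nutshell, the idea is something like this (a sketch with dummy data, assuming per-atom targets and the same boolean mask as above):

```python
import torch
import torch.nn as nn

N, A = 4, 10
pred, target = torch.randn(N, A), torch.randn(N, A)
mask = torch.arange(A).expand(N, A) < torch.tensor([10, 7, 3, 5]).unsqueeze(1)

per_atom = nn.MSELoss(reduction="none")(pred, target)   # (N, A), padded positions included
loss = (per_atom * mask).sum() / mask.sum()             # average over real atoms only
```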
