DataLoader can’t take tuples of tensors when the tensors have different shapes

I’m currently working with a PyTorch model that takes multiple tensor inputs. The first dimension of these tensors varies from sample to sample, while the second dimension is fixed across all samples. To feed the tensors into the model, I’m using PyTorch’s DataLoader. Unfortunately, the DataLoader can’t handle tensors of varying sizes. Digging into the problem, I discovered that the DataLoader’s default collate function uses torch.stack to batch the samples, and torch.stack doesn’t accept tensors of different shapes, which is what breaks my implementation. Adding to the challenge, my data is already very sparse, which makes padding an unsuitable solution. I’m eager to hear any suggestions or solutions to this issue.
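For context, the failure is easy to reproduce with a toy dataset (the sizes and values below are invented purely for illustration): the default collate_fn calls torch.stack on the per-sample tensors and fails as soon as their first dimensions differ.

```python
import torch
from torch.utils.data import DataLoader, Dataset

class VarLengthDataset(Dataset):
    """Toy dataset: each sample has a different first dimension, fixed second dimension."""
    def __init__(self):
        self.samples = [torch.randn(n, 4) for n in (5, 7, 8)]

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        return self.samples[idx]

loader = DataLoader(VarLengthDataset(), batch_size=2)
# next(iter(loader)) raises an error along the lines of
# "RuntimeError: stack expects each tensor to be equal size",
# because the default collate_fn calls torch.stack on the samples.
```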

This is a subset of my data:

[((5, 4130, [1190, 1690, 3690, 2190, 690], 1.3434846529918303, array([0.36630037, 0.31868132, 0.22344322, 0.06227106, 0.02930403])), array([5.0804, 1., 1., 37.3, 7., 0., 0., 0.88888889, 0.53433167, 2.68817327])),
 ((7, 4630, [4190, 1690, 2190, 1190, 690, 2690, 4630], 1.6237995404291592, array([0.26954178, 0.2425876, 0.20485175, 0.20215633, 0.04312668, 0.0296496, 0.00808625])), array([5.7157, 1., 1., 37.3, 8., 0., 0., 0.9, 0.42166914, 2.71347378])),
 ((8, 5130, [4690, 2190, 1190, 1690, 2690, 690, 3190, 5130], 1.7329349406273264, array([0.2994012, 0.19461078, 0.17964072, 0.1497006, 0.12275449, 0.03592814, 0.01197605, 0.00598802])), array([6.351, 1., 1., 37.3, 9., 0., 0., 0.90909091, 0.33192181, 2.74668001])),
 ((9, 5630, [5190, 2690, 2190, 1690, 1190, 3190, 690, 3690, 5630], 1.8170489183610905, array([0.34843206, 0.1533101, 0.14982578, 0.12891986, 0.09059233, 0.07665505, 0.03484321, 0.01045296, 0.00696864])), array([6.9863, 1., 1., 37.3, 10., 0., 0., 0.91666667, 0.26990547, 2.78667036])),
 ((9, 6130, [5690, 1190, 1690, 2690, 2190, 3190, 3690, 690, 6130], 1.8265636837655168, array([0.35335689, 0.19081272, 0.1130742, 0.1024735, 0.09540636, 0.08127208, 0.03533569, 0.02120141, 0.00706714])), array([7.6216, 1., 1., 37.3, 11., 0., 0., 0.92307692, 0.23112829, 2.83256212])),
 ((10, 6629, [6190, 1190, 1690, 2190, 2690, 3190, 3690, 4190, 690, 6629], 1.8142003052692015, array([0.41841004, 0.14644351, 0.11297071, 0.08368201, 0.07531381, 0.06276151, 0.05439331, 0.0251046, 0.0125523, 0.0083682])), array([8.2569, 1., 1., 37.3, 12., 0., 0., 0.92857143, 0.20705395, 2.88364707]))]
My initial solution was to write an indexing function that groups samples with the same tensor size into a list. However, the DataLoader throws an error when it is fed tuples, even though this approach worked when the inputs were plain tensors.
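To make that grouping idea concrete, here is a simplified sketch of how it could be expressed as a custom batch sampler, assuming each sample is a (variable-length tensor, fixed-size tensor) pair and the first-dimension lengths are known up front; the name SizeBucketSampler is made up for illustration, not a library class:

```python
from collections import defaultdict
from torch.utils.data import DataLoader, Sampler

class SizeBucketSampler(Sampler):
    """Yields batches of indices whose variable-length tensors all share the
    same first-dimension length, so stacking inside a batch is safe."""
    def __init__(self, lengths, batch_size):
        self.batch_size = batch_size
        buckets = defaultdict(list)
        for idx, length in enumerate(lengths):
            buckets[length].append(idx)
        # Split each equal-length bucket into batches of at most batch_size.
        self.batches = [
            idxs[i:i + batch_size]
            for idxs in buckets.values()
            for i in range(0, len(idxs), batch_size)
        ]

    def __iter__(self):
        return iter(self.batches)

    def __len__(self):
        return len(self.batches)

# Usage (assuming `dataset` exists and lengths[i] is the first-dimension
# size of sample i's variable-length tensor):
# loader = DataLoader(dataset, batch_sampler=SizeBucketSampler(lengths, batch_size=4))
```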

Collecting samples of the same or similar size into a batch might be a good idea, and @vdw shared some ideas and experiments here.
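If grouping by size turns out to be awkward, another option I’ve seen suggested is a custom collate_fn that skips stacking for the variable-length part entirely. The sketch below assumes the dataset yields (variable-length tensor, fixed-size tensor) pairs and that the model can consume a plain Python list of tensors for the variable part:

```python
import torch
from torch.utils.data import DataLoader

def collate_keep_lists(batch):
    # batch: list of (variable_length_tensor, fixed_size_tensor) samples
    variable, fixed = zip(*batch)
    # Keep the variable-length tensors as a list; only stack the fixed-size part.
    return list(variable), torch.stack(fixed)

# Usage (assuming `dataset` exists):
# loader = DataLoader(dataset, batch_size=4, collate_fn=collate_keep_lists)
```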