Representing heterogeneous data in memory

Suppose I have a training set of sequences of variable size, i.e.

D = { (x^i, y^i) | i \in {1, ..., N} }
where
x^i = { x^{i,j} | j \in {1, ..., N_i} }  and  y^i \in \R^{N_i}
where
x^{i,j} \in \R^{N_{i,j}}

for some sizes N_i and N_{i,j}.

What is the best way to store them in memory? Does torch::Tensor allow storing a block of heterogeneous data via pointers to (dynamically allocated) other torch::Tensors, which would hold the actual data? (And possibly do so recursively?) Most importantly, does it still allow all the nice PyTorch things like gradient tracking?

While I could manage this by storing tensors in a std::vector<torch::Tensor> (and doing so recursively), I would lose the ability to use the torch interface when interacting with the data.
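
For concreteness, a minimal sketch of what I mean by the std::vector route (all names, sizes, and the toy loss below are made up for illustration; this assumes the C++ frontend via torch/torch.h):

```cpp
#include <torch/torch.h>

#include <iostream>
#include <vector>

int main() {
  // Hypothetical ragged dataset: xs[i][j] is x^{i,j} with its own length N_{i,j},
  // and ys[i] is y^i of length N_i. All sizes below are arbitrary.
  std::vector<std::vector<torch::Tensor>> xs;
  std::vector<torch::Tensor> ys;

  for (int64_t i = 0; i < 3; ++i) {
    int64_t N_i = 2 + i;  // number of sequences in sample i (varies per sample)
    std::vector<torch::Tensor> sample;
    for (int64_t j = 0; j < N_i; ++j) {
      int64_t N_ij = 4 + j;  // length of sequence j (varies per sequence)
      sample.push_back(torch::randn({N_ij}, torch::requires_grad()));
    }
    xs.push_back(std::move(sample));
    ys.push_back(torch::randn({N_i}));
  }

  // Gradient tracking still works per tensor: build a scalar loss from the
  // ragged entries and backpropagate through all of them at once.
  torch::Tensor loss = torch::zeros({});
  for (const auto& sample : xs) {
    for (const auto& x_ij : sample) {
      loss = loss + x_ij.pow(2).sum();
    }
  }
  loss.backward();
  std::cout << xs[0][0].grad() << std::endl;  // per-element gradients are available

  return 0;
}
```

So autograd is not the problem; the annoyance is that the container itself is not a tensor, so nothing batched (indexing, slicing, vectorized ops) works across elements.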

What is the recommended way to do this? I imagine similar problems arise often for graph neural networks.

What you describe is called NestedTensors in PyTorch parlance and it isn’t there yet.

Until that materializes, the options are mostly padding (to a regular shape) or packing. Either can be combined with "stratifying" your batches so that samples of similar size end up in the same batch; torchtext does this, and I have done it before for CT scans.
Whether padding or packing is more appropriate, and how the stratification is best done, depends on your data and your task.
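
For illustration, here is roughly what padding could look like in the C++ frontend using only basic tensor ops (pad_batch is a made-up helper, not a library API, and the shapes are arbitrary):

```cpp
#include <torch/torch.h>

#include <algorithm>
#include <iostream>
#include <utility>
#include <vector>

// Made-up helper: pad a vector of 1-D tensors of different lengths into one
// regular (batch, max_len) tensor, plus a boolean mask marking the real entries.
std::pair<torch::Tensor, torch::Tensor> pad_batch(const std::vector<torch::Tensor>& seqs) {
  int64_t batch = static_cast<int64_t>(seqs.size());
  int64_t max_len = 0;
  for (const auto& s : seqs) {
    max_len = std::max(max_len, s.size(0));
  }

  auto padded = torch::zeros({batch, max_len}, seqs[0].options());
  auto mask = torch::zeros({batch, max_len}, torch::kBool);
  for (int64_t b = 0; b < batch; ++b) {
    int64_t len = seqs[b].size(0);
    padded[b].narrow(0, 0, len).copy_(seqs[b]);  // copy the sequence into its row
    mask[b].narrow(0, 0, len).fill_(true);       // remember which entries are valid
  }
  return {padded, mask};
}

int main() {
  std::vector<torch::Tensor> seqs = {torch::randn({3}), torch::randn({5}), torch::randn({2})};
  auto [padded, mask] = pad_batch(seqs);  // padded has shape (3, 5)
  std::cout << padded.sizes() << "\n" << mask << std::endl;
  return 0;
}
```

The mask lets you exclude the padded positions from the loss. Packing instead concatenates the sequences along one dimension and keeps the lengths (or offsets) on the side, which avoids wasting compute on padding but requires length-aware ops downstream.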

Best regards

Thomas

Great, thank you for the pointers!