Naming Tensors in TensorDataset

ab-10 · February 12, 2020, 1:11pm

What are your thoughts on providing an option to name the individual tensors in the TensorDataset, similarly to how named tensors allow dimension naming?

The goal of this post is to form an early understanding of the sentiment towards the idea.

I believe that naming tensors in TensorDataset would be a logical extension of naming tensor dimensions, and would give datasets benefits similar to those that named dimensions give tensors.

ptrblck · February 12, 2020, 11:56pm

Could you please post an example of how this naming would be used in a Dataset?

ab-10 · February 13, 2020, 5:42pm

Of course!

I’m thinking of something along the lines of:

import torch
from torch.utils.data import dataset
a = torch.randn(5, 10)
b = torch.randn(5, 15)
# `names` is the proposed argument; TensorDataset does not accept it today
named_tensor_dataset = dataset.TensorDataset(a, b, names=('embeddings', 'labels'))

# printing tensor names helps understand what data
# I'm dealing with and the order of that data
print(named_tensor_dataset.names)
# expected output: ('embeddings', 'labels')

for example in named_tensor_dataset:
    y = SampleNN(example.embeddings)
    err = criterion(y, example.labels)

ptrblck · February 13, 2020, 7:33pm

Thanks for the example.
An easy way to get “named samples” would be to return a dict in your Dataset and use the key as the name.
I'm not sure whether a custom class implementation would work; I would need to verify it.
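
To make the dict suggestion concrete, here is a minimal sketch. `DictDataset` and its keyword-argument constructor are hypothetical names chosen for illustration, not part of PyTorch; the sketch assumes the default collate function batches dict samples key by key.

```python
import torch
from torch.utils.data import Dataset, DataLoader


class DictDataset(Dataset):
    """Hypothetical dataset that returns each sample as a dict of named tensors."""

    def __init__(self, **tensors):
        # All tensors must share the same first (sample) dimension.
        sizes = {t.size(0) for t in tensors.values()}
        assert len(sizes) == 1, "size mismatch between tensors"
        self.tensors = tensors

    def __getitem__(self, index):
        # The keyword used at construction time becomes the sample's key.
        return {name: t[index] for name, t in self.tensors.items()}

    def __len__(self):
        return next(iter(self.tensors.values())).size(0)


dict_dataset = DictDataset(embeddings=torch.randn(5, 10), labels=torch.randn(5, 15))
dict_sample = dict_dataset[0]

# Batches come back as dicts with the same keys and a leading batch dimension.
dict_loader = DataLoader(dict_dataset, batch_size=2)
dict_batch = next(iter(dict_loader))
print(dict_batch["embeddings"].shape, dict_batch["labels"].shape)
```

The trade-off is that access by position (`sample[0]`) is lost, since the sample is now a mapping rather than a tuple.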

ab-10 · February 13, 2020, 8:01pm

I was rather thinking of returning a collections.namedtuple, since it would keep the current access by index while adding access by name.

What do you think about the value of adding the feature to PyTorch?
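
A rough sketch of the namedtuple idea, written as a user-side subclass rather than a change to PyTorch itself. `NamedTensorDataset`, the `names` keyword, and the `Sample` type name are all assumptions made for illustration.

```python
from collections import namedtuple

import torch
from torch.utils.data import TensorDataset


class NamedTensorDataset(TensorDataset):
    """Hypothetical subclass; `names` is not an argument of the stock TensorDataset."""

    def __init__(self, *tensors, names):
        super().__init__(*tensors)
        assert len(names) == len(tensors), "need one name per tensor"
        self.names = tuple(names)
        self._sample_type = namedtuple("Sample", self.names)

    def __getitem__(self, index):
        # TensorDataset returns a plain tuple; wrap it so fields are also named.
        return self._sample_type(*super().__getitem__(index))


named_dataset = NamedTensorDataset(
    torch.randn(5, 10), torch.randn(5, 15), names=("embeddings", "labels")
)
named_sample = named_dataset[0]

# Both access styles work on the same sample.
by_name = named_sample.embeddings
by_index = named_sample[0]
embeddings, labels = named_sample  # tuple unpacking still works
```

Because the namedtuple type is created inside the instance, it may not pickle cleanly when worker processes are used; defining the type at module level would be one way around that.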

ptrblck · February 15, 2020, 9:42am

How would the batching work with a namedtuple?
Wouldn’t the sampler be required to create the batch as a namedtuple, so that you could index it inside the DataLoader loop as e.g.:

for batch in loader:
    x, y = batch.data, batch.target

ab-10 · February 17, 2020, 4:43pm

I’m not sure why the sampler would need to create a namedtuple, since knowing the number of namedtuples in the Dataset is enough to generate random indices over them.

For batching, the DataLoader should create a new namedtuple in which each field holds the corresponding field of the original samples stacked into a tensor with a leading batch dimension (as you have illustrated in the example).
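
The batching step described above can be sketched with a custom `collate_fn`. The `Sample` type and `collate_namedtuples` are hypothetical names; the field names `data` and `target` follow the loop shown earlier in the thread. As far as I can tell, the default collate function already treats namedtuples in a similar way, but that is worth verifying against the PyTorch version in use.

```python
from collections import namedtuple

import torch
from torch.utils.data import DataLoader

# Hypothetical sample type with the field names used earlier in the thread.
Sample = namedtuple("Sample", ["data", "target"])


def collate_namedtuples(samples):
    """Turn a list of namedtuples into one namedtuple of stacked tensors."""
    sample_type = type(samples[0])
    # zip(*samples) groups the i-th field of every sample together.
    return sample_type(*(torch.stack(field) for field in zip(*samples)))


# A plain list works as a map-style dataset: it has __getitem__ and __len__.
samples = [Sample(torch.randn(10), torch.randn(15)) for _ in range(5)]
loader = DataLoader(samples, batch_size=2, collate_fn=collate_namedtuples)

batches = list(loader)
for batch in batches:
    x, y = batch.data, batch.target
```

The sampler only produces indices here; the namedtuple is rebuilt in the collate step, once per batch.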