TensorDataset vs Customized Dataset

Hi,

I had a piece of code for which I used TensorDataset. The inputs are one-dimensional signals, and it worked perfectly:

self.train = TensorDataset(self.y_trun, self)
self.val = TensorDataset(self.y_test_trun, self)

However, I decided to make it a bit cleaner for data augmentation, so I created the following Dataset class:

self.train = MRSI_Dataset(self.y_trun, self)
self.val = MRSI_Dataset(self.y_test_trun, self)
import torch
from torch.utils.data import Dataset, DataLoader


class MRSI_Dataset(Dataset):
    def __init__(self, data, engine):
        self.engine = engine
        # initialize dataset
        self.data = data
        # time vector, truncated to the signal length
        self.t = torch.from_numpy(self.engine.t[0:data.shape[1]].T).float()

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        sample = self.Data[idx]
        return sample
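
I consume the dataset through a DataLoader in the usual way; a minimal sketch (the batch size is just a placeholder):

train_loader = DataLoader(self.train, batch_size=32, shuffle=True)
for y_batch in train_loader:
    ...  # each y_batch stacks samples along dim 0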

The problem is that using my customized dataset makes the results quantitatively worse, as if some precision were lost. I was wondering if there is any difference between TensorDataset and the customized dataset.

TensorDataset will just index all passed tensors and return the results as a tuple, as seen in its __getitem__ implementation.
Your custom Dataset won’t work since you are indexing an unknown self.Data attribute. Also, what is self.t used for?
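
For reference, TensorDataset is roughly implemented like this (simplified from the PyTorch source):

from torch import Tensor
from torch.utils.data import Dataset

class TensorDataset(Dataset):
    def __init__(self, *tensors: Tensor) -> None:
        assert all(tensors[0].size(0) == tensor.size(0) for tensor in tensors), "Size mismatch between tensors"
        self.tensors = tensors

    def __getitem__(self, index):
        # index every wrapped tensor along dim 0 and return the results as a tuple
        return tuple(tensor[index] for tensor in self.tensors)

    def __len__(self):
        return self.tensors[0].size(0)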

Thank you. I modified TensorDataset; here is the working piece of code:

from typing import Tuple

import torch
from torch import Tensor
from torch.utils.data import Dataset


class MRSI_Dataset(Dataset[Tuple[Tensor, ...]]):
    r"""Dataset wrapping tensors.

    Each sample will be retrieved by indexing tensors along the first dimension.

    Args:
        *tensors (Tensor): tensors that have the same size of the first dimension.
    """

    tensors: Tuple[Tensor, ...]

    def __init__(self, *tensors: Tensor, engine) -> None:
        assert all(tensors[0].size(0) == tensor.size(0) for tensor in tensors), "Size mismatch between tensors"
        self.tensors = tensors
        self.engine = engine
        # time vector for the augmentation (sampling), truncated to the signal length
        self.t = torch.from_numpy(self.engine.t[0:tensors[0].shape[1]].T).float()

    def __getitem__(self, index):
        # augment each indexed sample; get_augment is defined elsewhere in my code
        return tuple(self.get_augment(tensor[index]) for tensor in self.tensors)

    def __len__(self):
        return self.tensors[0].size(0)
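
Since engine is now a keyword-only argument, I construct the datasets along these lines:

self.train = MRSI_Dataset(self.y_trun, engine=self)
self.val = MRSI_Dataset(self.y_test_trun, engine=self)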

I am so sorry for not giving a full description; self.t is a time vector for the augmentation (sampling) process.

Before getting your answer, I had found a workaround: applying the augmentation with vmap in the training step, as follows:

self.getaug_vmap = torch.vmap(self.get_augment, in_dims=(0, None, None, None, None), randomness='different')
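
The pattern is roughly this (the augmentation body, names, and shapes below are placeholders, not my actual function):

import torch

def get_augment(signal, t, scale, shift, noise_std):
    # operates on a single 1-D signal; with randomness='different',
    # the noise draw differs for every signal in the batch
    noise = noise_std * torch.randn_like(signal)
    return scale * signal + shift + noise

getaug_vmap = torch.vmap(get_augment, in_dims=(0, None, None, None, None), randomness='different')

batch = torch.randn(8, 512)          # m signals of length n
t = torch.linspace(0.0, 1.0, 512)    # time vector
augmented = getaug_vmap(batch, t, 1.0, 0.0, 0.05)  # shape (8, 512)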

Apparently, it is faster. I would appreciate your comments on this approach; I’m wondering whether it is a standard method or whether I should use a custom dataset.

I don’t fully understand the indexing, as it seems you are indexing each tensor with the passed index, while I would assume the self.tensors object itself would be indexed. Could you explain what exactly is stored in self.tensors?

Thank you.
self.tensors holds a 2D matrix (m × n) in which each row is a signal (m signals of length n).
When I modified TensorDataset, it worked. Does the tuple(…) make a difference?