Datasets, losses, and squeeze()

I’ve been making datasets for some time and I always seem to run into the same problem: whenever it’s time for a loss function I have to call the .squeeze() method on my targets. It seems like I am doing something wrong in setting up my tensors. Yes, I can get it to work, but it feels like I am doing something subtly wrong.

example code:

import torch
from torch.utils.data import Dataset

## just a toy example; this happens even for images/classification/segmentation
class DataFrameDataset(Dataset):
    def __init__(self, X_dataFrame, y_dataFrame):
        self.X = X_dataFrame
        self.y = y_dataFrame

    def __len__(self):
        return self.X.shape[0]

    def __getitem__(self, idx):
        inputs = torch.tensor(self.X.iloc[idx,:]).float()
        targets = torch.tensor(self.y.iloc[idx]).long()
        return inputs, targets
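
For reference, here is a quick sketch of the target shapes this produces, assuming y_dataFrame is a single-column DataFrame (e.g. one label column for Iris):

import pandas as pd
import torch

y_df = pd.DataFrame({"label": [0, 1, 2, 1]})   # hypothetical single-column target frame

row = y_df.iloc[0]                         # a length-1 Series, not a scalar
t = torch.tensor(row.to_numpy()).long()    # -> tensor([0]), shape (1,); same shape as torch.tensor(y_df.iloc[0]) above
print(t.shape)                             # torch.Size([1])

# The default collate_fn stacks these per-sample (1,) tensors,
# so a batch of targets comes out with shape (batch_size, 1):
batch = torch.stack([t, t, t, t])
print(batch.shape)                         # torch.Size([4, 1])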

Now for any model I run (I was doing an Iris dataset example for class), computing the loss requires the following:

from torch import nn

y_hat = model(x_batch)
loss = nn.functional.cross_entropy(y_hat, y_batch.squeeze())

I know the .squeeze() is just changing my (batch x 1) dimensional targets into (batch) dimensional targets, but it seems I missed something in the dataset/dataloader setup that would handle this “natively”, so that a plain loss = nn.functional.cross_entropy(y_hat, y_batch) just works.
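
To make the shapes concrete, here is a minimal sketch for the Iris case (3 classes, batch size of 4 assumed); cross_entropy expects logits of shape (batch, num_classes) and class-index targets of shape (batch,):

import torch
import torch.nn.functional as F

y_hat = torch.randn(4, 3)                        # logits: (batch, num_classes)
y_batch = torch.tensor([[0], [2], [1], [1]])     # targets as collated: (batch, 1)

print(y_batch.shape)               # torch.Size([4, 1])
print(y_batch.squeeze().shape)     # torch.Size([4])

loss = F.cross_entropy(y_hat, y_batch.squeeze())   # targets now (batch,), as expected
print(loss)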

Your code looks generally alright. I guess your y_dataFrame.iloc[idx] returns a length-1 Series rather than a scalar, which you then transform into a PyTorch tensor of shape (1,), so the batched targets end up with shape (batch_size, 1). If so, just keep the .squeeze(1) to remove the unneeded dimension.
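
If you would rather avoid the .squeeze() entirely, one option (just a sketch, assuming y_dataFrame has a single target column) is to return a 0-dim target from __getitem__, so the default collate_fn stacks the targets into shape (batch_size,) on its own:

import torch
from torch.utils.data import Dataset

class DataFrameDataset(Dataset):
    def __init__(self, X_dataFrame, y_dataFrame):
        self.X = X_dataFrame
        self.y = y_dataFrame

    def __len__(self):
        return self.X.shape[0]

    def __getitem__(self, idx):
        inputs = torch.tensor(self.X.iloc[idx, :].to_numpy()).float()
        # .item() turns the length-1 row into a Python scalar, so the target
        # tensor is 0-dim and batches collate to shape (batch_size,)
        target = torch.tensor(self.y.iloc[idx].item()).long()
        return inputs, target

With targets shaped (batch_size,), loss = nn.functional.cross_entropy(y_hat, y_batch) should then work without the squeeze.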