Understanding tensor sizes in Dataloader

I have workarounds but I suspect there is something fundamental I am missing.

If I build a simple DataLoader using a pandas DataFrame as input, I can never get the dimensions quite right: I always have to squeeze and unsqueeze before passing things to loss functions. Example:

class CustomSimpleDataset(Dataset):
    def __init__(self, featureDataFrame, targetDataFrame):
        self.features = featureDataFrame
        self.targets = targetDataFrame
    def __len__(self):
        return len(self.features)
    def __getitem__(self, idx):
        feature = self.features.iloc[idx]   #select a row from the dataframe
        feature = torch.tensor(feature, dtype=torch.float32) #turn that row into a tensor
        target = self.targets.iloc[idx]
        target = torch.tensor(target, dtype=torch.long) 
        return feature, target  

test_dataset = CustomSimpleDataset(X_test, y_test)
test_dataloader = DataLoader(test_dataset, batch_size=1000, shuffle=False)

loss = nn.BCEWithLogitsLoss()
loss(my_lightning_model.forward(i), j.unsqueeze(dim=1))  # requires either a squeeze or unsqueeze!!!

Could you post a minimal and executable code snippet reproducing the issue using random data?

import torch
import pandas as pd
from torch import nn
from torch.utils.data import Dataset, DataLoader
from sklearn.datasets import fetch_california_housing, load_breast_cancer

class CustomDataset(Dataset):
    def __init__(self, featureDataFrame, targetDataFrame):
        self.features = featureDataFrame
        self.targets = targetDataFrame
    def __len__(self):
        return len(self.targets)
    def __getitem__(self, idx):
        feature = self.features.iloc[idx]   #select a row from the dataframe
        feature = torch.tensor(feature, dtype=torch.float32) #turn that row into a tensor
        target = self.targets.iloc[idx]
        target = torch.tensor(target, dtype=torch.long)
        return feature, target  #you can return more, but generally return your features and targets.

data = fetch_california_housing(as_frame=True)
X = data['data']
y = data['target']
test_dataset = CustomDataset(X, y)
test_dataloader = DataLoader(test_dataset, batch_size=10, num_workers=10, shuffle=False)
for x,y in test_dataloader:
    print(x.shape,x.dtype)
    print(y.shape,y.dtype)
    break

class SimpleNetwork(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear_relu_stack = nn.Sequential(
            nn.Linear(8,10),
            nn.ReLU(),
            nn.Linear(10, 1),
        )
    def forward(self, x):
        logits = self.linear_relu_stack(x)
        return logits
        
model = SimpleNetwork()

loss = nn.MSELoss()
for x,y in test_dataloader:
    p = model(x)
    print(p.dtype)
    print(y.dtype)
    print(p.shape)
    print(y.shape)
    print("This gives a warning")
    print(loss(p,y)) # gives a warning
    print("This does not!")
    print(loss(p,y.unsqueeze(dim=1))) #no warning
    break

Classification Example that fails:

class CustomDataset(Dataset):
    def __init__(self, featureDataFrame, targetDataFrame):
        self.features = featureDataFrame
        self.targets = targetDataFrame
    def __len__(self):
        return len(self.targets)
    def __getitem__(self, idx):
        feature = self.features.iloc[idx]   #select a row from the dataframe
        feature = torch.tensor(feature, dtype=torch.float32) #turn that row into a tensor
        target = self.targets.iloc[idx]
        target = torch.tensor(target, dtype=torch.float)
        return feature, target  #you can return more, but generally return your features and targets.

data = load_breast_cancer(as_frame=True)
X = data['data']
y = data['target']
test_dataset = CustomDataset(X, y)
test_dataloader = DataLoader(test_dataset, batch_size=10, num_workers=10, shuffle=False)
for x,y in test_dataloader:
    print(x.shape,x.dtype)
    print(y.shape,y.dtype)
    break

class SimpleNetwork(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear_relu_stack = nn.Sequential(
            nn.Linear(30,10),
            nn.ReLU(),
            nn.Linear(10, 1),
        )
    def forward(self, x):
        logits = self.linear_relu_stack(x)
        return logits
        
model = SimpleNetwork()

loss = nn.BCEWithLogitsLoss()
for x,y in test_dataloader:
    p = model(x)
    print(p.dtype)
    print(y.dtype)
    print(p.shape)
    print(y.shape)
    try:
        print(loss(p, y))  # raises a ValueError because the shapes differ
    except ValueError as e:
        print("This doesn't work, wrong dims!")
    print("Unsqueeze works fine")
    print(loss(p, y.unsqueeze(dim=1)))  # no error
    break

I realize pandas throws all those warnings about indices, but our grading sheets regularly come in as CSV files, which lend themselves to using pandas to transform the data into tensors.
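
In case it helps to see that path end to end, here is a rough sketch of the CSV -> DataFrame -> tensor conversion (the file and column names are placeholders, not our actual sheets):

import pandas as pd
import torch

df = pd.read_csv("grades.csv")            # placeholder file name
X = df.drop(columns=["final_grade"])      # placeholder target column
y = df["final_grade"]

features = torch.tensor(X.values, dtype=torch.float32)  # shape: [n_rows, n_features]
targets = torch.tensor(y.values, dtype=torch.float32)   # shape: [n_rows]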

Also note that one gets different (but closely related) answers in the Regression/MSE example:
This gives a warning
tensor(4854.9785, grad_fn=<MseLossBackward0>)
This does not!
tensor(4872.5762, grad_fn=<MseLossBackward0>)
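
The difference comes from broadcasting: with an output of shape [10, 1] and a target of shape [10], the pair is broadcast to [10, 10] before the mean is taken, so every prediction is compared against every target. A minimal sketch with random tensors (not the housing data) showing the effect:

import torch
import torch.nn.functional as F

p = torch.randn(10, 1)   # model output: [batch_size, 1]
y = torch.randn(10)      # target as it comes out of the DataLoader: [batch_size]

print(F.mse_loss(p, y))                   # warns, broadcasts to [10, 10] before averaging
print(F.mse_loss(p, y.unsqueeze(dim=1)))  # element-wise loss, what you actually want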

>>> test_dataloader = DataLoader(test_dataset, batch_size=10, num_workers=10, shuffle=False)
>>> for x,y in test_dataloader:
...     print(x.shape,x.dtype)
...     print(y.shape,y.dtype)
...     break
... 
torch.Size([10, 30]) torch.float32
torch.Size([10]) torch.float32

This returns the expected shapes for a batch_size of 10.

The warning (and the ValueError in the classification example) is raised inside the loss function, as nn.BCEWithLogitsLoss expects the model output and target to have the same shape, as described in the docs, so you should either unsqueeze dim1 inside the dataset or in the training loop, as you already do.
Keep in mind that BCEWithLogitsLoss is used for binary or multi-label classification use cases, where the latter allows you to classify each sample into zero, one, or multiple classes. Since your model output has the shape [batch_size, 1], it seems you are working on a binary classification and would thus need to unsqueeze the missing "class dimension" in the target.
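
To make the shape requirement concrete, a minimal sketch with random tensors (not your breast-cancer data): the loss raises a ValueError until the target gets the extra class dimension:

import torch
from torch import nn

loss = nn.BCEWithLogitsLoss()
logits = torch.randn(10, 1)                   # model output: [batch_size, 1]
target = torch.randint(0, 2, (10,)).float()   # labels from the DataLoader: [batch_size]

try:
    loss(logits, target)                      # [10, 1] vs [10] -> ValueError
except ValueError as e:
    print(e)

print(loss(logits, target.unsqueeze(dim=1)))  # [10, 1] vs [10, 1] -> works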

Fair enough, my question was why the data loader was giving the wrong tensor shape ([10] vs [10, 1]) for these trivial examples. Of course they are basically the same (but not quite), and that's why I thought there was some key detail I was missing. I realize the answer is to squeeze/unsqueeze; I was under the impression they should match without manual reshaping (for this case).

The indexing inside __getitem__ creates scalar (0-dim) tensors for the target, and the DataLoader then stacks them, which only adds the batch dimension, giving [batch_size] rather than [batch_size, 1]. You could use a slicing approach in pandas to create actual arrays, but calling unsqueeze might actually be easier.
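
If you want the DataLoader to yield [batch_size, 1] targets directly, one possible sketch of that slicing approach (not the only way to do it) is to return a length-1 array from __getitem__, so the default collate_fn stacks [1]-shaped tensors into [batch_size, 1]:

class CustomDataset(Dataset):
    def __init__(self, featureDataFrame, targetDataFrame):
        self.features = featureDataFrame
        self.targets = targetDataFrame

    def __len__(self):
        return len(self.targets)

    def __getitem__(self, idx):
        # .values turns the selected row into a numpy array before building the tensor
        feature = torch.tensor(self.features.iloc[idx].values, dtype=torch.float32)
        # iloc[idx:idx+1] keeps a length-1 array instead of a scalar,
        # so batching produces targets of shape [batch_size, 1]
        target = torch.tensor(self.targets.iloc[idx:idx + 1].values, dtype=torch.float32)
        return feature, target

With that change y.shape comes out as torch.Size([batch_size, 1]) in the loop and the unsqueeze in the loss call can be dropped.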