I am lost on custom batch size definition


I have a problem understanding how I can define a Dataset/DataLoader combination that processes batches of a custom size. I have a tabular dataset with a categorical variable defining the batch. I define the dataset like

class MyDataset:

    def __init__(self, df, features, target):
        self.df = df
        self.features = features
        self.target = target
        self.category_var = list(self.df.category_var.unique())

    def __len__(self):
        return len(self.category_var)

    def __getitem__(self, idx):
        var = self.category_var[idx]
        X = self.df.query('category_var==@var')[self.features]
        y = self.df.query('category_var==@var')[self.target]
        return X,y

So each item in my dataset is a custom sized batch of samples I want to process.

When I defined a train_loader like

train_loader = torch.utils.data.DataLoader(
    MyDataset(df, features, target), batch_size=None, batch_sampler=None
)

My code tanks with:

TypeError: 'int' object is not callable

Which does not give me an angle to work with.

I suspect I am misunderstanding the concept of custom datasets and data loaders, maybe even the definition of a batch.
Is my dataset meant to return a batch or a sample?
If a sample: how can my dataset define a custom batch size if it only returns one sample? Is that not the job of the dataset? If not, whose is it? The DataLoader's? The collate_fn's?

Thank you very much

In the standard use case your __getitem__ method will load, process, and return a single sample.

I’m unsure what “custom” batch size means. If you want to set the batch size to a specific value you could do so in the DataLoader by setting the batch_size argument.
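To illustrate the standard use case, here is a minimal sketch in which __getitem__ returns one sample and the DataLoader assembles the batches. PerSampleDataset and its random data are hypothetical names made up for this illustration, not something from your code.

```python
import torch
from torch.utils.data import Dataset, DataLoader


class PerSampleDataset(Dataset):
    """Hypothetical map-style dataset: one item is one (features, target) pair."""

    def __init__(self, n_samples=10, n_features=3):
        self.x = torch.randn(n_samples, n_features)
        self.y = torch.randn(n_samples)

    def __len__(self):
        return self.x.size(0)

    def __getitem__(self, idx):
        # return a single sample; the DataLoader stacks samples into a batch
        return self.x[idx], self.y[idx]


loader = DataLoader(PerSampleDataset(), batch_size=4)
shapes = [(xb.shape, yb.shape) for xb, yb in loader]
for x_shape, y_shape in shapes:
    print(x_shape, y_shape)
# 10 samples with batch_size=4 give batches of 4, 4 and 2 samples
```

In this setup the batch size is fixed by the DataLoader (apart from a smaller last batch), which is why it does not cover batches of varying size.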

Thank you very much for your reply.
I do not mean setting a custom batch_size argument, because that would be fixed for all batches. I want every one of my batches to have a different size.

In my example I have a variable (called category_var) which groups the data into batches and my __getitem__ wrongly returns one of these batches.

If Dataset.__getitem__ is not the place to do this, do you have a hint on how to achieve it?

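I think Dataset.__getitem__ could still be used to load multiple samples, and you could then effectively bypass the batching done by the DataLoader by setting batch_size=1 during its initialization. Each returned item then arrives with an extra leading dimension of size 1, which you can remove in the training loop.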
I’m currently unsure what exactly is failing as your initial post doesn’t mention the line of code using your approach.

Thanks again. Do you know of a minimal working example (kind of the opposite of what you ask from me…) I could start from? I really struggle to wrap my head around the interaction of Dataset, DataLoader, Sampler and batches.

Sure, here is a simple example which uses pre-defined batch sizes in the Dataset:

import torch
from torch.utils.data import Dataset, DataLoader

class MyDataset(Dataset):
    def __init__(self):
        self.data = torch.randn(100, 1)
        self.target = torch.randint(0, 10, (100,))
        self.batch_sizes = [10, 20, 50, 15, 5]
        assert sum(self.batch_sizes) == self.data.size(0)
    def __len__(self):
        return len(self.batch_sizes)
    def __getitem__(self, index):
        batch_size = self.batch_sizes[index]
        offset = sum(self.batch_sizes[:index])
        print(f"loading {batch_size} samples at offset {offset}")
        x = []
        y = []
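        for i in range(batch_size):
            # collect the samples belonging to this pre-defined batch
            x.append(self.data[offset + i])
            y.append(self.target[offset + i])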
        x = torch.stack(x)
        y = torch.stack(y)
        return x, y

dataset = MyDataset()
loader = DataLoader(dataset, batch_size=1, shuffle=True)

for data, target in loader:
    # remove batch dimension added by the DataLoader
    data = data.squeeze(0)
    target = target.squeeze(0)
    print(data.shape, target.shape)
# loading 15 samples at offset 80
# torch.Size([15, 1]) torch.Size([15])
# loading 10 samples at offset 0
# torch.Size([10, 1]) torch.Size([10])
# loading 50 samples at offset 30
# torch.Size([50, 1]) torch.Size([50])
# loading 20 samples at offset 10
# torch.Size([20, 1]) torch.Size([20])
# loading 5 samples at offset 95
# torch.Size([5, 1]) torch.Size([5])
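
A possible variation on the example above: passing batch_size=None to the DataLoader disables automatic batching, so each item from __getitem__ is passed through as-is and no extra dimension needs to be removed. This is only a sketch; PreBatchedDataset is a hypothetical name, and slicing replaces the explicit loop purely for brevity.

```python
import torch
from torch.utils.data import Dataset, DataLoader


class PreBatchedDataset(Dataset):
    """Hypothetical dataset whose items are already complete batches."""

    def __init__(self):
        self.data = torch.randn(100, 1)
        self.target = torch.randint(0, 10, (100,))
        self.batch_sizes = [10, 20, 50, 15, 5]
        assert sum(self.batch_sizes) == self.data.size(0)

    def __len__(self):
        return len(self.batch_sizes)

    def __getitem__(self, index):
        # slice out the contiguous block of rows that forms this batch
        start = sum(self.batch_sizes[:index])
        stop = start + self.batch_sizes[index]
        return self.data[start:stop], self.target[start:stop]


# batch_size=None: no collation into batches, items are returned unchanged
loader = DataLoader(PreBatchedDataset(), batch_size=None, shuffle=False)
sizes = [data.shape[0] for data, target in loader]
print(sizes)
# [10, 20, 50, 15, 5]
```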

You could also take a look at the BatchSampler approach if it would fit your use case better as described in this post.
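
As a rough sketch of what a batch sampler approach could look like (not necessarily the one from the linked post): the Dataset returns single samples, and a custom sampler passed via the batch_sampler argument yields one list of indices per group, so each batch has the size of its group. GroupBatchSampler, RowDataset and the toy labels are hypothetical names made up for this illustration.

```python
import torch
from torch.utils.data import Dataset, DataLoader, Sampler


class GroupBatchSampler(Sampler):
    """Hypothetical sampler: yields one list of sample indices per group label."""

    def __init__(self, group_labels):
        self.groups = {}
        for idx, label in enumerate(group_labels):
            self.groups.setdefault(label, []).append(idx)

    def __iter__(self):
        # each yielded list of indices becomes one batch
        for indices in self.groups.values():
            yield indices

    def __len__(self):
        return len(self.groups)


class RowDataset(Dataset):
    """Hypothetical dataset returning a single (features, target) row per index."""

    def __init__(self, x, y):
        self.x, self.y = x, y

    def __len__(self):
        return self.x.size(0)

    def __getitem__(self, idx):
        return self.x[idx], self.y[idx]


labels = ["a", "b", "a", "c", "b", "a"]  # toy stand-in for a categorical column
x = torch.arange(12, dtype=torch.float32).reshape(6, 2)
y = torch.arange(6, dtype=torch.float32)

loader = DataLoader(RowDataset(x, y), batch_sampler=GroupBatchSampler(labels))
sizes = [xb.shape[0] for xb, yb in loader]
print(sizes)
# [3, 2, 1]
```

With this split the Dataset stays in the standard one-sample form, and the grouping logic lives entirely in the sampler.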

Thank you very much!! I will try it out and hopefully report success. Just looking at it, I fail to see how this dataset will work, as it does not return a single item in __getitem__ and therefore should fail the same way mine does …

You still didn’t explain where exactly the error message is raised and since my code snippet is executable and not failing I also don’t know why you think it should fail.

You are right. This is the “minimal” example I would love to get working and understand why it does:
The current error is a little bit different than the one I posted as I tinkered a little bit.

import torch
import numpy as np
import pandas as pd
import random

data = np.random.rand(100, 10)
df = pd.DataFrame(data, columns=[f'{i}' for i in range(10)])
df["cat_var"] = [random.choice([f"batch_{i+1}" for i in range(5)]) for j in range(100)]
device = "cpu"

class MyDataset:
    def __init__(self, df, features, target, cat_var):
        self.df = df
        self.features = features
        self.target = target
        self.category_var = list(df.cat_var.unique())

    def __len__(self):
        return len(self.category_var)

    def __getitem__(self, idx):
        var = self.category_var[idx]
        X = torch.tensor(self.df.query("cat_var==@var")[self.features].values)
        y = torch.tensor(self.df.query("cat_var==@var")[self.target])
        return X, y

dataset = MyDataset(df, [f'{i}' for i in range(9)], '9', "cat_var")
train_loader = torch.utils.data.DataLoader(dataset)

class Model(torch.nn.Module):
    def __init__(self):
        super(Model, self).__init__()
        self.layer = torch.nn.Linear(in_features=9, out_features=1)

    def forward(self, x):
        return self.layer(x)

model = Model().to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
loss = torch.nn.MSELoss()
epoch_losses = []
for epoch in range(5):
    epoch_loss = 0
    for X, y in train_loader:
        X, y = X.to(device), y.to(device)
        pred = model(X)
        l = loss(pred, y)
        epoch_loss += l.item()