I have a problem understanding how I can define a Dataset/DataLoader combination that processes batches of a custom size. I have a tabular dataset with a categorical variable defining the batch. I define the dataset like
class MyDataset:
    def __init__(self, df, features, target):
        self.df = df
        self.features = features
        self.target = target
        self.category_var = list(self.df.category_var.unique())

    def __len__(self):
        return len(self.category_var)

    def __getitem__(self, idx):
        var = self.category_var[idx]
        X = self.df.query('category_var==@var')[self.features]
        y = self.df.query('category_var==@var')[self.target]
        return X, y
So each item in my dataset is a custom-sized batch of samples that I want to process.
I guess I am misunderstanding the concept of custom datasets and dataloaders, and maybe even the definition of a batch.
Is my dataset meant to return a batch or a sample?
If sample: how can my dataset define a custom batch size if it only returns one sample? Is that not the job of the dataset? If not, whose job is it? The DataLoader's? The collate_fn's?
In the standard use case your __getitem__ method will load, process, and return a single sample.
I’m unsure what “custom” batch size means. If you want to set the batch size to a specific value you could do so in the DataLoader by setting the batch_size argument.
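A minimal sketch of that standard pattern (the tensor shapes and names here are made up for illustration):

import torch

# Map-style dataset: __getitem__ loads and returns ONE sample.
# The DataLoader then stacks batch_size samples via the default collate_fn.
class StandardDataset(torch.utils.data.Dataset):
    def __init__(self, X, y):
        self.X = X
        self.y = y

    def __len__(self):
        return len(self.X)

    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]  # a single (sample, target) pair

X, y = torch.randn(100, 9), torch.randn(100, 1)
loader = torch.utils.data.DataLoader(StandardDataset(X, y), batch_size=16, shuffle=True)

for xb, yb in loader:
    print(xb.shape)  # torch.Size([16, 9]); the last batch may be smaller
    break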
Thank you very much for your reply.
I do not mean setting a custom batch_size argument, because that would be fixed for all batches. I want every one of my batches to have a different size.
In my example I have a variable (called category_var) which groups the data into batches and my __getitem__ wrongly returns one of these batches.
As Dataset.__getitem__ is apparently not the place to do this, do you have a hint on how to achieve it?
I think Dataset.__getitem__ could still be used to load multiple samples and you could then skip the batching from the DataLoader by setting batch_size=1 during its initialization.
I’m currently unsure what exactly is failing, as your initial post doesn’t show the line of code that actually uses this approach.
Thanks again. Do you know of a minimal working example (kind of the opposite of what you ask from me…) I could start from? I really struggle to wrap my head around the interaction of Dataset, DataLoader, Sampler, and batches.
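(The reply at this point contained a runnable code snippet that is referenced below but not quoted in the thread. A minimal sketch of the suggested idea, where each __getitem__ call returns a whole variable-sized group and the DataLoader runs with batch_size=1 so it never tries to stack groups of different sizes, might look like this; the group sizes below are made up:)

import torch

class GroupDataset(torch.utils.data.Dataset):
    # Each "item" is an entire group of samples, so every group keeps its own size.
    def __init__(self, group_sizes, num_features):
        self.groups = [torch.randn(n, num_features) for n in group_sizes]
        self.targets = [torch.randn(n, 1) for n in group_sizes]

    def __len__(self):
        return len(self.groups)

    def __getitem__(self, idx):
        return self.groups[idx], self.targets[idx]

dataset = GroupDataset(group_sizes=[17, 5, 42], num_features=9)
loader = torch.utils.data.DataLoader(dataset, batch_size=1, shuffle=True)

for X, y in loader:
    # batch_size=1 prepends a batch dimension of size 1; squeeze it away.
    X, y = X.squeeze(0), y.squeeze(0)
    print(X.shape)  # e.g. torch.Size([17, 9])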
Thank you very much!! I will try it out and hopefully report success. Just looking at it, I fail to see how this dataset will work, as it does not return a single item in __getitem__ and should therefore fail the same way as mine …
You still didn’t explain where exactly the error message is raised and since my code snippet is executable and not failing I also don’t know why you think it should fail.
You are right. This is the “minimal” example I would love to get working, and to understand why it currently doesn’t:
The current error is a little bit different from the one I originally posted, as I have tinkered with the code a bit.
import random

import numpy as np
import pandas as pd
import torch

# 100 samples: 9 feature columns plus 1 target column, randomly assigned to 5 groups.
data = np.random.rand(100, 10)
df = pd.DataFrame(data, columns=[f'{i}' for i in range(10)])
df["cat_var"] = [random.choice([f"batch_{i+1}" for i in range(5)]) for j in range(100)]

device = "cpu"

class MyDataset(torch.utils.data.Dataset):
    def __init__(self, df, features, target, cat_var):
        self.df = df
        self.features = features
        self.target = target
        self.cat_var = cat_var
        self.category_var = list(df[cat_var].unique())

    def __len__(self):
        return len(self.category_var)

    def __getitem__(self, idx):
        var = self.category_var[idx]
        rows = self.df[self.df[self.cat_var] == var]
        # Cast to float32 so the tensors match the model's parameter dtype,
        # and reshape y to (n, 1) so it matches the model output.
        X = torch.tensor(rows[self.features].values, dtype=torch.float32)
        y = torch.tensor(rows[self.target].values, dtype=torch.float32).unsqueeze(1)
        return X, y

dataset = MyDataset(df, [f'{i}' for i in range(9)], '9', "cat_var")
# Default batch_size=1: each "batch" is one whole category group with an
# extra leading dimension of size 1, which nn.Linear handles transparently.
train_loader = torch.utils.data.DataLoader(dataset)

class Model(torch.nn.Module):
    def __init__(self):
        super(Model, self).__init__()
        self.layer = torch.nn.Linear(in_features=9, out_features=1)

    def forward(self, x):
        return self.layer(x)

model = Model().to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
loss = torch.nn.MSELoss()

epoch_losses = []
for epoch in range(5):
    epoch_loss = 0
    for X, y in train_loader:
        X, y = X.to(device), y.to(device)
        pred = model(X)
        l = loss(pred, y)
        epoch_loss += l.item()
        optimizer.zero_grad()
        l.backward()
        optimizer.step()
    epoch_losses.append(epoch_loss)
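For completeness, there is an alternative design that keeps __getitem__ returning a single sample, which is what the original question was circling around: pass a custom batch_sampler to the DataLoader that yields one list of row indices per category group. The sketch below (reusing df from the example above) is an assumption about how one could wire this up, not code from the thread; CategoryBatchSampler and RowDataset are made-up helper names:

class CategoryBatchSampler(torch.utils.data.Sampler):
    # Yields one variable-sized list of sample indices per category group.
    def __init__(self, categories):
        # categories: one label per sample, e.g. df["cat_var"].tolist()
        self.batches = {}
        for idx, cat in enumerate(categories):
            self.batches.setdefault(cat, []).append(idx)

    def __iter__(self):
        yield from self.batches.values()

    def __len__(self):
        return len(self.batches)

class RowDataset(torch.utils.data.Dataset):
    # __getitem__ returns a single sample; the grouping happens in the sampler.
    def __init__(self, df, features, target):
        self.X = torch.tensor(df[features].values, dtype=torch.float32)
        self.y = torch.tensor(df[target].values, dtype=torch.float32).unsqueeze(1)

    def __len__(self):
        return len(self.X)

    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]

row_dataset = RowDataset(df, [f'{i}' for i in range(9)], '9')
loader = torch.utils.data.DataLoader(
    row_dataset, batch_sampler=CategoryBatchSampler(df["cat_var"].tolist()))

for X, y in loader:
    print(X.shape)  # each batch is one whole category group, no extra leading dim

(Passing batch_size=None to the DataLoader is yet another option: it disables automatic batching entirely, so each dataset item is returned exactly as __getitem__ produced it, without the extra leading dimension.)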