DataLoader does not return a batched iterator; each item from the iterator has the full size of the dataset

I want the features from every iteration to have shape (32, 390),
but my custom DataLoader gives me the full-size data instead.

Hello,

The problem (and the reason for the MemoryError) is that you keep using your whole dataset inside your __getitem__ method. __getitem__ should return a single example from your dataset (one feature tensor and one target value, in your case). The PyTorch DataLoader then takes care of building the batch for you: it stacks batch_size results of __getitem__ into one tensor. With that in mind, here is a small example of how to adapt your code so you no longer get the error:

import numpy as np
import torch


class CustomDataset(torch.utils.data.Dataset):
    def __init__(self, data):
        self.data = data

    def __getitem__(self, idx):
        # do whatever you do with your columns.
        output_col = 4
        # only take the index of the data that you are interested in.
        output = self.data[idx, output_col]

        # arbitrary number on my part, use your columns!
        normalized_data = self.data[idx, 10:12]
        one_hot_encoded_data = self.data[idx, 90:100]

        features = np.concatenate([normalized_data, one_hot_encoded_data])
        features = torch.from_numpy(features)

        return features, output

    def __len__(self):
        return len(self.data)

data = np.random.rand(838682, 390)
dataset = CustomDataset(data)
loader = torch.utils.data.DataLoader(dataset, batch_size=32)

for feature, label in loader:
    print(feature.shape)
    # torch.Size([32, 12])
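
If you want each batch of features to come out with shape (32, 390), as in your original question, the same pattern applies: return all 390 feature columns for one index from __getitem__ and let the DataLoader do the batching. Here is a minimal sketch of that variant, assuming the targets live in a separate array rather than inside the 390 feature columns (FullRowDataset is just an illustrative name):

import numpy as np
import torch


class FullRowDataset(torch.utils.data.Dataset):
    def __init__(self, features, targets):
        self.features = features  # shape (N, 390)
        self.targets = targets    # shape (N,)

    def __getitem__(self, idx):
        # return one full row of 390 features and its single target value
        return torch.from_numpy(self.features[idx]), self.targets[idx]

    def __len__(self):
        return len(self.features)


features = np.random.rand(838682, 390)
targets = np.random.rand(838682)
loader = torch.utils.data.DataLoader(FullRowDataset(features, targets), batch_size=32)

for feature, label in loader:
    print(feature.shape)
    # torch.Size([32, 390])
    break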

Hope it helps!

Thank you! It solved my problem :)