How do I create custom dataset for KDD99 that can be input for Dataloader()

Hi, guys
I am doing a project on the KDD99 dataset. After some preprocessing, I was able to represent the dataset as a matrix of shape (125973, 123) with a 'label' column of 5 classes (values 0-4). After normalization, each row is treated as one training example and each column (excluding the 'label' column) as one feature.
Then I found some existing code from another project that I want to reuse, but that project deals with the FashionMNIST dataset, which is a built-in dataset that can be accessed directly through interface functions.


So I am kind of stuck and don't know how to load my dataset into DataLoader() in the same fashion so that I can reuse the existing code.
I'd really appreciate it if someone could point me in the right direction.

Hello, what you need to do is implement your own dataset by subclassing torch.utils.data.Dataset, like in this small example:

import torch

class CustomDataset(torch.utils.data.Dataset):
    def __init__(self, inputs, labels):
        self.inputs = inputs
        self.labels = labels

    def __getitem__(self, index):
        # Return one (features, label) pair; the DataLoader collates these into batches.
        return self.inputs[index, :], self.labels[index]

    def __len__(self):
        return len(self.inputs)  # number of samples

inputs = torch.rand(125973, 122, dtype=torch.float32)      # dummy features
labels = torch.randint(0, 5, (125973,), dtype=torch.long)  # dummy class labels (0-4)
dataset = CustomDataset(inputs, labels)
train_loader = torch.utils.data.DataLoader(dataset, batch_size=32, shuffle=True)
for binputs, blabels in train_loader:
    print(binputs)
    print(blabels)

Basically, you inherit from the PyTorch Dataset class and implement the __getitem__ and __len__ methods. __getitem__ returns a single sample (the DataLoader then collates samples into batch tensors), and __len__ returns the number of samples in the dataset. After that, use the PyTorch DataLoader with your own dataset, just like in the FashionMNIST code, and everything should work fine!
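For your actual data, you'll need to convert the normalized matrix into tensors first. Here is a rough sketch assuming the matrix lives in a NumPy array (called `data` here, with the 'label' column last; both the name and the column layout are assumptions about your preprocessing). PyTorch's built-in TensorDataset can also stand in for the custom class when no extra per-sample logic is needed:

```python
import numpy as np
import torch
from torch.utils.data import TensorDataset, DataLoader

# Assumed layout: normalized KDD99 matrix, 122 feature columns + 1 label column (last).
# Random stand-in data here; in practice `data` comes from your preprocessing step.
data = np.random.rand(125973, 123).astype(np.float32)
data[:, -1] = np.random.randint(0, 5, size=125973)  # class labels 0-4

inputs = torch.from_numpy(data[:, :-1])        # (125973, 122) float32 features
labels = torch.from_numpy(data[:, -1]).long()  # (125973,) integer class labels

# TensorDataset indexes both tensors in lockstep, same as the custom class above.
dataset = TensorDataset(inputs, labels)
train_loader = DataLoader(dataset, batch_size=32, shuffle=True)

binputs, blabels = next(iter(train_loader))
print(binputs.shape, blabels.shape)  # torch.Size([32, 122]) torch.Size([32])
```

If your labels come from a pandas DataFrame instead, `df.to_numpy()` gets you to the same starting point.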

The following site also shows a good example of how to use the Dataset class of PyTorch.
https://stanford.edu/~shervine/blog/pytorch-how-to-generate-data-parallel

Hope this helps!

(Sorry for the edit, I accidentally replied with half an answer)