How to create and handle the dataloaders for custom dataset (csv files)


I acquired a dataset of tweets, did some preprocessing on it, and now it's time to load it into PyTorch in order to create and test some models. It's the first time I'm using a custom dataset, and thus the first time I'm manually handling the dataloaders and the Dataset class.

My questions are these:

First of all, what is the appropriate way to organise the csv files before feeding them to the dataloader? One file with all the tweets and their labels? Two csv files with the tweets and their labels, one for train and one for test (train_data, test_data)? Or four files (train_x, train_y, test_x, test_y)?

Secondly, I am extremely confused about how I should create the dataloaders, how to implement the Dataset class, in which form the text should be represented, etc.

Can anyone guide me or give me some good resources regarding that specific procedure?

Thanks in advance :slight_smile:

Regarding csv files, I use pandas to read them. In your case, you should convert each tweet to a vector before loading it into the dataloader.
This tutorial explains how to make a dataloader from a csv file.
So basically, you have to define the __init__(), __len__(), and __getitem__() methods for a Dataset. In addition, your __init__() should include the step of converting each tweet to a vector.
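The three methods above can be sketched as a minimal csv-backed Dataset. This is an illustrative skeleton, not the tutorial's exact code: the column names "text" and "sentiment" are assumptions, and the tweet-to-vector conversion is left as a comment where it would go.

```python
import pandas as pd
from torch.utils.data import Dataset

class TweetDataset(Dataset):
    def __init__(self, csv_path):
        # Read the whole csv with pandas; one row per tweet.
        df = pd.read_csv(csv_path)
        # Assumed column names: "text" (tweet) and "sentiment" (label).
        self.texts = df["text"].tolist()
        self.labels = df["sentiment"].tolist()
        # The tweet-to-vector conversion would also happen here.

    def __len__(self):
        # Total number of samples in the dataset.
        return len(self.labels)

    def __getitem__(self, idx):
        # Return the (tweet, label) pair at index idx.
        return self.texts[idx], self.labels[idx]
```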

1 Like

Thank you for your answer!

Could you elaborate a little with me and tell me If I perceived anything not correctly?

I have a dataframe with 2 columns (sentiment and text) where the text is already preprocessed.

In the __init__() method of the Dataset class, the dataframe is given as an argument, and then I should proceed with the vectorization process. In the vectorization process, I need to create a vocabulary of each word contained in the dataset and map it to an integer, right? Then I will create a third column called text_vectorized, where each sentence will be a list containing the integers that correspond to the words of the sentence.

In the __len__() method I should return the size of the whole dataset, right?

Last but not least, in the __getitem__() method, an index must be given as an argument in order to retrieve the corresponding vectorized text along with its label.

Have I missed something, or am I okay to go? :slight_smile:

In the __init__() method, you should make your data ready. I usually store them as self.x and self.t for the input vectors and target labels, respectively. Yes, you have to create a vocabulary in the process of converting each text to a vector. You can define a separate function for the conversion to reduce the complexity of the __init__() method.
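A sketch of that layout, assuming a dataframe with "text" and "sentiment" columns and a simple whitespace tokenizer (both illustrative choices, not the only way to do it). The vectorization lives in its own helper, as suggested, and the results are stored as self.x and self.t:

```python
import pandas as pd
from torch.utils.data import Dataset

class VectorizedTweetDataset(Dataset):
    def __init__(self, df):
        # Build the vocabulary: map every distinct word to an integer.
        self.vocab = {}
        for text in df["text"]:
            for word in text.split():
                if word not in self.vocab:
                    self.vocab[word] = len(self.vocab)
        # self.x: vectorized tweets, self.t: target labels.
        self.x = [self.vectorize(text) for text in df["text"]]
        self.t = df["sentiment"].tolist()

    def vectorize(self, text):
        # Separate helper keeps __init__() simple.
        return [self.vocab[word] for word in text.split()]

    def __len__(self):
        return len(self.t)

    def __getitem__(self, idx):
        return self.x[idx], self.t[idx]
```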

The __len__() and __getitem__() methods tell the dataloader how you want it to return data in each minibatch. It will return a bunch of samples, each fetched by a separate index idx. It will automatically draw idx from between 0 and the __len__() of your dataset.

1 Like

So the __len__() method should return the dataset’s length, right?

You kinda confused me with the __getitem__() method :stuck_out_tongue: In the tutorial from the link you provided in your first reply, __getitem__() returns a single “object” with its label. Now you’re telling me it needs to return a bunch of samples. Could you make that clearer, please? Cheers

Yes, the __len__() method should return the dataset’s length.
__getitem__() returns a single datapoint at index idx every time it’s called. The dataloader will call it multiple times, depending on the provided batch_size.
In the tutorial, they use batch_size=4; shuffle=True tells it to randomly select each index between 0 and __len__():

dataloader = DataLoader(transformed_dataset, batch_size=4,
                        shuffle=True, num_workers=4)
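To see what the dataloader actually yields, here is a small self-contained sketch; a TensorDataset of dummy numbers stands in for the csv-backed dataset, which is an illustrative substitution:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

x = torch.arange(8).float().unsqueeze(1)   # 8 samples, 1 feature each
t = torch.arange(8)                        # 8 labels
dataset = TensorDataset(x, t)

# batch_size=4 groups 4 single datapoints into each mini-batch;
# shuffle=True draws the indices in random order.
loader = DataLoader(dataset, batch_size=4, shuffle=True)
for batch_x, batch_t in loader:
    # Each batch holds 4 samples: shape [4, 1] for inputs, [4] for labels.
    print(batch_x.shape, batch_t.shape)
```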

1 Like

In short, to create a dataloader, you need to define your data in __init__(), define how you want it to return a single datapoint given an index idx in __getitem__(), and define the range idx should cover (between 0 and __len__()). The dataloader will handle the rest on its own.

1 Like

Thank you very much, it’s all clear now!

One last clarification! At this moment, what we discussed is what I have to implement for now.

Word embeddings, transformation to tensors, etc are later concerns, right?

I suggest you make your data ready before putting it into the dataloader. “Ready data” means the inputs are all numeric vectors with corresponding labels. So yes, word embeddings, transformations …
The transformation to tensors can be handled later, during training. The idea is that if you are going to train on a GPU, you should put only the current mini-batch on the GPU to save its memory. But if your GPU can hold all of your data at once, then it’s fine to do the transformation right in the __init__() method.
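A rough sketch of that mini-batch pattern, moving only the current batch to the device inside the training loop; the random tensors stand in for the real vectorized tweets, and the loop body is abbreviated:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Use the GPU if one is available, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Dummy data: 8 samples of 3 features each, with binary labels.
dataset = TensorDataset(torch.randn(8, 3), torch.randint(0, 2, (8,)))
loader = DataLoader(dataset, batch_size=4, shuffle=True)

for x, t in loader:
    # Only this mini-batch occupies device memory at a time.
    x, t = x.to(device), t.to(device)
    # ... forward pass, loss, backward, optimizer step ...
```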