Does DataLoader iterate through indexes to generate a batch?


(Srinivas Sekar) #1

Hi,

I have a Dataset class to which I pass a pandas DataFrame (df). My __getitem__ method looks like this:

> def __getitem__(self, index):
>     x = self.df.iloc[index]['column_1']
>     a, b = self.some_function(x)
>     label = self.df.iloc[index]['label']
>     return a, b, label

When I pass the Dataset object to a DataLoader and generate a batch, with batch_size=5 for example, does the DataLoader build the batch by looping over a list of 5 indices and fetching one data point at a time from __getitem__? Since I'm passing a DataFrame into my Dataset class, it would be quicker if index were a list like [0, 1, 2, 3, 4], so that the whole batch could be selected in one call, instead of being passed as individual indices.

I ask because right now the DataLoader is my bottleneck and it is CPU-bound. Any suggestions on how I could modify the code to slice my df into batches without looping over individual indices would be very welcome!

Thank you.


#2

Have a look at this code to see how to provide a list of indices to your Dataset.
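
One way to do this, as a minimal sketch and not necessarily what the code linked above does: wrap a regular sampler in a BatchSampler, pass it as `sampler`, and set `batch_size=None` so the DataLoader does no batching of its own. Each call to __getitem__ then receives a whole list of indices. By default (with `batch_size=5` and no custom sampler) the DataLoader does call __getitem__ once per index and collates the results afterwards. The `FrameDataset` name, the toy DataFrame, and the body of `some_function` below are made up for illustration; your own `some_function` would need to work on a whole batch at once for this to pay off.

```python
import pandas as pd
import torch
from torch.utils.data import BatchSampler, DataLoader, Dataset, SequentialSampler


class FrameDataset(Dataset):
    """Map-style dataset whose __getitem__ accepts a list of row positions."""

    def __init__(self, df):
        self.df = df

    def __len__(self):
        return len(self.df)

    def some_function(self, x):
        # Hypothetical stand-in for the original some_function,
        # written so that it operates on a whole batch tensor.
        return x * 2.0, x + 1.0

    def __getitem__(self, indices):
        # `indices` is a list such as [0, 1, 2, 3, 4]: one iloc call per batch.
        rows = self.df.iloc[indices]
        x = torch.as_tensor(rows["column_1"].to_numpy(), dtype=torch.float32)
        a, b = self.some_function(x)
        label = torch.as_tensor(rows["label"].to_numpy(), dtype=torch.long)
        return a, b, label


# Toy data, assumed only for this example.
df = pd.DataFrame(
    {
        "column_1": [float(i) for i in range(12)],
        "label": [i % 2 for i in range(12)],
    }
)
dataset = FrameDataset(df)

# The BatchSampler yields lists of indices; batch_size=None tells the
# DataLoader to hand each list to __getitem__ unchanged instead of
# fetching and collating one sample at a time.
sampler = BatchSampler(SequentialSampler(dataset), batch_size=5, drop_last=False)
loader = DataLoader(dataset, sampler=sampler, batch_size=None)

batches = list(loader)
for a, b, label in batches:
    print(a.shape, b.shape, label.shape)
```

To shuffle, swap SequentialSampler for RandomSampler. Note that the last batch can be smaller than 5 unless `drop_last=True` is set on the BatchSampler.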


(Srinivas Sekar) #3

Thank you very much!


(Vitaliy Bondarenko) #4

Cool, thank you! I was looking for that as well.