Data Loader is not working

Flock1 · June 11, 2021, 12:50pm

I have multiple CSV files that contain 1D data, and I want to use each row. Each file contains a different number of rows, so I have written a Dataset class like this:

import numpy as np
import torch

class data_gen(torch.utils.data.Dataset):
    def __init__(self, files):
        
        self.files = files
        print("FILES: ", type(self.files))
        
    def __getitem__(self, i):

        print("GETite,")
        
        file1 = self.files[i]
        print("FILE1: ", file1)
        my_data = np.genfromtxt('/data/'+file1, delimiter=',')
        
        # file1 = np.reshape(file1,(1,len(file1)))
        # file1 = torch.from_numpy(file1).float()
        
        # return data
        print(len(my_data))
        return my_data

    def __len__(self): 
        
        return len(self.files)

However, when I call it like this:

train_dl_spec = data_gen(train_files[0])

I get the following output:

FILES:  <class 'str'>

It’s not processing __getitem__ for some reason. What could be the reason?

M_T · June 11, 2021, 1:07pm

You’re likely better off concatenating those CSV files prior to initializing the dataset object.

Handle all of that outside of the Dataset.

Flock1 · June 11, 2021, 2:21pm

I tried that, but the combined file becomes so big that it doesn’t fit in RAM.
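
When one concatenated file is too large for memory, one possible approach is to leave the files separate and keep only the number of rows per file in memory. The sketch below illustrates that idea and is not code from this thread. The class name `LazyRowDataset` and its `paths` argument are hypothetical, it uses the standard `csv` module rather than `np.genfromtxt`, and it assumes the files have no header row and no blank lines.

```python
import bisect
import csv
import itertools


class LazyRowDataset:
    """Map-style dataset over the rows of several CSV files (hypothetical sketch).

    Only the cumulative row counts are held in memory; each row is read
    from disk when it is requested.
    """

    def __init__(self, paths):
        self.paths = list(paths)
        # cumulative[k] = total number of rows in files 0..k
        self.cumulative = []
        total = 0
        for path in self.paths:
            with open(path, newline="") as f:
                total += sum(1 for _ in csv.reader(f))
            self.cumulative.append(total)

    def __len__(self):
        return self.cumulative[-1] if self.cumulative else 0

    def __getitem__(self, i):
        if not 0 <= i < len(self):
            raise IndexError(i)
        # Locate the file that contains global row i
        file_idx = bisect.bisect_right(self.cumulative, i)
        start = self.cumulative[file_idx - 1] if file_idx > 0 else 0
        local = i - start
        # Skip ahead to the requested row without loading the whole file
        with open(self.paths[file_idx], newline="") as f:
            row = next(itertools.islice(csv.reader(f), local, None))
        return [float(x) for x in row]
```

The same `__len__` and `__getitem__` logic could be placed on a `torch.utils.data.Dataset` subclass, converting the returned list with `torch.tensor(row)`. Each lookup rescans the file up to the requested row, so this trades speed for memory. Caching the most recently opened file, or storing byte offsets per row, would likely be faster for large files.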

ptrblck · June 11, 2021, 6:45pm

I’m not sure I understand the issue correctly.
The mentioned output is created in the __init__, so it seems the Dataset is initialized properly.
What kind of issue are you seeing when calling train_dl_spec[0]?

ejguan · June 12, 2021, 2:08am

You are passing train_files[0], which is a string, as files. Not sure if I fully understand your question, but I guess you want data_gen(train_files)[0], which invokes __getitem__ on the data_gen instance.
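
To make the difference concrete, here is a small sketch using a plain Python class as a stand-in for data_gen (no torch needed for this point). The class name `DataGenSketch`, the `calls` attribute, and the file names are made up for illustration. It shows that constructing the object never calls `__getitem__`, and that a string passed as `files` is indexed character by character.

```python
class DataGenSketch:
    """Plain-Python stand-in for data_gen (hypothetical, for illustration only)."""

    def __init__(self, files):
        self.files = files
        self.calls = []  # records every index passed to __getitem__

    def __getitem__(self, i):
        self.calls.append(i)
        return self.files[i]

    def __len__(self):
        return len(self.files)


train_files = ["a.csv", "b.csv", "c.csv"]  # hypothetical file names

# Passing a single file name: self.files is now a string
ds_str = DataGenSketch(train_files[0])
assert ds_str.calls == []   # constructing the object never calls __getitem__
assert len(ds_str) == 5     # length of the string "a.csv", not the number of files
assert ds_str[0] == "a"     # indexing yields one character, not a file name

# Passing the whole list: indexing now yields file names
ds_list = DataGenSketch(train_files)
assert len(ds_list) == 3
assert ds_list[0] == "a.csv"
assert ds_list.calls == [0]  # __getitem__ ran only because we indexed
```

In the original snippet, `__getitem__` would only run once the dataset is indexed directly or iterated by a DataLoader.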

Flock1 · June 12, 2021, 3:55am

I think I was able to solve it. I first restarted the kernel and then edited the class:

class data_gen(torch.utils.data.Dataset):
    def __init__(self, files):
        # files is a single CSV file name (a string), relative to /data/
        self.files = files
        # Load the file once here instead of on every __getitem__ call
        my_data = np.genfromtxt('/data/' + files, delimiter=',')
        self.dim = my_data.shape[1]
        # Each column becomes one sample of shape (1, n_rows)
        self.data = []
        for j in range(self.dim):
            tmp = np.reshape(my_data[:, j], (1, my_data.shape[0]))
            self.data.append(torch.from_numpy(tmp).float())

    def __getitem__(self, i):
        # Return the precomputed tensor for column i
        return self.data[i]

    def __len__(self):
        # Number of samples equals the number of columns
        return self.dim

And now it’s working when I call:

train_loader = torch.utils.data.DataLoader(train_dl_spec, batch_size=128, shuffle=True, num_workers=8, pin_memory=True)