What is the fastest way to load data from multiple CSV files?

Hi,

I am working with multiple CSV files, each containing multiple 1D arrays of data. I have about 9000 such files, and the combined data is about 40 GB.

I have written a Dataset like this:

import numpy as np
import torch

class data_gen(torch.utils.data.Dataset):
    def __init__(self, files):
        
        self.files = files
        my_data = np.genfromtxt('/data/'+files, delimiter=',')
        self.dim = my_data.shape[1]
        self.data = []
        
    def __getitem__(self, i):

        file1 = self.files
        my_data = np.genfromtxt('/data/'+file1, delimiter=',')
        self.dim = my_data.shape[1]

        for j in range(my_data.shape[1]):
            tmp = np.reshape(my_data[:,j],(1,my_data.shape[0]))
            tmp = torch.from_numpy(tmp).float()
            self.data.append(tmp)        
        
        return self.data[i]

    def __len__(self): 
        
        return self.dim

But this works terribly slowly. I was wondering if I could store all of that data in one file, but I don't have enough RAM. So is there a way around it?

Let me know if there’s a way.

I am confused about a few things here. For example, why is the whole file loaded when only a single column(?) is returned at the end? Additionally, why is self.data repeatedly appended to without checking whether a given index has already been loaded? If, as you say, all of your data cannot fit in memory, it looks like this approach will keep increasing the amount of data held in memory without ever deleting anything. Finally, are you using this dataset directly, without a DataLoader that provides more parallelism?
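
On the second point, here is a minimal sketch of what I mean (the class name is made up, and I'm assuming one file per dataset as in your snippet): read the CSV once in __init__, then just slice out a column per index instead of re-reading the file and appending on every call:

import numpy as np
import torch

class data_gen_cached(torch.utils.data.Dataset):
    def __init__(self, file):
        # read the whole csv once per file instead of once per __getitem__ call
        self.data = np.genfromtxt('/data/' + file, delimiter=',')

    def __getitem__(self, i):
        # column i as a (1, rows) float tensor, matching the original reshape
        col = self.data[:, i].reshape(1, -1)
        return torch.from_numpy(col).float()

    def __len__(self):
        # one datapoint per column
        return self.data.shape[1]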

For the first issue, if it turns out that you don't need to read the entire file, you might want to see if you can use the skip_header parameter of numpy.genfromtxt (numpy.genfromtxt — NumPy v1.20 Manual). If you want to skip columns instead, you could consider "transposing" your CSV files offline so that you can skip rows at data-loading time.
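
For example, something like this skips the leading rows of a file before parsing (the path and the number of rows to skip are just placeholders):

import numpy as np

# hypothetical file; ignore the first 10 rows and parse only what remains
my_data = np.genfromtxt('/data/example.csv', delimiter=',', skip_header=10)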

This is just an example. I am actually using a for loop that iterates over all the CSV files. Each column in a file is a datapoint, and at the end (self.data[i]) I am returning the column, because that's what I want batches of. This is how I'm creating the train_loader:

train_loader = torch.utils.data.DataLoader(
    train_dl_spec, batch_size=128, shuffle=True, num_workers=8, pin_memory=True)
for data in train_loader:
    ...

This is how I am implementing it in the training process:

for x_train in tqdm(train_files):
    train_dl_spec = data_gen(x_train)
    train_loader = torch.utils.data.DataLoader(
        train_dl_spec, batch_size=128, shuffle=True, num_workers=8, pin_memory=True)
    for data in train_loader:
        ...

If columns are datapoints, you might consider preprocessing all of the CSV files offline to move columns to rows, so that you can seek through a file without loading basically the entire file just to get a single datapoint.
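
As a rough sketch of what that could look like (the transposed-file paths, the byte-offset indexing, and the names here are my assumptions, not something you have to follow):

import numpy as np
import torch

# offline, one-time step: write each csv transposed, so that every original
# column becomes a single line of the new file
def transpose_csv(src, dst):
    data = np.genfromtxt(src, delimiter=',')
    np.savetxt(dst, data.T, delimiter=',')

# at training time one datapoint is then one line; recording the byte offset of
# every line once lets __getitem__ seek straight to it instead of re-parsing
# the whole file
class row_gen(torch.utils.data.Dataset):
    def __init__(self, file):
        self.file = file
        self.offsets = []
        with open(file, 'rb') as f:
            offset = 0
            for line in f:
                self.offsets.append(offset)
                offset += len(line)

    def __getitem__(self, i):
        with open(self.file, 'rb') as f:
            f.seek(self.offsets[i])
            line = f.readline().decode()
        row = np.array(line.strip().split(','), dtype=np.float32)
        return torch.from_numpy(row.reshape(1, -1))

    def __len__(self):
        return len(self.offsets)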

Yeah, will try that. Let’s see how it works.