But this is running terribly slowly. I was wondering if I could store all of that data in one file, but I don’t have enough RAM. Is there a way around this?
I am confused about a few things here. For example, why is the entire file loaded when only a single column (?) is returned at the end? Additionally, why is data repeatedly appended without checking whether a given index has already been retrieved? Since you say that all of your data cannot fit in memory, it looks like this solution will keep increasing the amount of data stored in memory without ever deleting anything. Finally, are you using this dataset directly, without a DataLoader that would provide more parallelism?
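To illustrate the second point, here is a minimal sketch of a column loader with the existence check in place. All names (`ColumnDataset`, the cache layout) are illustrative, not your actual code, and in a real setup this would subclass `torch.utils.data.Dataset` and take a file path rather than a string:

```python
import io
import numpy as np

class ColumnDataset:
    """Sketch only: cache each column once instead of appending
    on every retrieval (would subclass torch.utils.data.Dataset)."""

    def __init__(self, csv_text):
        # In the real setup this would be a file path; a string
        # keeps the sketch self-contained.
        self.csv_text = csv_text
        self.cache = {}  # index -> column, parsed at most once

    def __len__(self):
        # One datapoint per column, per the thread.
        first_line = self.csv_text.split("\n", 1)[0]
        return len(first_line.split(","))

    def __getitem__(self, i):
        # Existence check: without it, repeated epochs keep adding
        # duplicate entries and memory use only ever grows.
        if i not in self.cache:
            # Note genfromtxt still parses the whole file to get
            # one column, which is the other problem in this thread.
            full = np.genfromtxt(io.StringIO(self.csv_text), delimiter=",")
            self.cache[i] = full[:, i]
        return self.cache[i]
```

This bounds the cache at one copy of each column, though it still holds everything ever retrieved, so it only helps if one file's columns fit in memory at a time.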
For the first issue: if it turns out that you don’t need to read the entire file, you might be able to use the skip_header parameter (see numpy.genfromtxt — NumPy v1.20 Manual). If you want to skip columns instead, you could consider “transposing” your csv files offline so that you can skip rows at data loading time.
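For reference, a small example of skipping rows and limiting columns with genfromtxt (the CSV content here is made up; with a real file you would pass the path instead of a StringIO):

```python
import io
import numpy as np

# A tiny in-memory CSV standing in for one of the real files.
csv_text = "a,b,c\n1,2,3\n4,5,6\n"

# skip_header drops rows before parsing; usecols limits which
# columns are parsed, so less data is materialized in memory.
arr = np.genfromtxt(
    io.StringIO(csv_text),
    delimiter=",",
    skip_header=1,   # skip the "a,b,c" header line
    usecols=(0, 2),  # keep only columns 0 and 2
)
# arr -> [[1., 3.], [4., 6.]]
```

Note that usecols still reads every line of the file; it only reduces what gets parsed and stored, which is why transposing offline is the better fix when columns are your datapoints.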
This is just an example. I am actually using a for loop that iterates over all the csv files. Each column in the file is a datapoint, and at the end (self.data[i]) I am returning the column because that’s what I want batches of. This is how I’m creating the train_loader:
train_loader = torch.utils.data.DataLoader(
    train_dl_spec, batch_size=128, shuffle=True, num_workers=8, pin_memory=True)
for data in train_loader:
    ...
This is how I am implementing it in the training process:
for x_train in tqdm(train_files):
    train_dl_spec = data_gen(x_train)
    train_loader = torch.utils.data.DataLoader(
        train_dl_spec, batch_size=128, shuffle=True, num_workers=8, pin_memory=True)
    for data in train_loader:
        ...
If columns are datapoints, you might consider preprocessing all of the csv files offline to move columns to rows, so that you can seek through a file without loading basically the entire file for a single data point.
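A sketch of that one-time offline step (the function name is made up; with real files you would read from and write to paths rather than strings):

```python
import csv
import io

def transpose_csv(src_text):
    """Illustrative offline preprocessing: rewrite a CSV so each
    original column becomes a row. Afterwards one datapoint is one
    line, readable without parsing the rest of the file."""
    rows = list(csv.reader(io.StringIO(src_text)))
    out = io.StringIO()
    # zip(*rows) pairs up the i-th field of every row, i.e. column i.
    csv.writer(out, lineterminator="\n").writerows(zip(*rows))
    return out.getvalue()

# "1,2,3\n4,5,6\n" (columns are datapoints) becomes
# "1,4\n2,5\n3,6\n" (one datapoint per line).
```

After this, `__getitem__` can fetch line i of the transposed file (e.g. via precomputed byte offsets and `seek`) instead of parsing the whole CSV per datapoint.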