I have a problem loading my data.
I have my data saved in CSV files, where each row represents a sample. It is a classification problem, so the data is split into separate files, each containing the samples for one class. The last value on each line (row) is the class label.
Until now I was able to load all the data into memory (with pandas), separate the labels from the actual data, shuffle the data, and finally pass it as a numpy array to a PyTorch dataset and load it with a DataLoader.
But now I have a much bigger dataset, which I can't load all at once.
So I concatenated all the data into one big CSV file and shuffled it, and now I just need to get “batch_num” lines for each training iteration (preferably with shuffling, and with handling of an incomplete last batch).
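To make the requirement concrete, here is a minimal sketch of one possible approach that uses only the standard library: scan the file once to record where each line starts, then seek to randomly chosen lines for every batch. The helper names `build_line_offsets` and `iter_batches` and the demo file are hypothetical, and the sketch assumes plain comma-separated integers with no quoted fields and an offset list that fits in memory.

```python
import csv
import os
import random
import tempfile


def build_line_offsets(path):
    """Scan the file once and record the byte offset where each row starts."""
    offsets = []
    with open(path, "rb") as f:
        while True:
            pos = f.tell()
            line = f.readline()
            if not line:
                break
            if line.strip():  # ignore blank lines
                offsets.append(pos)
    return offsets


def iter_batches(path, offsets, batch_size, shuffle=True, seed=None):
    """Yield lists of rows; the last batch may be shorter than batch_size."""
    order = list(range(len(offsets)))
    if shuffle:
        random.Random(seed).shuffle(order)  # a new order can be drawn each epoch
    with open(path, "rb") as f:
        for start in range(0, len(order), batch_size):
            batch = []
            for i in order[start:start + batch_size]:
                f.seek(offsets[i])
                fields = f.readline().decode("utf-8").strip().split(",")
                batch.append([int(x) for x in fields])
            yield batch


# Demo file (assumption): 10 rows, 5 feature columns plus a label column.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.csv")
with open(demo_path, "w", newline="") as f:
    writer = csv.writer(f)
    for r in range(10):
        writer.writerow([r] * 5 + [r % 2])

offsets = build_line_offsets(demo_path)
batches = list(iter_batches(demo_path, offsets, batch_size=4, seed=0))
print([len(b) for b in batches])  # [4, 4, 2]
```

Random seeks are slower than sequential reads, so this trades throughput for a full reshuffle on every epoch.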
I tried building a dataset class with pandas.read_csv and the “iterator=True” option, implementing “__getitem__” with pandas' “get_chunk” function (which basically returns a specified number of rows at a time). Unfortunately, I was bombarded with errors from both pandas and PyTorch while trying to write that.
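For reference, chunked reading on its own (outside of any Dataset class) can be sketched as below. The demo file and the variable names are assumptions for illustration; the label is taken to be the last column, as described above.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Demo file (assumption): 10 rows, 5 feature columns plus a label column.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.csv")
np.savetxt(demo_path, np.arange(60).reshape(10, 6), fmt="%d", delimiter=",")

chunk_shapes = []
# With chunksize set, read_csv returns an iterator of DataFrames
# instead of loading the whole file.
for chunk in pd.read_csv(demo_path, header=None, chunksize=4):
    values = chunk.to_numpy()
    features, labels = values[:, :-1], values[:, -1]
    chunk_shapes.append((features.shape, labels.shape))

# The final chunk is simply shorter when the row count is not a multiple of 4.
print(chunk_shapes)  # [((4, 5), (4,)), ((4, 5), (4,)), ((2, 5), (2,))]
```

The reader only moves forward through the file, which is why it does not map naturally onto index-based access by row number.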
Is there a simple way of loading samples from a large CSV file?
Do you know of any implementation of a similar dataset/dataloader that I can use?
I can manipulate the data in many ways: separate the samples into files by class, separate the samples so that each file holds one sample (though that might cause space issues), load all the data into memory (on another machine) and save it as a numpy array file or in some other format…
Maybe I should just change my data and use an existing dataset/dataloader that was built for that case?
Hope someone who has experience with this can help me out.
EDIT: Here is some simple code you can run to test it:
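As an illustration of the numpy array file option, here is a minimal sketch assuming the data has already been converted once on a machine with enough memory. The file name and the small stand-in array are hypothetical; `np.load` with `mmap_mode="r"` maps the file so that only the rows that get indexed are read from disk.

```python
import os
import tempfile

import numpy as np

npy_path = os.path.join(tempfile.mkdtemp(), "data.npy")

# One-time conversion; a small in-memory array stands in for the real data.
data = np.arange(60, dtype=np.int64).reshape(10, 6)
np.save(npy_path, data)

# mmap_mode="r" maps the file instead of reading it fully into memory.
mapped = np.load(npy_path, mmap_mode="r")

# Pick a random batch of row indices; sorting keeps disk access roughly sequential.
rng = np.random.default_rng(0)
idx = np.sort(rng.choice(len(mapped), size=4, replace=False))

batch = np.asarray(mapped[idx])  # copies only the selected rows into memory
features, labels = batch[:, :-1], batch[:, -1]
print(features.shape, labels.shape)  # (4, 5) (4,)
```

Because rows are addressable by index, an array like this would fit an ordinary map-style Dataset whose `__getitem__` indexes into `mapped`.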
    import numpy as np
    import pandas as pd
    from torch.utils.data import DataLoader, Dataset

    np.random.randint(10, size=(100,6))
    np.savetxt("test.csv", np.random.randint(10, size=(100,6)), fmt='%1i', delimiter=",")

    csv_reader = pd.read_csv('test.csv', sep=',', header=None, iterator=True, chunksize=1)

    ## printing one row from the file
    print(csv_reader.get_chunk(1))

    ## trying to build the dataset
    class csvDataset(Dataset):
        def __init__(self, csv_file):
            self.scv_file_name = csv_file
            self.csvFileHandle = pd.read_csv(csv_file, sep=',', header=None, iterator=True, chunksize=1)

        def __len__(self):
            return sum(1 for line in open(self.scv_file_name))

        def __getitem__(self, idx):
            ## getting one sample (row) at a time and returning it
            sample = self.csvFileHandle.get_chunk(1)
            return sample

    ## testing the dataset
    CSV_dataset = csvDataset(csv_file='test.csv')
    for i in range(len(CSV_dataset)):
        sample = CSV_dataset[i]
        ## printing very ugly rows one by one
        print(sample)
        if i == 7:
            break

    dataloader = DataLoader(CSV_dataset, batch_size=1, shuffle=False, num_workers=1)
    for i_batch, sample_batched in enumerate(dataloader):
        print(i_batch, sample_batched)
At the end I get errors when trying to use the dataloader.
EDIT2: Now that I think about it, I could probably get away with using only the dataset: giving the dataset a batch size, getting the samples with “get_chunk(batch)”, and handling the last batch manually. I could also shuffle the samples within each batch, but then in every epoch each batch would still contain the same samples, so that is not optimal. It also doesn't do multithreading.
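A minimal sketch of that dataset-only idea, assuming a PyTorch version that provides `IterableDataset`. The class name `CsvChunkDataset` and the demo file are hypothetical. It has exactly the limitations described above: rows are shuffled only within a batch, and it is meant for `num_workers=0`, because each worker process would otherwise replay the whole file.

```python
import os
import tempfile

import numpy as np
import pandas as pd
import torch
from torch.utils.data import DataLoader, IterableDataset


class CsvChunkDataset(IterableDataset):
    """Hypothetical helper: streams a CSV in chunks, one chunk per batch."""

    def __init__(self, csv_file, batch_size):
        self.csv_file = csv_file
        self.batch_size = batch_size

    def __iter__(self):
        # Reopen the reader on every epoch so iteration restarts from the top.
        reader = pd.read_csv(self.csv_file, header=None, chunksize=self.batch_size)
        for chunk in reader:
            values = torch.tensor(chunk.to_numpy())
            # Shuffle rows inside this batch only.
            values = values[torch.randperm(len(values))]
            # Last column is the label; the final batch may be shorter.
            yield values[:, :-1], values[:, -1]


# Demo file (assumption): 10 rows, 5 feature columns plus a label column.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.csv")
np.savetxt(demo_path, np.random.randint(10, size=(10, 6)), fmt="%d", delimiter=",")

# batch_size=None turns off automatic batching, since the dataset
# already yields complete batches.
loader = DataLoader(CsvChunkDataset(demo_path, batch_size=4), batch_size=None)
batch_sizes = [features.shape[0] for features, labels in loader]
print(batch_sizes)  # [4, 4, 2]
```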
So I’m still waiting for better suggestions if there are any.