Continuously loading data (samples) from a large CSV file?

Art · December 19, 2017, 8:48am

Hello,
I have a problem loading my data.
My data is saved as CSV, where each row represents a sample. It is a classification problem, so the data is split into different files, and each file contains the samples for one class. The class label is also stored at the end of each line (row).
Until now I was able to load all the data into memory (with pandas), separate the labels from the actual data, shuffle the data, and finally pass it as a NumPy array to a PyTorch dataset and load it with a DataLoader.

But now I have a much bigger dataset, which I can't load all at once.
So I concatenated all the data into one big CSV file and shuffled it, and now I just need to get “batch_num” lines for each training iteration (preferably with shuffling and with handling of an incomplete last batch).
I tried building a dataset class with pandas.read_csv and the “iterator=True” option, implementing the __getitem__ function with the get_chunk function from pandas (which basically returns a specified number of rows at a time). Unfortunately, I was bombarded with errors from both pandas and PyTorch while trying to write that.
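
One possible way to get “batch_num” random rows without loading the file is to record the byte offset of every line once and then seek to individual rows on demand. The sketch below is only an illustration under some assumptions: the names CsvLineIndex and iter_batches are hypothetical, the file is purely numeric with no header or quoted fields, and the label is the last column, as described above.

```python
import random


class CsvLineIndex:
    """Random access to rows of a CSV file via precomputed byte offsets."""

    def __init__(self, path):
        self.path = path
        self.offsets = []
        # One sequential pass records where each non-empty line starts.
        # Only one integer per row is kept in memory, not the row itself.
        with open(path, "rb") as f:
            pos = f.tell()
            line = f.readline()
            while line:
                if line.strip():
                    self.offsets.append(pos)
                pos = f.tell()
                line = f.readline()

    def __len__(self):
        return len(self.offsets)

    def __getitem__(self, idx):
        # The file is opened per call, so no open handle is stored on the object.
        with open(self.path, "rb") as f:
            f.seek(self.offsets[idx])
            fields = f.readline().decode("utf-8").strip().split(",")
        # Assumption: all columns are numeric and the label is the last one.
        *features, label = (float(x) for x in fields)
        return features, int(label)


def iter_batches(dataset, batch_size, shuffle=True, seed=None):
    """Yield (features, labels) batches; the last batch may be smaller."""
    order = list(range(len(dataset)))
    if shuffle:
        # A fresh permutation per call gives different batches in every epoch.
        random.Random(seed).shuffle(order)
    for start in range(0, len(order), batch_size):
        rows = [dataset[i] for i in order[start:start + batch_size]]
        yield [r[0] for r in rows], [r[1] for r in rows]
```

Calling iter_batches(CsvLineIndex("test.csv"), batch_size=32) once per epoch would then yield reshuffled batches each time. The same idea should carry over to a torch.utils.data.Dataset subclass that returns tensors from __getitem__, so that DataLoader(shuffle=True) could take over the shuffling and batching.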

Is there a simple way of loading samples from a large CSV file?
Do you know of any implementation of a similar dataset/dataloader that I can use?

I can manipulate the data in many ways: separate the samples into files by class, separate the samples into files so that each file holds one sample (though that might cause space issues), load all the data into memory (on another machine) and save it as a NumPy array file or some other format…
Maybe I should just change my data and use an existing dataset/dataloader that was built for that case?
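
As a rough illustration of the “save it as a NumPy array file” option, the CSV could be streamed into a .npy file once and then memory-mapped, so that only the requested rows are read from disk. This is a sketch, not a tested recipe for the real data: csv_to_npy and load_rows are hypothetical helper names, and it assumes a numeric CSV with no header, a non-empty first row, and the label in the last column.

```python
import csv

import numpy as np
from numpy.lib.format import open_memmap


def csv_to_npy(csv_path, npy_path, dtype=np.float32):
    """Stream a numeric CSV into a .npy file without holding it all in RAM."""
    # First pass: count rows and columns so the output can be preallocated.
    with open(csv_path, newline="") as f:
        reader = csv.reader(f)
        n_cols = len(next(reader))
        n_rows = 1 + sum(1 for row in reader if row)
    # open_memmap creates a regular .npy file that is backed by disk.
    out = open_memmap(npy_path, mode="w+", dtype=dtype, shape=(n_rows, n_cols))
    # Second pass: copy the rows one at a time into the disk-backed array.
    with open(csv_path, newline="") as f:
        i = 0
        for row in csv.reader(f):
            if row:
                out[i] = [float(x) for x in row]
                i += 1
    out.flush()
    del out
    return n_rows, n_cols


def load_rows(npy_path, indices):
    """Read only the requested rows and split them into features and labels."""
    data = np.load(npy_path, mmap_mode="r")
    # Indexing with an index array copies just these rows into memory.
    batch = np.asarray(data[np.asarray(indices)])
    return batch[:, :-1], batch[:, -1].astype(np.int64)
```

Because every row has the same size in the binary file, random access by index is cheap, which makes per-epoch shuffling of indices straightforward.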

I hope someone with experience in this can help me out.
Thanks.

EDIT: Here is some simple code you can run to test it:

import numpy as np
import pandas as pd
from torch.utils.data import DataLoader, Dataset

## writing a small test file: 100 rows of 6 random digits
np.savetxt("test.csv", np.random.randint(10, size=(100,6)), fmt='%1i', delimiter=",")

csv_reader = pd.read_csv('test.csv', sep=',',header = None, iterator=True, chunksize=1)
## printing one row from the file
print(csv_reader.get_chunk(1))


## trying to build the dataset

class csvDataset(Dataset):

    def __init__(self, csv_file):
        self.csv_file_name = csv_file
        self.csvFileHandle = pd.read_csv(csv_file, sep=',', header=None, iterator=True, chunksize=1)

    def __len__(self):
        ## counting the lines in the file
        with open(self.csv_file_name) as f:
            return sum(1 for line in f)

    def __getitem__(self, idx):
        ## getting one sample(row) at a time and returning it
        ## (idx is ignored here: get_chunk always returns the next row)
        sample = self.csvFileHandle.get_chunk(1)

        return sample
    
## testing the dataset
CSV_dataset = csvDataset(csv_file='test.csv')
for i in range(len(CSV_dataset)):
    sample = CSV_dataset[i]
    ## printing very ugly rows one by one
    print(sample)
    if i == 7:
        break
    
dataloader = DataLoader(CSV_dataset, batch_size=1,
                        shuffle=False, num_workers=1)

for i_batch, sample_batched in enumerate(dataloader):
    print(i_batch, sample_batched)

At the end I get errors when trying to use the dataloader.

EDIT2: Now that I think about it, I could probably get away with using only the dataset: giving the dataset a batch size, getting the samples with “get_chunk(batch)”, and manually handling the last batch. I could also shuffle the samples within each batch, but then in every epoch each batch would still contain the same samples, so that is not optimal. It also doesn’t do multithreading.
So I’m still waiting for better suggestions, if there are any.
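
One way the “same samples in every batch” limitation could be softened while still reading the file sequentially is a shuffle buffer: rows are held in a fixed-size buffer and emitted in random order, so samples mix across batch boundaries and differ from epoch to epoch. The sketch below uses hypothetical names (shuffled_rows, batched) and assumes buffer_size is at least 1. It only approximates a full shuffle, since a row can move forward by at most roughly buffer_size positions relative to the file order.

```python
import random


def shuffled_rows(path, buffer_size=1000, seed=None):
    """Yield the non-empty lines of a file in approximately random order."""
    rng = random.Random(seed)
    buffer = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            if len(buffer) < buffer_size:
                # Fill the buffer before emitting anything.
                buffer.append(line)
                continue
            # Put the new row into a random slot and emit the row it evicts.
            j = rng.randrange(buffer_size)
            buffer[j], line = line, buffer[j]
            yield line
    # End of file: emit whatever is left in the buffer, in random order.
    rng.shuffle(buffer)
    yield from buffer


def batched(rows, batch_size):
    """Group an iterable into lists of batch_size; the last may be shorter."""
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch
```

For example, batched(shuffled_rows("test.csv", buffer_size=50), 8) could be called once per epoch. A larger buffer gives better mixing at the cost of more memory. In later PyTorch releases, a generator like this could be wrapped in a torch.utils.data.IterableDataset.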

taiky · February 28, 2018, 9:51pm

@Art
I have the same problem. Have you fixed it?