Training from multiple CSV files

I want to train a model from a folder of (hundreds of) .csv files. How can I load this data and feed it to my model without loading all of it into memory at once?


You can define a custom Dataset class so that, at every step, the DataLoader reads only the data it needs from disk. For more information, see the tutorial below:

https://pytorch.org/tutorials/beginner/data_loading_tutorial.html
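For example, a minimal sketch of such a Dataset could look like this (the folder path is a placeholder, and it assumes every file produces a tensor of the same shape so the default batching works; otherwise use batch_size=1 or a custom collate_fn):

import glob
import os

import pandas as pd
import torch
from torch.utils.data import Dataset


class CsvFolderDataset(Dataset):
    # Treats each .csv file in the folder as one sample, read lazily.
    def __init__(self, root):
        # Only the file paths are kept in memory; nothing is read yet.
        self.files = sorted(glob.glob(os.path.join(root, "*.csv")))

    def __len__(self):
        return len(self.files)

    def __getitem__(self, idx):
        # A file is read from disk only when the DataLoader asks for it.
        df = pd.read_csv(self.files[idx])
        return torch.tensor(df.values, dtype=torch.float32)


dataset = CsvFolderDataset("path/to/csv_folder")
loader = torch.utils.data.DataLoader(dataset, batch_size=32,
                                     shuffle=True, num_workers=4)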

Best wishes.

Thanks Jindong, I was reading through that tutorial. However, is it possible to do something like this:


import pandas as pd
import torch
from torchvision import datasets


def get_data(path):
    df = pd.read_csv(path)
    return df.as_matrix

data_sets = datasets.DatasetFolder(path_to_datasets,
                                   loader=get_data, extensions=['.csv'])
train_loader = torch.utils.data.DataLoader(data_sets,
                                           batch_size=32,
                                           shuffle=False,
                                           num_workers=4)

to read from all the .csv files in the folder and train the model? (This gives an error telling me data_sets is a method, but is anything along these lines possible?)

Hi,

That is because you forgot the () after as_matrix, so get_data returns the bound method instead of the array.

Try

df.as_matrix()
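For reference, your snippet with that fix applied might look like the sketch below. Note that DatasetFolder expects one subdirectory per class under path_to_datasets, that I have passed extensions as a tuple (which newer torchvision versions expect), and that as_matrix() is deprecated in recent pandas, where df.to_numpy() is the replacement:

import pandas as pd
import torch
from torchvision import datasets


def get_data(path):
    df = pd.read_csv(path)
    return df.as_matrix()  # call the method; on recent pandas use df.to_numpy()


# DatasetFolder looks for per-class subfolders, e.g. path_to_datasets/class_a/*.csv
data_sets = datasets.DatasetFolder(path_to_datasets,
                                   loader=get_data, extensions=('.csv',))
train_loader = torch.utils.data.DataLoader(data_sets,
                                           batch_size=32,
                                           shuffle=False,
                                           num_workers=4)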

Best wishes.