I take a dataset, split it into three subsets, and then configure a DataLoader to access each one, as follows:
full_data_args = {'data_dir': 'penguin_data/data', 'data_file': 'penguin_csv.csv', 'stage': 'full'}
data_batch = dataset.PenguinData(**full_data_args)

train_data_params = {'batch_size': 512, 'shuffle': True, 'num_workers': 0, 'pin_memory': True}
train_dataset = data.DataLoader(data_batch.train_dataset, **train_data_params)

valid_data_params = {'batch_size': 16, 'shuffle': True, 'num_workers': 0, 'pin_memory': True}
test_dataset = data.DataLoader(data_batch.test_dataset, **valid_data_params)  # was data_batch.train_dataset, which pointed the test loader at the training split
valid_dataset = data.DataLoader(data_batch.val_dataset, **valid_data_params)
However, I understand that a better approach is to start from the whole dataset, split it, and build the training, validation, and test loaders from that single source. I can't find an example of how to do this. Can you show me how it can be done? It may be that the batch size ends up the same for all three loaders.
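For reference, here is a minimal sketch of that pattern using torch.utils.data.random_split: hold one full Dataset, split it once, and wrap each split in its own DataLoader with shared parameters. The TensorDataset below is a stand-in for PenguinData (whose internals aren't shown), and the 70/15/15 split ratios and batch size of 64 are illustrative assumptions, not values from the question.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, random_split

# Stand-in for the full PenguinData dataset: 1000 samples, 4 features, 3 classes.
full_dataset = TensorDataset(torch.randn(1000, 4), torch.randint(0, 3, (1000,)))

# Split the single dataset 70/15/15; a seeded generator makes the split reproducible.
n = len(full_dataset)
n_train = int(0.7 * n)
n_val = int(0.15 * n)
n_test = n - n_train - n_val
train_set, val_set, test_set = random_split(
    full_dataset,
    [n_train, n_val, n_test],
    generator=torch.Generator().manual_seed(42),
)

# One shared parameter dict; only shuffle differs (shuffle the training split only).
loader_params = {'batch_size': 64, 'num_workers': 0, 'pin_memory': True}
train_loader = DataLoader(train_set, shuffle=True, **loader_params)
val_loader = DataLoader(val_set, shuffle=False, **loader_params)
test_loader = DataLoader(test_set, shuffle=False, **loader_params)
```

Because random_split returns Subset views of the same underlying dataset, no data is copied, and each loader iterates only over its own indices.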