Discussion about datasets and dataloaders


(Clément Pinard) #1

Hello there,
I have been working on a PyTorch implementation of FlowNet, both because it will be useful to me and because it lets me practise using the framework (convergence is still a work in progress).

However, there were some issues that I had to solve in order to match my workflow. So I created this topic to discuss either possible improvements to the dataset interface or improvements to my own workflow, which I like but which may be far from perfect.

transform functions
As discussed here, transform functions currently do not support coherent transformations between input and target, because the random parameters (such as whether to flip, or the crop coordinates) are drawn inside the function call. That is not a problem for classification, but it is for all kinds of geometric problems, from bounding box localisation to flow/depth estimation.
To address this issue, I created co_transforms that take both input and target as arguments. That way you can write your own transformation that keeps your target coherent with your augmented input.
See it here
The big problem, to my mind, is that the order in which co_transforms and transforms / target_transforms are called is ambiguous. Do we call the co_transforms first or last? If they are called last, all co_transforms must deal with PyTorch tensors, while if they are called first, they have to deal with different kinds of data structures.
In my code, you will see here that I chose to call them first, but before that I hard-coded the numpy conversion for images (calling imread from scipy instead of the PIL loader) so that I always deal with numpy arrays. This was, to me, the only way, because transforms and target_transforms contain the ToTensor() function (which I slightly modified into ArrayToTensor() to get the correct HxWxC to CxHxW conversion). What’s more, dealing with PIL operations and numpy arrays at the same time is very risky, as h and w are in reverse order between PIL and array functions.
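To make the idea concrete, here is a minimal sketch of a co_transform. It is not the code linked above: the class names CoCompose and RandomHorizontalFlip are hypothetical, and images and flow maps are plain nested lists here (rows of pixels, rows of (u, v) pairs) to keep the example dependency-free, whereas the real code works on numpy arrays. The point is that a single random draw decides the flip for both input and target, and that the target needs its own handling (the horizontal flow component changes sign).

```python
import random


class CoCompose:
    """Chain co_transforms; each one takes and returns an (input, target) pair."""

    def __init__(self, co_transforms):
        self.co_transforms = co_transforms

    def __call__(self, inp, target):
        for transform in self.co_transforms:
            inp, target = transform(inp, target)
        return inp, target


class RandomHorizontalFlip:
    """Flip image and flow map together, based on one shared random draw."""

    def __init__(self, p=0.5, rng=None):
        self.p = p
        self.rng = rng or random.Random()

    def __call__(self, image, flow):
        if self.rng.random() < self.p:
            # Mirror every row of the image.
            image = [row[::-1] for row in image]
            # Mirror the flow map and negate the horizontal component u,
            # since a motion to the right becomes a motion to the left.
            flow = [[(-u, v) for (u, v) in row[::-1]] for row in flow]
        return image, flow
```

A crop would follow the same pattern: draw the coordinates once inside __call__ and apply them to both arguments.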

Split dataset
I also decided to try a split dataset. As suggested by some, the simplest way is to manually glob files and make two different datasets from lists of image paths. But I think a split dataset is a common need when you are developing your own dataset, so I tried to put it all in a unified class with a split parameter. Here it is applied to the Flying Chairs dataset, which is just a folder of image pairs and their associated flo files. (dataset code)
To move from the train dataset to the test dataset, you just have to call dataset.eval() or dataset.train() before calling the data loader, the same way you do when putting the network in train or test mode.
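A minimal sketch of such a class, under stated assumptions: the name SplitDataset, the seed argument and the internal layout are hypothetical and only illustrate the train()/eval() switch described above, not the linked dataset code. Samples can be anything indexable, for example (image pair, flo file) path tuples.

```python
import random


class SplitDataset:
    """Hold one list of samples, split once into train and test subsets."""

    def __init__(self, samples, split=0.8, seed=0):
        samples = list(samples)
        # Shuffle once with a fixed seed so the split is reproducible.
        random.Random(seed).shuffle(samples)
        n_train = int(round(split * len(samples)))
        self._subsets = {"train": samples[:n_train], "test": samples[n_train:]}
        self._mode = "train"

    def train(self):
        """Serve samples from the train subset."""
        self._mode = "train"
        return self

    def eval(self):
        """Serve samples from the test subset."""
        self._mode = "test"
        return self

    def __len__(self):
        # The length depends on the current mode, so it can change over time.
        return len(self._subsets[self._mode])

    def __getitem__(self, index):
        return self._subsets[self._mode][index]
```

Note that __len__ changes when the mode changes, which is exactly what causes trouble with samplers that read the length only once.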

DataLoader and split dataset
The main problem with this dataset is that DataLoaders are not particularly suited to dynamically changing datasets, especially because of samplers. There are currently two different samplers, sequential and random, and both get the dataset length at creation. Does it really save CPU load not to get it from the dataset’s __len__ function each time? I thus created my own sampler that does not assume a fixed dataset length, which allows it to change, here. I also wanted to control my epoch size, in order to run inference on the validation set more often, which is useful for rapidly changing networks. In this code you will see that sample selection is random but without replacement, to make sure we go through the whole dataset before having a chance to select the same sample twice; with the official random sampler this is not guaranteed, because it is reinitialized whenever an epoch restarts.
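The following is a rough sketch of that kind of sampler, not the linked implementation. The class name EpochSampler and its arguments are assumptions. It re-reads len(dataset) on every draw, yields a fixed number of indices per epoch, and keeps its shuffled pool between epochs so that every index is served once before any index repeats. It assumes the dataset is never empty.

```python
import random


class EpochSampler:
    """Yield epoch_size random indices per epoch, without replacement across epochs."""

    def __init__(self, dataset, epoch_size, seed=None):
        self.dataset = dataset
        self.epoch_size = epoch_size
        self.rng = random.Random(seed)
        self._pool = []      # indices not served yet
        self._pool_len = 0   # dataset length the pool was built for

    def _refill(self, n):
        self._pool = list(range(n))
        self.rng.shuffle(self._pool)
        self._pool_len = n

    def __iter__(self):
        for _ in range(self.epoch_size):
            n = len(self.dataset)  # queried every time, so it may change
            if not self._pool or n != self._pool_len:
                # Pool exhausted, or the dataset changed size (e.g. train -> eval).
                self._refill(n)
            yield self._pool.pop()

    def __len__(self):
        return self.epoch_size
```

Because the pool survives between calls to __iter__, an epoch smaller than the dataset simply continues where the previous one stopped.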

Difference between dataset and Data Loader
The main issue I had to face, and for which I could not find a good workaround, is the data augmentation parameter. Basically, data augmentation transformations are applied in the dataset’s __getitem__ function, so when I call dataset.eval() I shut down the co_transforms, which is not necessarily what I want. So you should either have 2x3 sets of transformations, i.e. one set of 3 (transform, target_transform and co_transform) for train and one for test, or deal with it somewhere else.
I think it makes more sense to deal with it in data loaders; that way your transformations are independent of dataset sampling. What I usually did in Torch7 was to run inference on the train set every now and then, without data augmentation and with the network set to eval, to see how much we overfit. This makes sense when data augmentation consists of, e.g., adding noise or blur to the input: you should be able to easily get train samples without it.

So, to my mind, the dataset would be the class where you decide where to take samples from, and the data loader the one that decides whether specific data augmentation routines are applied.

# Proposed interface: the `transforms` argument does not exist in DataLoader today
dataset = datasets.foo(data, split)
train_loader = torch.utils.data.DataLoader(
    dataset, batch_size=batch_size, num_workers=workers,
    transforms=[input_transform, target_transform, co_transform],
)
test_loader = torch.utils.data.DataLoader(
    dataset, batch_size=batch_size, num_workers=workers,
    transforms=[input_transform_test, target_transform_test, co_transform_test],
)
dataset.train()
enumerate(train_loader)  # batches from train set with data augmentation
dataset.eval()
enumerate(test_loader)  # batches from test set without data augmentation
dataset.train()
enumerate(test_loader)  # batches from train set without data augmentation
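
The same separation can be approximated without changing DataLoader, by wrapping one shared dataset in two thin views that differ only in their transformations. This is only a sketch under assumptions: TransformedView is a hypothetical helper, the wrapped dataset is assumed to return (input, target) pairs, and the co_transform is applied first, as in the ordering chosen earlier in this post.

```python
class TransformedView:
    """Wrap a dataset and apply transformations when a sample is read."""

    def __init__(self, dataset, input_transform=None,
                 target_transform=None, co_transform=None):
        self.dataset = dataset
        self.input_transform = input_transform
        self.target_transform = target_transform
        self.co_transform = co_transform

    def __len__(self):
        # Delegate, so a change of split in the wrapped dataset is visible here.
        return len(self.dataset)

    def __getitem__(self, index):
        inp, target = self.dataset[index]
        if self.co_transform is not None:
            inp, target = self.co_transform(inp, target)
        if self.input_transform is not None:
            inp = self.input_transform(inp)
        if self.target_transform is not None:
            target = self.target_transform(target)
        return inp, target
```

Each view could then be handed to its own data loader, one with augmentation and one without, while both read from the same underlying samples.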

The consequence is that all graphical operations would have to be done directly on tensors (as current samplers expect lists of tensors), but that was already the case with the torch image toolbox. What’s more, CUDA solutions, as discussed here, could maybe be implemented with CudaTensor?

I also found problems regarding the numpy conversion from HxWxC to CxHxW, which are commented in the source. Dealing with tensors from beginning to end could improve data loading time.

The last problem is that graphics functions would be involved in a module that is not part of vision (because data loaders are in pytorch/utils), but I think vision was separated from the rest because it involves PIL operations, which are not necessary for other problems such as text embedding. If we work with tensors, though, these transforms don’t have to be graphical and could be anything, as long as they output tensors.

I hope my suggestions regarding dataset handling are not too naive. Feel free to suggest better ways of doing it (which may be the already existing one!).
If some of my ideas seem good to you, I’d be happy to contribute to it.

Summary of ideas

  • co_transforms
  • split dataset
  • dynamic samplers
  • random sampling without replacement for epoch size < dataset size
  • attach transformations to data loaders
  • add tensor related image loading and transformations to avoid numpy HxWxC to CxHxW conversion

(Shaun) #2

Do you have an update on this, especially the data split with dataloader for train, validation, and test?


#3

Same question: do you have an update on this, especially the data split with dataloader for train, validation, and test? I’m about to do the same thing and don’t want to write a new piece of code for this.


#4

@deepcode no one’s worked on splitting datasets yet.


(Sasank Chilamkurthy) #5

tnt has SplitDataset.
It has a method to select the partition, which therefore changes the data returned by __getitem__.

I’m not too sure how it interacts with DataLoader though. DataLoader seems to make an index or something before returning the iterator. So modifying the dataset by selecting partition might break DataLoader.


(Kevin) #6

If anyone’s stuck on creating train, validation and test splits for datasets in torchvision.datasets, I’ve created a small gist that supports transformations, shuffling, seeding, and optional plotting.

The main logic of the code is as follows:

  • Generate a train dataset and test dataset using the argument train in torchvision.datasets.XYZ where XYZ is your desired data (e.g. CIFAR10 or ImageNet).
  • Figure out the length of your validation set num_valid. If, say, you want 10% of your training data to be used for validation, then you would multiply the total length of the training set by 0.1.
  • Create a list of indices of size num_train, shuffle it, and then slice num_valid indices from it and call it valid_idx and store the rest in train_idx.
  • Feed these indices to separate instances of SubsetRandomSampler.
  • Finally feed these 2 samplers to torch.utils.data.DataLoader using the sampler argument.

And voila!
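
The index-splitting steps above can be sketched as follows. This is not the gist itself: split_indices is a hypothetical helper name, and only the standard library is used so the snippet runs without PyTorch installed. The two returned lists are what would be passed to SubsetRandomSampler, as outlined in the trailing comments.

```python
import random


def split_indices(num_train, valid_fraction=0.1, shuffle=True, seed=0):
    """Return (train_idx, valid_idx) covering range(num_train) without overlap."""
    if not 0 <= valid_fraction <= 1:
        raise ValueError("valid_fraction should be in the range [0, 1]")
    indices = list(range(num_train))
    if shuffle:
        # Seeded shuffle so the same split can be reproduced.
        random.Random(seed).shuffle(indices)
    num_valid = int(valid_fraction * num_train)
    return indices[num_valid:], indices[:num_valid]


train_idx, valid_idx = split_indices(100, valid_fraction=0.1)

# With PyTorch available, the lists would then be used roughly like this,
# where train_dataset stands for the dataset created with train=True:
#   from torch.utils.data import DataLoader
#   from torch.utils.data.sampler import SubsetRandomSampler
#   train_loader = DataLoader(train_dataset, batch_size=64,
#                             sampler=SubsetRandomSampler(train_idx))
#   valid_loader = DataLoader(train_dataset, batch_size=64,
#                             sampler=SubsetRandomSampler(valid_idx))
```

Both loaders wrap the same training dataset; only the sampler decides which samples each one sees.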

Note that the code is inspired by Mamy Ratsimbazafy’s code which you can view here.