Speed up PyTorch: separate data augmentation and network training


(Ywu36) #1

Hi everyone,
I would like to do data augmentation ‘on the fly’. According to this link: Fast data loader for Imagenet, data augmentation can significantly slow down training.

Is there a way to use one process to augment the data and save the augmented ‘dataLoader’ to separate files, and another process to load the saved ‘dataLoaders’ and train the network? The two processes could run simultaneously, so that the overhead of data augmentation is hidden behind training.

Another question: is there a way to save a ‘dataLoader’ to the hard disk as a file (e.g., LMDB)? I need to process a large database that cannot fit into RAM.

Thanks,
Yuhang


(Francisco Massa) #2

You don’t need to perform the data augmentation on the fly.
You could store all the augmented data on disk, and load it as if it were the full dataset.
The process that you describe might be a bit risky, because you could end up trying to load a file that has already been deleted by the OS, but it could be done.
The way we currently do it is to use several threads to perform data loading and augmentation, which significantly speeds up the process.

It is totally possible to use LMDB or HDF5 files to store your data. All you need to do is write a Dataset that loads from those files.
You can find an example of using LMDB here: https://github.com/pytorch/vision/blob/master/torchvision/datasets/lsun.py
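
To illustrate the "write a Dataset that loads from those files" idea, here is a minimal sketch under a few assumptions. It uses the standard library's sqlite3 as a stand-in for LMDB or HDF5 so that it runs without extra packages, and the names `write_samples` and `DiskBackedDataset` are hypothetical. The class only implements `__len__` and `__getitem__`, which is the map-style interface a DataLoader relies on; in real use you would probably subclass torch.utils.data.Dataset and read from your own LMDB/HDF5 file instead.

```python
import os
import pickle
import sqlite3
import tempfile


def write_samples(db_path, samples):
    """Serialize each sample into one row; run once, before training."""
    con = sqlite3.connect(db_path)
    con.execute(
        "CREATE TABLE IF NOT EXISTS samples (idx INTEGER PRIMARY KEY, payload BLOB)"
    )
    con.executemany(
        "INSERT INTO samples (idx, payload) VALUES (?, ?)",
        [(i, pickle.dumps(s)) for i, s in enumerate(samples)],
    )
    con.commit()
    con.close()


class DiskBackedDataset:
    """Map-style dataset: only the requested sample is read from disk."""

    def __init__(self, db_path):
        self.db_path = db_path
        self._con = None  # opened lazily, so each worker gets its own handle
        con = sqlite3.connect(db_path)
        self._length = con.execute("SELECT COUNT(*) FROM samples").fetchone()[0]
        con.close()

    def __len__(self):
        return self._length

    def __getitem__(self, index):
        if self._con is None:
            self._con = sqlite3.connect(self.db_path)
        row = self._con.execute(
            "SELECT payload FROM samples WHERE idx = ?", (index,)
        ).fetchone()
        if row is None:
            raise IndexError(index)
        return pickle.loads(row[0])


# Toy data standing in for (augmented image, label) pairs.
db_path = os.path.join(tempfile.mkdtemp(), "augmented.sqlite")
write_samples(db_path, [([0.1, 0.2], 0), ([0.3, 0.4], 1), ([0.5, 0.6], 0)])
dataset = DiskBackedDataset(db_path)
```

Opening the file lazily inside `__getitem__` rather than in `__init__` is a deliberate choice in this sketch: when the loader uses several workers, each one then opens its own handle instead of sharing one inherited from the parent.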


(Ywu36) #3

Thanks for the reply. In my situation, the database is very large and I cannot save all the augmented data to disk. Yesterday I came up with a new idea that uses multiple processes:

    import glob
    import multiprocessing as mp  # assumption: 'mp' could also be torch.multiprocessing
    from torch.utils.data import DataLoader

    # Augment the first block (from image and landmark paths) into 'dataTransformed'
    loaddataParallel(dataTransformed, imgPaths, ldmkPaths)
    # Put the transformed (augmented) data into a DataLoader
    training_data_loader = DataLoader(dataset=dataTransformed, num_workers=opt.threads,
                                      batch_size=opt.batchSize, shuffle=True)
    for dataBlockIdx in range(1, 30):  # Read data from different folders
        imgPaths = glob.glob(Folders[dataBlockIdx] + '/img/*.*g')
        ldmkPaths = glob.glob(Folders[dataBlockIdx] + '/img/*.mat')
        # Start a separate process that augments the next block
        dataAugmentation = mp.Process(target=loaddataParallel,
                                      args=(dataTransformed, imgPaths, ldmkPaths))
        dataAugmentation.start()
        # At the same time, train the network on the current block
        trainNetParallel(epochIdx, unet, training_data_loader)
        dataAugmentation.join()  # Wait until the augmentation process ends
        # Put the newly augmented data into a new DataLoader
        training_data_loader = DataLoader(dataset=dataTransformed, num_workers=opt.threads,
                                          batch_size=opt.batchSize, shuffle=True)

Data augmentation and network training now run simultaneously. I did not put ‘trainNetParallel’ into the parallel pool, because that may trigger a ‘CUDA re-initialization’ error in Python 2.7. Without putting it into the pool, the program still runs.
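
The overlap described here (augment the next data while training on the current data) can also be sketched as a small producer/consumer pattern. This is only an illustration under assumptions, not the code above: it swaps the separate process for a background thread and a bounded queue from the standard library, and `augment`, `producer` and `prefetching_iterator` are hypothetical names. With threads, a real speed-up depends on the augmentation releasing the GIL; for pure-Python augmentation, a process-based variant would be needed.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the stream


def augment(sample):
    # Hypothetical stand-in for a real augmentation (flip, crop, ...).
    return [2 * x for x in sample]


def producer(raw_samples, out_queue):
    """Augment samples one by one and hand them to the training loop."""
    for sample in raw_samples:
        out_queue.put(augment(sample))  # blocks while the queue is full
    out_queue.put(_DONE)


def prefetching_iterator(raw_samples, max_prefetch=4):
    """Yield augmented samples while a background thread prepares the next ones."""
    # The queue is bounded, so at most max_prefetch samples wait in memory.
    out_queue = queue.Queue(maxsize=max_prefetch)
    worker = threading.Thread(
        target=producer, args=(raw_samples, out_queue), daemon=True
    )
    worker.start()
    while True:
        item = out_queue.get()
        if item is _DONE:
            break
        yield item
    worker.join()


raw = [[1, 2], [3, 4], [5, 6]]
seen = []
for batch in prefetching_iterator(raw):
    seen.append(batch)  # a real loop would run the forward/backward pass here
```

Because the queue is bounded, the augmented data never has to be stored in full, which matters when the database is too large for RAM or disk.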

Indeed, using several threads for data loading and augmentation can also help speed up the process; that can be done inside ‘loaddataParallel’.

Another question: in the default PyTorch implementation, does ‘__getitem__’ of ‘torch.utils.data.Dataset’ run in parallel? All the data transformations are done there.


#4

Yes, __getitem__ and the augmentations run in parallel when the DataLoader is created with num_workers > 0.