We do some data pre-processing with PyTorch and NumPy before feeding samples into the network for training. This is currently done serially, which slows down our training process. Is there an approach in PyTorch to speed it up, such as multiprocessing? Any other suggestions are also welcome. Thanks.
The general idea is to create a dataset class and implement the pre-processing inside its `__getitem__` method. If you then create a `DataLoader` from that dataset and set `num_workers` to a value greater than 0, worker processes will fetch and preprocess samples in parallel while the main process trains.
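A minimal sketch of that pattern, assuming your preprocessing is a per-sample NumPy transform (the normalization here is just a placeholder for your real steps, and `PreprocessedDataset` is a hypothetical name):

```python
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

class PreprocessedDataset(Dataset):
    """Runs NumPy preprocessing per sample inside __getitem__,
    so DataLoader workers execute it in parallel."""

    def __init__(self, raw_data):
        self.raw_data = raw_data  # e.g. a list of NumPy arrays

    def __len__(self):
        return len(self.raw_data)

    def __getitem__(self, idx):
        sample = self.raw_data[idx]
        # Placeholder preprocessing: normalize the sample.
        sample = (sample - sample.mean()) / (sample.std() + 1e-8)
        return torch.from_numpy(sample).float()

if __name__ == "__main__":
    raw = [np.random.rand(8) for _ in range(32)]
    ds = PreprocessedDataset(raw)
    # num_workers > 0 spawns worker processes that call __getitem__
    # concurrently; the main process only collates finished batches.
    loader = DataLoader(ds, batch_size=4, num_workers=2)
    for batch in loader:
        print(batch.shape)  # torch.Size([4, 8])
```

Note the `if __name__ == "__main__":` guard: it is required on platforms that spawn (rather than fork) worker processes. Tune `num_workers` empirically; too many workers can add overhead rather than remove it.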