PyTorch data loader bottleneck

weiqqi1028 · April 2, 2021, 6:38am

My model training is bottlenecked by IO, and I stream data from S3 using AWS wrangler. I only use 1 GPU for my model training. My machine has 8 GPUs, and I found that when I run multiple training jobs, e.g. 4 jobs on my machine, the total IO throughput of the machine increases 4 times. This suggests that neither the S3 throughput nor the network is what bottlenecks my IO.

It seems that a single data loader has a bottleneck, and I am curious about where the bottleneck comes from. Is there a way to run 4 data loaders that load different parts of the dataset simultaneously in one training job and interleave their output, to increase the IO performance of the data loading?
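
One way to picture the interleaving asked about here is a round-robin merge over several independent iterators. The sketch below is plain Python and only an illustration: `interleave` is a hypothetical helper, not a PyTorch API, and the lists stand in for `DataLoader` instances that would each be built over a disjoint part of the dataset.

```python
def interleave(*loaders):
    """Yield items from several iterables in round-robin order.

    Exhausted iterables are dropped, so inputs of unequal length are fine.
    """
    # Create every iterator up front so all sources are "open" at once.
    iterators = [iter(loader) for loader in loaders]
    while iterators:
        still_active = []
        for it in iterators:
            try:
                yield next(it)
            except StopIteration:
                continue  # this source is finished; do not keep it
            still_active.append(it)
        iterators = still_active


# Stand-ins for four loaders, each covering a different shard (assumed data).
shards = [["a0", "a1"], ["b0"], ["c0", "c1", "c2"], ["d0"]]
merged = list(interleave(*shards))
```

The merge itself is sequential, so any speed-up would have to come from each underlying loader prefetching in its own worker processes while the others are being consumed.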

omarfoq · April 2, 2021, 4:21pm

Hello

DataLoader has a num_workers argument that you can increase to speed up data reading (by default it is 0, which means data is loaded in the main process).

weiqqi1028 · April 2, 2021, 5:23pm

I did tune num_workers and the prefetch factor.

omarfoq · April 2, 2021, 5:34pm

Hello,

In that case the only solution for you is to optimize the part of the code responsible for reading data (Dataset.__getitem__). If possible, load all the data into RAM at once instead of reading from the drive at each batch. Also note that when the number of workers is greater than 1, you have multiple processes loading data at once, so I guess this corresponds to your initial question.
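
If the dataset is iterable-style, each worker process receives its own copy of the dataset object, so the workers only divide the work when `__iter__` restricts itself to a shard. A minimal sketch of such a split, in plain Python: `keys_for_worker` and the key names are hypothetical, and in a real dataset `worker_id` and `num_workers` would come from `torch.utils.data.get_worker_info()` (which returns `None` in the main process).

```python
from itertools import islice


def keys_for_worker(keys, worker_id, num_workers):
    """Return every num_workers-th key, starting at offset worker_id."""
    return list(islice(keys, worker_id, None, num_workers))


# Hypothetical object keys; the real ones are not shown in the thread.
all_keys = [f"part-{i:03d}" for i in range(10)]
per_worker = [keys_for_worker(all_keys, w, 4) for w in range(4)]
```

With this striding, no key is read twice and every key is assigned to exactly one worker.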

weiqqi1028 · April 2, 2021, 9:31pm

Thanks for the reply. My question is: why does running multiple data loader instances increase the IO throughput, when I am already using multiple workers in the data loader?

omarfoq · April 2, 2021, 9:40pm

Hello,

Because that's related to how many processes are working on loading data. If you have more processes loading data, loading will be faster. But now I am not sure I understand your question.

weiqqi1028 · April 2, 2021, 9:47pm

This is my data loader setting:

train_data_loader = DataLoader(
        train_dataset,
        batch_size=batch_size,
        num_workers=os.cpu_count(),
        collate_fn=model.get_feature_to_input_fn(),
        prefetch_factor=args.prefetch_factor,
    )

Inside the train_dataset __iter__ function, I also use multiple threads to stream data from S3. The observation is that with this setup my IO is k/min, but if I create 4 data loaders with this setup, my IO is 4k/min. No matter what parameters I tune for a single data loader, its performance cannot reach 4k/min.
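
The threaded streaming described above might look roughly like the sketch below. It is an assumption about the general shape, not the poster's code: `fetch` is a hypothetical stand-in for the S3 read (the AWS wrangler call is not shown in the thread), and `stream` plays the role of the dataset's `__iter__`.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch(key):
    # Hypothetical stand-in for an S3 read; returns fake bytes.
    return f"payload-for-{key}".encode()


def stream(keys, max_threads=8):
    """Fetch keys concurrently in threads and yield results in input order."""
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        # map() runs the fetches concurrently but preserves input order.
        yield from pool.map(fetch, keys)


chunks = list(stream(["k0", "k1", "k2"]))
```

In CPython, threads like these overlap network waits but share one interpreter lock for CPU-bound work such as decoding, which is worth keeping in mind when comparing one process against several.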