Strange behavior in data loader with workers

Hi,

I am testing Microsoft’s PipeDream approach for distributed training (https://github.com/msr-fiddle/pipedream).
I am using AlexNet with a subset of the ImageNet data and testing the approach on 4 K80 GPUs.

I am observing strange behavior in the data loader timing on the first GPU in the pipeline. The PipeDream code creates the data loader as follows:

train_dataset = datasets.ImageFolder(
    traindir,
    transforms.Compose([
        transforms.RandomResizedCrop(224),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        normalize,
    ]))
...
train_loader = torch.utils.data.DataLoader(
    train_dataset, batch_size=args.batch_size, shuffle=(train_sampler is None),
    num_workers=args.workers, pin_memory=True, sampler=train_sampler, drop_last=True)

The issue is that I am passing args.workers=10, and every batch whose index is a multiple of args.workers (e.g. 10, 20, …) takes a significantly longer time in the next(...) call on the train_loader iterator.

batch    time on next(...) [s]
0        6.153
1        0.001
10       5.408
11       0.001
20       4.916
21       0.001

I have checked this for multiple batch sizes (64, …, 256, 512) and with num_workers ranging from 4 to 10.
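For reference, the timing above is collected roughly like this (an illustrative sketch only, not the actual PipeDream profiling code; I keep a persistent iterator and time each next(...) call):

import time

loader_iter = iter(train_loader)
for batch_idx in range(30):
    start = time.time()
    images, target = next(loader_iter)   # this is the call being timed
    print(f"batch {batch_idx}: {time.time() - start:.3f} s on next(...)")
    # forward/backward work for this pipeline stage happens here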

Can someone shed some light on this?
Best


Each worker will load a batch in the background. Once its batch is returned, the worker will start loading the next batch.
Since you are seeing a slowdown after num_workers batches, this points towards a data loading bottleneck (or a tiny model).

I.e. the first iteration (batch 0) will take some time to load the first complete batch. Once it’s loaded, it’ll be returned and after the forward/backward pass is done, the next iteration will start.
The next workers seem to have already loaded their data and will return it immediately (batches 1 to 9).
At batch 10, the first worker is supposed to yield its next batch, but since the previous 9 iterations were really fast, it hasn’t had enough time to finish loading, so you see the slowdown again until all samples for that batch are ready.
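You could reproduce this effect without PipeDream by using a toy dataset that just sleeps in __getitem__ to simulate slow loading (a minimal sketch with made-up numbers, not your actual setup):

import time
import torch
from torch.utils.data import Dataset, DataLoader

class SlowDataset(Dataset):
    # stand-in for ImageFolder: each sample takes a while to "load"
    def __init__(self, length=4096, load_time=0.05):
        self.length = length
        self.load_time = load_time

    def __len__(self):
        return self.length

    def __getitem__(self, idx):
        time.sleep(self.load_time)           # simulated disk read + transforms
        return torch.randn(3, 224, 224), 0   # fake image and label

if __name__ == "__main__":
    loader = DataLoader(SlowDataset(), batch_size=64, num_workers=10,
                        pin_memory=True, drop_last=True)
    it = iter(loader)
    for batch_idx in range(30):
        t0 = time.time()
        data, target = next(it)
        # with almost no compute between iterations, batches 0, 10, 20, ...
        # should be noticeably slower than the rest
        print(f"batch {batch_idx}: {time.time() - t0:.3f} s")

With 10 workers and essentially no work between next(...) calls, the stall reappears every 10 batches, which matches your first table.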

Hi @ptrblck. Thanks for your feedback.
I was expecting the same, so to test it I ran the sequential (non-pipelined) model with 10 data loading workers. It uses the same data loader in the same way, but there I did not see similar behavior.

batch    time on next(...) [s]
0        5.76223
1        0.71920
2        0.00025
3        0.00025
10       0.00083
11       0.00077
12       0.00030
13       0.00027
20       0.00056
21       0.00029
22       0.00032
23       0.00031
30       0.00035
31       0.00030

Coming back to your point, shouldn’t the data loading workers load batches ahead of time? Is this expected behavior for the multi-process DataLoader (a slowdown every num_workers batches)?

PS: A clarification on the usage here. In PipeDream, the rank 0 worker is the only one loading data from the dataset, so in terms of data loading there is theoretically no difference between PipeDream’s rank 0 worker and the sequential run. The main difference is that PipeDream’s rank 0 worker only does about 1/world_size of the compute.

Each worker starts loading its next batch once its current one has been returned.
However, if the actual workload is small, you are basically just using the data loading loop as:

for data in loader:
    pass

This dummy loop would have a startup time, yield the next batches pretty fast and slow down again, as each worker has to load the next batch.
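Regarding loading ahead of time: the workers do prefetch, and depending on your PyTorch version the DataLoader also exposes a prefetch_factor argument (newer releases, default 2, only effective with num_workers > 0), which controls how many batches each worker keeps queued. A hypothetical variant of your loader (the value 4 is just an example):

# prefetch_factor is only available in newer PyTorch releases
train_loader = torch.utils.data.DataLoader(
    train_dataset, batch_size=args.batch_size, shuffle=(train_sampler is None),
    num_workers=args.workers, pin_memory=True, sampler=train_sampler,
    drop_last=True, prefetch_factor=4)

This only hides the stalls if the workers can keep up on average; if the overall loading rate is the bottleneck, the periodic slowdown will come back.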

Your current output looks interesting, as I thought you had already used 10 workers (which would explain the slowdown after 10 iterations).
What’s the difference between the new output and the previous one?

The second scenario loads data and runs the full model, while the first scenario only works on a partition of the model, hence its computation time per iteration is comparatively short.

I guess that would correspond to your explanation. I will check this again with a timeline trace. Thanks for the explanation in the meantime.