I am testing Microsoft's PipeDream approach for pipeline-parallel distributed training (https://github.com/msr-fiddle/pipedream).
I am using AlexNet with a subset of the ImageNet data, running the approach on 4 K80 GPUs.
I am observing strange behavior in the data-loader timing on the first GPU in the pipeline. The PipeDream code constructs its data loader as follows:
```python
import torch
import torchvision.datasets as datasets
import torchvision.transforms as transforms

# standard ImageNet normalization (defined earlier in the PipeDream code)
normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                 std=[0.229, 0.224, 0.225])

train_dataset = datasets.ImageFolder(
    traindir,
    transforms.Compose([
        transforms.RandomResizedCrop(224),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        normalize,
    ]))
...
train_loader = torch.utils.data.DataLoader(
    train_dataset, batch_size=args.batch_size,
    shuffle=(train_sampler is None),
    num_workers=args.workers, pin_memory=True,
    sampler=train_sampler, drop_last=True)
```
The issue: I am passing args.workers=10, and every batch whose index is a multiple of args.workers (e.g. batches 10, 20, …) takes a significantly longer time in next(…) on the loader iterator:

| batch | time on next(…) |
| ----- | --------------- |
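For reference, this is roughly how I collect those timings (a simplified sketch; the warm-up-free loop and the 100-batch count are just illustrative, and PipeDream's actual runtime wraps the loader differently):

```python
import time

# Sketch of the measurement: time each call to next() on the
# DataLoader iterator and print the per-batch latency.
it = iter(train_loader)
for batch_idx in range(100):
    start = time.time()
    images, target = next(it)  # blocks until a worker has a batch ready
    elapsed = time.time() - start
    print(f"batch {batch_idx}: time on next(...) = {elapsed:.4f}s")
```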
I have checked multiple batch sizes (64, …, 256, 512) and varied args.workers from 4 to 10; in every case the slow batches fall on multiples of the worker count.
Can someone shed some light on this?