I am testing Microsoft's PipeDream approach for pipeline-parallel distributed training (https://github.com/msr-fiddle/pipedream).
I am using AlexNet with a subset of the ImageNet data, running the approach on 4 K80 GPUs.
I am observing strange behavior in the data-loader timing on the first GPU in the pipeline. The PipeDream code constructs its data loader as follows:
```python
import torch
import torchvision.datasets as datasets
import torchvision.transforms as transforms

# standard ImageNet normalization (defined earlier in the PipeDream code)
normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                 std=[0.229, 0.224, 0.225])

train_dataset = datasets.ImageFolder(
    traindir,
    transforms.Compose([
        transforms.RandomResizedCrop(224),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        normalize,
    ]))
...
train_loader = torch.utils.data.DataLoader(
    train_dataset, batch_size=args.batch_size,
    shuffle=(train_sampler is None),
    num_workers=args.workers, pin_memory=True,
    sampler=train_sampler, drop_last=True)
```
The issue: I am passing args.workers=10, and every batch whose index is a multiple of args.workers (e.g. batches 10, 20, …) takes a significantly longer time in next(…) on the loader iterator:

| batch | time on next(…) |
| ----- | --------------- |
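For reference, this is roughly how I collect those timings (a simplified sketch; the warm-up-free loop and the 100-batch count are just illustrative, and PipeDream's actual runtime wraps the loader differently):

```python
import time

# Sketch of the measurement: time each call to next() on the
# DataLoader iterator and print the per-batch latency.
it = iter(train_loader)
for batch_idx in range(100):
    start = time.time()
    images, target = next(it)  # blocks until a worker has a batch ready
    elapsed = time.time() - start
    print(f"batch {batch_idx}: time on next(...) = {elapsed:.4f}s")
```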
I have checked multiple batch sizes (64, …, 256, 512) and varied args.workers from 4 to 10; in every case the slow batches fall on multiples of the worker count.
Can someone shed some light on this?