Initializing ImageFolder dataset becomes extremely slow when multiple tasks are running

zeakey · September 10, 2018, 10:38am

When multiple ImageNet tasks are running on the same machine, initializing torchvision.datasets.ImageFolder becomes extremely slow.

The following code takes about 2 minutes in a single training task:

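import os

import torch
import torchvision.datasets as datasets
import torchvision.transforms as transforms


def ilsvrc2012(path, bs=256):
    # Build ImageNet (ILSVRC2012) train/val loaders from <path>/train and <path>/val.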
    traindir = os.path.join(path, 'train')
    valdir = os.path.join(path, 'val')
    normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                     std=[0.229, 0.224, 0.225])
    train_dataset = datasets.ImageFolder(
        traindir,
        transforms.Compose([
            transforms.RandomResizedCrop(224),
            transforms.RandomHorizontalFlip(),
            transforms.ToTensor(),
            normalize,
        ]))

    train_loader = torch.utils.data.DataLoader(
        train_dataset, batch_size=bs, shuffle=True,
        num_workers=8, pin_memory=True)

    val_loader = torch.utils.data.DataLoader(
        datasets.ImageFolder(valdir, transforms.Compose([
            transforms.Resize(256),
            transforms.CenterCrop(224),
            transforms.ToTensor(),
            normalize,
        ])),
        batch_size=bs, shuffle=False,
        num_workers=8, pin_memory=True)
    return train_loader, val_loader

But when I start a second task that reads the same image data, the same code takes about half an hour!

The images are stored on an SSD connected to the motherboard via SATA.

System Info:

uname -a
Linux Monster 4.13.0-45-generic #50~16.04.1-Ubuntu SMP Wed May 30 11:18:27 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

Torch Info:

python -c "import torch; print(torch.__version__)"
0.4.1

I wonder what causes this huge performance gap.

I can provide more information on request.
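One way to narrow this down is to time the directory traversal on its own, since building an ImageFolder involves walking every class folder and listing its files. The following is a minimal sketch, not torchvision's actual implementation: scan_image_tree, timed_scan, and the extension list are hypothetical names chosen for illustration.

```python
import os
import time

# Assumed set of image extensions; adjust to match the dataset.
IMG_EXTENSIONS = ('.jpg', '.jpeg', '.png', '.ppm', '.bmp', '.pgm', '.tif')


def scan_image_tree(root):
    """Collect (path, class_index) pairs from a folder-per-class layout."""
    classes = sorted(e.name for e in os.scandir(root) if e.is_dir())
    samples = []
    for idx, cls in enumerate(classes):
        # Walk each class folder, including any nested subfolders.
        for dirpath, _, filenames in sorted(os.walk(os.path.join(root, cls))):
            for fname in sorted(filenames):
                if fname.lower().endswith(IMG_EXTENSIONS):
                    samples.append((os.path.join(dirpath, fname), idx))
    return classes, samples


def timed_scan(root):
    """Return (number of classes, number of images, seconds spent scanning)."""
    start = time.perf_counter()
    classes, samples = scan_image_tree(root)
    elapsed = time.perf_counter() - start
    return len(classes), len(samples), elapsed

# Example (the path is a placeholder):
# print(timed_scan('/path/to/imagenet/train'))
```

Running this once with the machine idle and once while another training task is reading the same disk would show whether the slowdown comes from the filesystem scan itself.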

ptrblck · September 11, 2018, 7:42am

Could you measure your disk I/O?
I guess the multiple processes just block each other reading from your SSD.
iostat might be a good starter to check for waits.
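If iostat is not installed, roughly the same read counters can be sampled from /proc/diskstats on Linux. The sketch below is only an illustration under that assumption: parse_diskstats, read_rate, and sample are hypothetical helper names, and the device name "sdc" is a placeholder.

```python
import time

SECTOR_BYTES = 512  # /proc/diskstats counts sectors in 512-byte units


def parse_diskstats(text):
    """Map device name -> (reads_completed, sectors_read, ms_spent_reading)."""
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 7:
            continue  # skip blank or malformed lines
        stats[fields[2]] = (int(fields[3]), int(fields[5]), int(fields[6]))
    return stats


def read_rate(before, after, device, interval_s):
    """Return (MB/s read, average ms per completed read) between two snapshots."""
    r0, s0, t0 = before[device]
    r1, s1, t1 = after[device]
    reads = r1 - r0
    mb_per_s = (s1 - s0) * SECTOR_BYTES / 1e6 / interval_s
    avg_ms = (t1 - t0) / reads if reads else 0.0
    return mb_per_s, avg_ms


def sample(device="sdc", interval_s=1.0, path="/proc/diskstats"):
    """Take two snapshots interval_s apart and report the read rate."""
    with open(path) as f:
        before = parse_diskstats(f.read())
    time.sleep(interval_s)
    with open(path) as f:
        after = parse_diskstats(f.read())
    return read_rate(before, after, device, interval_s)
```

A high average time per read combined with low throughput while the datasets are initializing would point to the processes waiting on each other at the disk.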

zeakey · September 11, 2018, 4:29pm

Could you please give more detailed instructions on how to measure my disk I/O?

zeakey · September 11, 2018, 4:42pm

Here is the output of the iostat command:

Linux 4.13.0-45-generic (Monster) 	2018-09-12 	_x86_64_	(48 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          11.63    0.00    2.17    1.75    0.00   84.44

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda             130.83       623.48       729.04   75656481   88466496
sdb              27.32      1358.62        26.68  164862720    3237276
sdc            1978.66     95973.51       221.50 11645990012   26878608
sdd               0.68        11.31       245.21    1371866   29754852

This was captured while two ImageNet training tasks were initializing their dataloaders. The data is stored on /dev/sdc1.
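As a rough reading of the sdc row above, the average size per transfer can be estimated by dividing throughput by tps. This is only a back-of-the-envelope sketch, and the variable names are made up for illustration. Note also that iostat run without an interval reports averages since boot, so these figures are not a snapshot of the current load.

```python
# Values copied from the sdc row of the iostat output above.
tps = 1978.66            # transfers per second (reads and writes combined)
kb_read_per_s = 95973.51
kb_written_per_s = 221.50

# Average kilobytes moved per transfer.
avg_kb_per_transfer = (kb_read_per_s + kb_written_per_s) / tps
# Read throughput in MB/s, taking 1 MB = 1024 kB.
read_mb_per_s = kb_read_per_s / 1024

print(f"average transfer size: {avg_kb_per_transfer:.1f} kB")
print(f"read throughput: {read_mb_per_s:.1f} MB/s")
```

The per-transfer average is small, which would fit a workload dominated by many small files and metadata reads rather than large sequential reads.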

ptrblck · September 11, 2018, 11:20pm

Have a look at this guide. What kind of SSD is your sdc1?
Could you run your workload again and call iostat -mx 1?

zeakey · September 12, 2018, 4:21am

My SSD is this one: https://www.amazon.com/Kingston-HyperX-Savage-SHSS3B7A-240G/dp/B00W35L6DA

I'll investigate further.