I am training image classification models in PyTorch and using its default data loader to load my training data. I have a very large training dataset, usually a couple thousand sample images per class. I've trained models with about 200k images total without issues in the past. However, I've found that when I have over a million images in total, the PyTorch data loader gets stuck.
I believe the code is hanging when I call datasets.ImageFolder(...). When I Ctrl-C, this is consistently the output:
Traceback (most recent call last):
  File "main.py", line 412, in <module>
    main()
  File "main.py", line 122, in main
    run_training(args.group, args.num_classes)
  File "main.py", line 203, in run_training
    train_loader = create_dataloader(traindir, tfm.train_trans, shuffle=True)
  File "/home/username/.local/lib/python3.5/site-packages/torchvision/datasets/folder.py", line 209, in __init__
    is_valid_file=is_valid_file)
  File "/home/username/.local/lib/python3.5/site-packages/torchvision/datasets/folder.py", line 94, in __init__
    samples = make_dataset(self.root, class_to_idx, extensions, is_valid_file)
  File "/home/username/.local/lib/python3.5/site-packages/torchvision/datasets/folder.py", line 47, in make_dataset
    for root, _, fnames in sorted(os.walk(d)):
  File "/usr/lib/python3.5/os.py", line 380, in walk
    is_dir = entry.is_dir()
KeyboardInterrupt

(The File "main.py", line 236, in create_dataloader frame calling dataset = datasets.ImageFolder(directory, trans) sits between run_training and torchvision's __init__.)
I thought there might be a deadlock somewhere, but based on the traceback from Ctrl-C it doesn't look like it's waiting on a lock. So then I thought the data loader was just slow because I was trying to load a lot more data. I let it run for about two days and it made no visible progress, and over the last two hours of loading, RAM usage stayed the same. In the past I've been able to load training datasets with over 200k images in under a couple of hours. I also tried upgrading my GCP machine to 32 cores, 4 GPUs, and over 100GB of RAM, but it seems that after a certain amount of data is loaded the data loader still just gets stuck.
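For reference, here is the kind of check I mean by "stuck vs. slow": the traceback shows torchvision hanging inside sorted(os.walk(d)), so running the same traversal outside of torchvision with periodic progress output should show whether the filesystem walk itself ever advances. This is a minimal sketch (the function name and the tiny demo tree are mine, standing in for the real traindir):

```python
import os
import tempfile

def count_files(root, report_every=100_000):
    """Walk `root` the way torchvision's make_dataset does (os.walk),
    but print progress so a slow-but-advancing traversal is visible."""
    n = 0
    for dirpath, _, fnames in os.walk(root):
        for _fname in fnames:
            n += 1
            if n % report_every == 0:
                print(f"seen {n} files, currently in {dirpath}")
    return n

# Tiny demo tree (stands in for the real training directory):
with tempfile.TemporaryDirectory() as root:
    for cls in ("cat", "dog"):
        d = os.path.join(root, cls)
        os.makedirs(d)
        for i in range(3):
            open(os.path.join(d, f"{i}.jpg"), "w").close()
    print(count_files(root))  # → 6
```

If this script also stalls with no progress printed, the problem is the filesystem traversal (e.g. an enormous flat directory on a network filesystem), not PyTorch itself.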
I'm confused how the data loader could be getting stuck while looping through the directory, and I'm still unsure whether it's stuck or just extremely slow. Is there some way I can change the PyTorch data loader to handle 1 million+ images for training? Any debugging suggestions are also appreciated!
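One workaround I'm considering is to take the directory walk out of the training run entirely: scan the tree once, cache the (path, label) list to a manifest file, and build a map-style dataset from that list so subsequent runs never call sorted(os.walk(...)). A rough dependency-free sketch of the idea (the names FileListDataset, build_manifest, and load_manifest are mine; in practice the class would subclass torch.utils.data.Dataset and `loader` would be a PIL-based image loader such as torchvision's default one):

```python
import os

class FileListDataset:
    """Map-style dataset over a precomputed (path, label) list.
    PyTorch's DataLoader only needs __len__ and __getitem__."""
    def __init__(self, samples, loader, transform=None):
        self.samples = samples      # list of (path, class_index)
        self.loader = loader        # e.g. a PIL-based image loader
        self.transform = transform

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        path, target = self.samples[idx]
        img = self.loader(path)
        if self.transform is not None:
            img = self.transform(img)
        return img, target

def build_manifest(root, manifest_path, extensions=(".jpg", ".jpeg", ".png")):
    """One-time scan with os.scandir (no sorted(os.walk)); writes
    'relative/path<TAB>class_index' lines so later runs skip the walk."""
    classes = sorted(e.name for e in os.scandir(root) if e.is_dir())
    class_to_idx = {c: i for i, c in enumerate(classes)}
    with open(manifest_path, "w") as f:
        for cls in classes:
            with os.scandir(os.path.join(root, cls)) as it:
                for entry in it:
                    if entry.name.lower().endswith(extensions):
                        f.write(f"{cls}/{entry.name}\t{class_to_idx[cls]}\n")

def load_manifest(root, manifest_path):
    """Read the cached manifest back into a (path, label) list."""
    samples = []
    with open(manifest_path) as f:
        for line in f:
            rel, idx = line.rstrip("\n").split("\t")
            samples.append((os.path.join(root, rel), int(idx)))
    return samples
```

The resulting FileListDataset can be handed to a DataLoader in place of datasets.ImageFolder, and build_manifest only ever has to run once per dataset version.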