DataLoader gets much slower with more images


I use a modest configuration of the dataloader:

    batchsize = 64
    n_workers = 8
    dl = DataLoader(ds,
            batch_size = batchsize,
            num_workers = n_workers,
            pin_memory = False,
            drop_last = True)

ds is a dataset object that reads images with cv2 and does some preprocessing with scipy. The images are converted to tensors with transformations.to_tensor().
The problem is that, with my 600,000 validation images, an epoch of only iterating over the data (reading images without running them through the model) takes about 7 min. However, with my 3,800,000 training images, an epoch of only iterating over the data takes much longer than 6 x 7 min (the training set is about 6 times larger). I also noticed that the operating system seems to get a little slower while I iterate over my training data. I store my training and validation sets in 30 category folders each. I don't believe I am doing anything else unusual. What could be the cause?

Hi, perhaps you are saving state somehow, like opening files but not closing them or appending to a list that just gets bigger. In Ubuntu you can run

    lsof | wc -l

with sudo, and it will print the number of open files (lsof lists one open file per line). Run it after about 10 seconds, then again every 30 seconds, to see whether the number increases. If it does, something is being opened but not closed. The cause could also be outside your data reading, for example in your training loop. If you try this and the number does increase, we can dig deeper.

Yes, that number increases as my training continues. I have modified my code to avoid opening each image without closing it. Thanks a lot!!!
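For anyone landing here later: the actual change is not shown in this thread, so the following is only a guess at what such a fix might look like. It is a sketch of a map-style dataset that opens each image inside a with-block, assuming a root/<category>/<image> folder layout like the one described in the question. The class name ClosingImageDataset is hypothetical, and decoding is left as a comment so the example needs nothing beyond the standard library.

```python
import os


class ClosingImageDataset:
    """Map-style dataset sketch: returns (raw_bytes, label) and keeps no file open."""

    def __init__(self, root):
        # Assumed layout: one sub-folder per category, image files inside.
        categories = sorted(
            d for d in os.listdir(root) if os.path.isdir(os.path.join(root, d))
        )
        self.samples = []
        for label, category in enumerate(categories):
            folder = os.path.join(root, category)
            for name in sorted(os.listdir(folder)):
                self.samples.append((os.path.join(folder, name), label))

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, index):
        path, label = self.samples[index]
        # The with-block closes the file even if reading raises an exception.
        with open(path, "rb") as f:
            raw = f.read()
        # Decode and preprocess here (for example with cv2.imdecode) before returning.
        return raw, label
```

An object with __len__ and __getitem__ like this can be passed to DataLoader in place of ds; the point is only that every open is paired with a close, whatever decoding happens afterwards.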
