In what order do dataloader workers do their job?

Jochem_Grietens · July 7, 2020, 9:16pm

Hello,

I was wondering how the DataLoader queue works with num_workers > 0. I imagine N workers are created.
I see 2 options:

1. The program goes through all workers in sequence. This would mean that if one worker is delayed for some reason, the other workers have to wait until that worker delivers its batch. It would also mean that if a worker gets stuck in an infinite loop while fetching its data, the training process waits forever.
2. The program takes a batch from whichever worker finishes first. This would mean that if one worker gets stuck in an infinite loop, another worker can still supply the training process with new data. In that case I wonder what happens to the stuck worker. Does it get killed, or does it get sidelined in some way?

To give some context to my reason for asking this question:
I am doing object detection, and in my dataset I have a transform that does a random crop. However, I don’t want random crops that contain no bounding boxes. So I wrote a loop inside the transform that keeps taking random crops of the same image until it finds a candidate with a bounding box contained in it.

My fear is that there might be cases in which no such crop is ever found (maybe there is no bounding box in the image). I want to understand how the DataLoader and the training process would deal with that.

Thanks a lot.

ptrblck · July 9, 2020, 2:06am

Yes, your first point is correct. If I remember correctly, the order of the returned batches wasn’t defined some time ago (in 0.4?), which could yield non-deterministic behavior. In the following code snippet, you can see that a slow worker0 holds back the batches from worker1 and that the output order is still the expected one.

import torch
from torch.utils.data import Dataset, DataLoader


class MyDataset(Dataset):
    def __init__(self):
        self.data = torch.arange(100)

    def __getitem__(self, index):
        # get_worker_info() returns None when called in the main process
        worker_info = torch.utils.data.get_worker_info()
        if worker_info is not None:
            worker_id = worker_info.id
            print('worker_id {} calling with index {}'.format(worker_id, index))
            if worker_id == 0:
                # Busy loop that makes worker0 artificially slow
                print('slowing down worker0')
                a = 0.
                for idx in range(10000000):
                    a += idx
        x = self.data[index]
        return x

    def __len__(self):
        return len(self.data)


if __name__=='__main__':
    dataset = MyDataset()
    loader = DataLoader(dataset, batch_size=2, num_workers=2, timeout=1)

    for data in loader:
        print(data)
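
To make the ordering behavior a bit more concrete, here is a simplified pure-Python model of in-order delivery. It is only a sketch and not the actual DataLoader code: the function name deliver_in_order and the (batch_idx, data) tuples are made up for illustration. Batches that finish early are held back until every earlier batch has arrived.

```python
def deliver_in_order(completions):
    """Yield batch data in batch-index order.

    `completions` is an iterable of (batch_idx, data) tuples in the order
    in which the (hypothetical) workers finish them.
    """
    pending = {}   # finished batches that cannot be handed out yet
    next_idx = 0   # index of the batch the training loop expects next
    for batch_idx, data in completions:
        pending[batch_idx] = data
        # Hand out every batch that is now contiguous with what was delivered.
        while next_idx in pending:
            yield pending.pop(next_idx)
            next_idx += 1


# Fast worker1 finishes batches 1 and 3 before slow worker0 finishes batch 0.
finished = [(1, "b1"), (3, "b3"), (0, "b0"), (2, "b2")]
ordered = list(deliver_in_order(finished))
print(ordered)
```

In this model nothing is handed to the training loop until batch 0 arrives, so a worker that never returns would block everything behind it.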

If you slow down worker0 even more, the timeout threshold (passed to the DataLoader) would kick in and raise an error.
Note that timeout is set to 0 by default, which disables it, so your code could hang if a worker never returns from your processing logic.
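
For the random-crop loop described in the question, one way to avoid hanging a worker is to bound the number of attempts and fall back to something safe. The sketch below is only an illustration under assumptions: the helper names (crop_contains_box, sample_crop), the (x0, y0, x1, y1) box format, the max_tries value, and the fallback to the uncropped image are all my own choices, and it assumes the crop size is not larger than the image.

```python
import random


def crop_contains_box(crop, box):
    """Return True if `box` lies fully inside `crop` (both are x0, y0, x1, y1)."""
    cx0, cy0, cx1, cy1 = crop
    bx0, by0, bx1, by1 = box
    return cx0 <= bx0 and cy0 <= by0 and bx1 <= cx1 and by1 <= cy1


def sample_crop(img_w, img_h, crop_w, crop_h, boxes, max_tries=50, rng=random):
    """Try up to `max_tries` random crops that fully contain at least one box.

    Returns (crop, kept_boxes). If there are no boxes, or no suitable crop is
    found in time, falls back to the full image instead of looping forever.
    """
    if boxes:
        for _ in range(max_tries):
            x0 = rng.randint(0, img_w - crop_w)
            y0 = rng.randint(0, img_h - crop_h)
            crop = (x0, y0, x0 + crop_w, y0 + crop_h)
            kept = [b for b in boxes if crop_contains_box(crop, b)]
            if kept:
                return crop, kept
    # Fallback (an assumption of this sketch): use the uncropped image.
    return (0, 0, img_w, img_h), list(boxes)


# An image without boxes returns immediately with the fallback crop.
print(sample_crop(100, 100, 50, 50, boxes=[]))
# A small box can usually be captured within a few attempts.
print(sample_crop(100, 100, 50, 50, boxes=[(40, 40, 45, 45)]))
```

With a guard like this, __getitem__ always returns after a bounded amount of work, so the loader does not depend on the timeout argument to recover from an image that has no valid crop.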