Multiple DataLoader workers in multithreading

Dear Fellow Community Members,

I have created a training framework in which I can create multiple training jobs and train them in parallel. I run each training job in its own Thread, which works fine (see below for how the threads are created). However, the DataLoader stops working as soon as I assign multiple workers to it (num_workers > 0). The batch is never returned when iterating over the DataLoader, so the loop is stuck there without any error message; nothing happens. The for loop waits for the iterator to return a batch, but it never arrives. This does not happen when I train outside of a Thread with multiple workers, and with num_workers=0 the training works just fine inside the threads. Here is a little test script that reproduces the problem.

    from threading import Thread

    import torch


    class dataset(torch.utils.data.Dataset):
        """Dummy dataset returning random images with a constant label."""

        def __getitem__(self, index):
            image = torch.rand((3, 224, 224))
            label = 1
            return image, label

        def __len__(self):
            return 96


    def job(num_workers):
        print('start job with {} workers'.format(num_workers))
        dataLoader = torch.utils.data.DataLoader(dataset=dataset(),
                                                 batch_size=32,
                                                 shuffle=True,
                                                 num_workers=num_workers)
        # Iterate once over the loader; with num_workers > 0 inside a Thread
        # this loop never receives a batch.
        for i, batch in enumerate(dataLoader):
            print('{} | batch id = {}/{}'.format(num_workers, i + 1, len(dataLoader)))

    print('run job outside of thread with 0 workers')
    job(num_workers=0)
    print('run job outside of thread with 4 workers')
    job(num_workers=4)

    print('run job in thread with 0 workers')
    thread = Thread(target=job, args=[0])  # job is my job function
    thread.daemon = False
    thread.start()
    thread.join()

    print('run job in thread with 4 workers')
    thread = Thread(target=job, args=[4])  # job is my job function
    thread.daemon = False
    thread.start()
    thread.join()

    print('run two threads together with both 0 workers')
    thread = Thread(target=job, args=[0])  # job is my job function
    thread.daemon = False
    thread2 = Thread(target=job, args=[0])  # job is my job function
    thread2.daemon = False
    thread.start()
    thread2.start()
    thread.join()
    thread2.join()

    print('run two threads together with 0 and 4 workers')
    thread = Thread(target=job, args=[0])  # job is my job function
    thread.daemon = False
    thread2 = Thread(target=job, args=[4])  # job is my job function
    thread2.daemon = False
    thread.start()
    thread2.start()
    thread.join()
    thread2.join()

Output:

    run job outside of thread with 0 workers
    start job with 0 workers
    0 | batch id = 1/3
    0 | batch id = 2/3
    0 | batch id = 3/3
    run job outside of thread with 4 workers
    start job with 4 workers
    4 | batch id = 1/3
    4 | batch id = 2/3
    4 | batch id = 3/3
    run job in thread with 0 workers
    start job with 0 workers
    0 | batch id = 1/3
    0 | batch id = 2/3
    0 | batch id = 3/3
    run job in thread with 4 workers
    start job with 4 workers
    4 | batch id = 1/3
    4 | batch id = 2/3
    4 | batch id = 3/3
    run two threads together with both 0 workers
    start job with 0 workers
    start job with 0 workers
    0 | batch id = 1/3
    0 | batch id = 1/3
    0 | batch id = 2/3
    0 | batch id = 2/3
    0 | batch id = 3/3
    0 | batch id = 3/3
    run two threads together with 0 and 4 workers
    start job with 0 workers
    start job with 4 workers
    0 | batch id = 1/3
    0 | batch id = 2/3
    0 | batch id = 3/3

So the second thread with 4 workers never starts returning batches.
Does anyone have an idea what is causing this issue?
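
In the meantime, the workaround I am considering is to launch each training job in its own process instead of a thread, assuming the hang is specific to creating a multi-worker DataLoader from a non-main thread. This is only a sketch (the helper name run_jobs_in_processes is made up for illustration), not something I have verified to fix the problem:

    import torch.multiprocessing as mp


    def run_jobs_in_processes():
        # One process per training job instead of one thread per job;
        # each process gets its own main thread from which the DataLoader
        # can create its worker processes.
        # `job` is the function from the test script above.
        processes = []
        for num_workers in (0, 4):
            p = mp.Process(target=job, args=(num_workers,))
            p.start()
            processes.append(p)
        for p in processes:
            p.join()


    if __name__ == '__main__':
        run_jobs_in_processes()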

Best,
Paul
