Dear Fellow Community Members,
I have create some sort of a training framework in which I can create multiple training jobs and train them parallel. I run each training job in its own Thread which is working fine (see blow how the thread is created). However, the dataloader stops “working” when I assign multiple workers to the dataloader (num_workers>0). It appears that the batch is never returned when iterating the dataloader. So it is stuck there without any error msg. Nothing happens. The for loop waits for the iter to return a batch but is does not arrive. This does not happen for me if I train outside of a Thread with multiple workers. When the num_workers=0 the training works just fine within the threads. Here is a little test script to simulate this problem.
from threading import Thread
import torch
class dataset():
def __getitem__(self, index):
image = torch.rand((3,224, 224))
label = 1
return image, label
def __len__(self):
return 96
def job(num_workers):
print('start job with {} workers'.format(num_workers))
dataLoader = torch.utils.data.DataLoader(dataset=dataset(),
batch_size=32,
shuffle=True,
num_workers=num_workers)
for i, batch in enumerate(dataLoader):
print('{} | batch id = {}/{}'.format(num_workers, i+1, len(dataLoader)))
print('run job outside of thread with 0 workers')
job(num_workers=0)
print('run job outside of thread with 4 workers')
job(num_workers=4)
print('run job in thread with 0 workers')
thread = Thread(target=job, args=[0]) # job is my job function
thread.daemon = False
thread.start()
thread.join()
print('run job in thread with 4 workers')
thread = Thread(target=job, args=[4]) # job is my job function
thread.daemon = False
thread.start()
thread.join()
print('run two threads together with both 0 workers')
thread = Thread(target=job, args=[0]) # job is my job function
thread.daemon = False
thread2 = Thread(target=job, args=[0]) # job is my job function
thread2.daemon = False
thread.start()
thread2.start()
thread.join()
thread2.join()
print('run two threads together with 0 and 4 workers')
thread = Thread(target=job, args=[0]) # job is my job function
thread.daemon = False
thread2 = Thread(target=job, args=[4]) # job is my job function
thread2.daemon = False
thread.start()
thread2.start()
thread.join()
thread2.join()
Output:
run job outside of thread with 0 workers
start job with 0 workers
0 | batch id = 1/3
0 | batch id = 2/3
0 | batch id = 3/3
run job outside of thread with 4 workers
start job with 4 workers
4 | batch id = 1/3
4 | batch id = 2/3
4 | batch id = 3/3
run job in thread with 0 workers
start job with 0 workers
0 | batch id = 1/3
0 | batch id = 2/3
0 | batch id = 3/3
run job in thread with 4 workers
start job with 4 workers
4 | batch id = 1/3
4 | batch id = 2/3
4 | batch id = 3/3
run two threads together with both 0 workers
start job with 0 workers
start job with 0 workers
0 | batch id = 1/3
0 | batch id = 1/3
0 | batch id = 2/3
0 | batch id = 2/3
0 | batch id = 3/3
0 | batch id = 3/3
run two threads together with 0 and 4 workers
start job with 0 workers
start job with 4 workers
0 | batch id = 1/3
0 | batch id = 2/3
0 | batch id = 3/3
So the second thread with 4 workers never starts to return batches.
Does anyone have an idea what is the cause to this issue?
Best,
Paul