Errors when using num_workers>0 in DataLoader

Hi all, I’m facing a problem when setting num_workers in the DataLoader to a value greater than 0.
In particular, I’m trying to train a custom model on a custom dataset.
If I keep num_workers=0, everything is fine and the whole process completes successfully.
But with any other value of num_workers, the problem persists regardless of the batch_size, the number of training epochs, etc.
It also seems that when num_workers is greater than 0 (x, for instance), the script tries to run x times and then the error appears.

The strange part is that when I run the script through the bottleneck utility with num_workers greater than 0, it works correctly.
So my question is whether the bottleneck utility applies some kind of optimization that I don’t, or whether we need to do something in particular when setting num_workers greater than 0.

P.S.
I want to use num_workers>0 in order to push my GPU to maximum usage.

Thanks.

num_workers only sets the number of CPU worker processes in the data loader. It has nothing to do with GPU utilization directly, although faster batch preprocessing leads to batches being loaded faster and thus more streamlined GPU usage. On Windows, due to multiprocessing restrictions, setting num_workers to > 0 can lead to errors. This is expected, so don’t worry about it. You can set it to 0.

Thanks for the reply!
But I saw that when I run the program with the bottleneck tool, with num_workers > 0, the GPU “cuda usage” is much more efficient and stable: it stays at about 95-100%.
In contrast, when I run the script normally, without the bottleneck tool and with num_workers set to 0, the GPU cuda usage is discontinuous and unstable (it goes to 50%, then 90%, then drops to 25%, and so on).
So this behaviour makes me think that it has something to do with GPU usage.

Ok, it seems that I have solved the problem, and now I can set num_workers>0.
What I did was just to put all the code that trains the network inside a function, train(). In summary:

def train():
    # All the code that trains the network goes here
    ...


if __name__ == '__main__':
    train()

Could anyone explain how and why this works?

You can have a look here: https://docs.python.org/3.7/library/multiprocessing.html#multiprocessing-programming
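
To make the linked guideline concrete, here is a small sketch of why the guard helps. It assumes the "spawn" start method (the one used on Windows), where each worker process re-imports the main script under the module name `__mp_main__` instead of `__main__`. The snippet only imitates that re-import with `runpy`; it does not start real workers, and the two script strings and the `run_as` helper are made up for illustration.

```python
import os
import runpy
import tempfile

# Hypothetical stand-ins for a training script, used only for this illustration.
UNGUARDED = "events.append('training started')\n"
GUARDED = (
    "def train():\n"
    "    events.append('training started')\n"
    "\n"
    "if __name__ == '__main__':\n"
    "    train()\n"
)


def run_as(source, run_name):
    """Execute `source` as a module named `run_name` and return what it recorded."""
    events = []
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "script.py")
        with open(path, "w") as f:
            f.write(source)
        # `events` is injected into the script's globals so it can record calls.
        runpy.run_path(path, init_globals={"events": events}, run_name=run_name)
    return events


# Normal launch: both variants start training once.
print(run_as(UNGUARDED, "__main__"))
print(run_as(GUARDED, "__main__"))

# Imitated worker import: only the unguarded script starts training again.
print(run_as(UNGUARDED, "__mp_main__"))
print(run_as(GUARDED, "__mp_main__"))
```

Under this assumption, an unguarded script runs its top-level training code again in every worker, which would match the "script tries to run x times" symptom described above. With the guard, a worker only picks up the definitions and then waits for batches to load.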

Note that Windows can still cause issues here, be it pickling errors or slow performance. See https://github.com/pytorch/pytorch/issues/12831. The safest approach is using 0 on Windows.
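
On the pickling errors: with the "spawn" start method, objects handed to the workers (for example the dataset and any transform or collate function it holds) have to be pickled first. A rough way to check an object up front is to try pickling it yourself. This is a sketch only, and `can_be_sent_to_worker` is a made-up helper name, not part of PyTorch.

```python
import math
import pickle


def can_be_sent_to_worker(obj):
    """Return True if `obj` can be pickled, which spawned workers require."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False


# A function that can be imported by name pickles fine.
print(can_be_sent_to_worker(math.sqrt))

# A lambda has no importable name, so pickling it fails.
print(can_be_sent_to_worker(lambda x: x * 2))
```

If a transform fails this check, replacing the lambda with a named, module-level function is usually enough.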

Ok, thank you anyway!

Thanks for the solution. It worked for me.
But I have the same question: how did it solve the problem?