Error while Multiprocessing in Dataloader

I just got this error. My data is fine, and training with num_workers=0 is too slow. For whatever reason, I was able to fix it by replacing
from tqdm.auto import tqdm
with just
from tqdm import tqdm
Something seems to bug out when parallel dataloaders are wrapped in the fancy notebook tqdm with my versions of Node.js and ipywidgets. Hope this helps others.
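In case it helps, here is a minimal sketch (toy tensors, made-up batch size and worker count, not my actual training code) showing where the import swap goes; everything else about the loop stays the same:

```python
# The only change is the tqdm import: the plain tqdm avoids the notebook/ipywidgets
# machinery that appears to clash with DataLoader worker processes in some setups.
from tqdm import tqdm  # instead of: from tqdm.auto import tqdm

import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(1024, 10), torch.randint(0, 2, (1024,)))
loader = DataLoader(dataset, batch_size=32, num_workers=4, shuffle=True)

for batch_inputs, batch_targets in tqdm(loader, desc="training"):
    pass  # forward/backward pass would go here
```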

27 Likes

This solved my problem.
Thanks a lot!

This is literally gold!

This solves my problem, too.

but why???

This helped! Thank you!

I’m not even using tqdm and my code works fine with num_workers=0. What could be the problem?

2 Likes

Same here: no tqdm, and the code worked with num_workers=0, 1, or 2, but I saw a lot of these errors when num_workers>=3.
I ran the code inside Docker, and increasing the shared memory size (--shm-size 256M → 1G) solved the problem for me; it now works fine with num_workers=12.
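For reference, a hypothetical docker run invocation; the image name and script are placeholders, and the only relevant change is the --shm-size value:

```bash
# Larger /dev/shm so DataLoader workers have room for shared-memory tensors.
docker run --shm-size=1g -it my-training-image python train.py
```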

That works for me. Thanks!

Not using tqdm, but changing num_workers from 1 to 0 made this error go away on my Colab run! :slight_smile:

2 Likes

That really helped! Replacing
from tqdm.auto import tqdm
with just
from tqdm import tqdm

Thanks a lot!

I just changed num_workers to 0, ran the training once, and then changed it back to num_workers=4. The error just disappeared afterwards.

1 Like

The warnings were annoying me a lot. Thanks :))

FWIW, if you are using pytorch-lightning, the suggested import solution did not help. In my case, a custom dataset was generating ALL of its data on the fly. Altering it to pregenerate a substantial amount of data made the issue go away, even with non-zero workers.

Hi sneiman, could you please add more details about “generating ALL data on the fly” and “pregenerate substantial data”? I’m also using pytorch-lightning and this is really causing me trouble.

My initial synthetic dataset generated the data on demand, meaning the dataset did not have any data pre-cached, like a generator in Python. The data was not created until the __getitem__() call. This had the problem unless I set the number of workers to 0.

I guessed that perhaps not having data available was confusing the splitting of the data job among a number of workers. So I changed the dataset to create all of the data when the constructor was called. This solved the problem; I usually run this with 8 workers now.
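Roughly, the two versions of the dataset looked like this (a minimal sketch with synthetic tensors and made-up shapes, not my actual code):

```python
import torch
from torch.utils.data import Dataset

class OnTheFlyDataset(Dataset):
    """Samples are created only when a worker calls __getitem__ (the problematic version)."""
    def __init__(self, length=1000):
        self.length = length

    def __len__(self):
        return self.length

    def __getitem__(self, idx):
        x = torch.randn(10)            # generated on demand, per call
        y = (x.sum() > 0).long()
        return x, y

class PregeneratedDataset(Dataset):
    """All samples are created once in the constructor and only indexed later (the fix)."""
    def __init__(self, length=1000):
        self.x = torch.randn(length, 10)          # generated up front
        self.y = (self.x.sum(dim=1) > 0).long()

    def __len__(self):
        return len(self.x)

    def __getitem__(self, idx):
        return self.x[idx], self.y[idx]
```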

Hope this helps.

sneiman

Thanks for your reply! This is curious since as far as I know the dataloader workers will prefetch batches automatically. I could not verify this as I have a lot of videos that will not fit into memory. I’ll try to dig deeper. Thank you again.

I thought it was odd as well; I had imagined the data loaders simply make lots of calls to __getitem__(). However, PTL does do a lot of its own multiprocessing management, particularly with DDP.

Good luck,

s

In the dataloader, persistent_workers=True works for me.
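For context, a minimal sketch (toy dataset, made-up batch size) of where the flag goes; persistent_workers=True keeps the worker processes alive between epochs instead of tearing them down and respawning them, and it requires num_workers > 0:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(1024, 10), torch.randint(0, 2, (1024,)))
loader = DataLoader(
    dataset,
    batch_size=32,
    num_workers=4,
    persistent_workers=True,  # reuse workers across epochs; needs num_workers > 0
)
```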

3 Likes

Thank you bro, it worked!

This worked for me, too. Thank you!