Error while Multiprocessing in Dataloader

gaurav0651 · June 2, 2019, 2:47am

Not sure if this has been reported already, but I am getting the following AssertionError from DataLoader:

Exception ignored in: <bound method _DataLoaderIter.__del__ of <torch.utils.data.dataloader._DataLoaderIter object at 0x7fae94071d30>>
Traceback (most recent call last):
  File "/home/amit/.local/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 677, in __del__
    self._shutdown_workers()
  File "/home/amit/.local/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 659, in _shutdown_workers
    w.join()
  File "/usr/lib/python3.6/multiprocessing/process.py", line 122, in join
    assert self._parent_pid == os.getpid(), 'can only join a child process'
AssertionError: can only join a child process

matohak · January 24, 2020, 2:42pm

Have you found a solution?

uduse · March 8, 2020, 6:01am

I am using num_workers with IterableDataset and it also has this problem.

karzia · May 8, 2020, 1:25am

Yes…
Same here. For now I am running with num_workers=0:

testset = torch.utils.data.DataLoader(trainset, batch_size=4, shuffle=True, num_workers=0)

But I am worried that this is only usable for my local testing.
So I would like to know the root cause and a proper solution.

^^

sharkdeng · July 22, 2020, 2:28am

Did you solve it? Or is it a problem with num_workers?

Himani_Gulati · August 13, 2020, 8:48pm

Well, I am getting the same error. It says "can only join a child process". I do not know what that means.
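
For anyone wondering what the message means: the last frame of the traceback shows that multiprocessing's join() checks that the calling process is the one that started the worker. Below is a minimal sketch that triggers the same assertion outside of PyTorch. It is Unix-only because it relies on fork, the helper names (_idle, join_from_non_parent) are made up for illustration, and it is not a claim about which code path in the DataLoader hits the check.

```python
import multiprocessing as mp
import os


def _idle():
    """Target for the worker process; it exits immediately."""


def join_from_non_parent():
    """Start a worker, fork again, and try to join the worker from the fork.

    Returns the message reported by the forked process.
    """
    ctx = mp.get_context("fork")      # Unix only
    worker = ctx.Process(target=_idle)
    worker.start()                    # this process is now the worker's parent

    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Forked copy: it inherited the `worker` object but did not start it.
        os.close(read_fd)
        try:
            worker.join()
            message = "joined"
        except AssertionError as exc:
            message = str(exc)
        os.write(write_fd, message.encode())
        os._exit(0)                   # leave without running cleanup handlers

    os.close(write_fd)
    with os.fdopen(read_fd) as pipe:
        message = pipe.read()
    os.waitpid(pid, 0)
    worker.join()                     # allowed here: this is the real parent
    return message


if __name__ == "__main__":
    print(join_from_non_parent())
```

With assertions enabled (no -O flag), the forked copy should report the same text as the traceback. In other words, something that is not the parent of the DataLoader workers ended up running the iterator's cleanup.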

Gabriel_Barello · August 20, 2020, 5:17am

I was having this issue. It turns out it was because there was an error in the dataset object (for me it was in the __getitem__ function). I guess the DataLoader in multiprocessing mode doesn’t know how to cleanly provide you with the internal error message. If you have the same problem, try running with num_workers = 0 (data is then loaded in the main process) and it should tell you what the error is. Once you’ve fixed the error, it should work with num_workers > 0.
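
A rough sketch of the same idea without a DataLoader at all: index every item directly in the main process and collect whatever exceptions come up. It assumes a map-style dataset (one with __len__ and __getitem__). find_failing_indices and BrokenDataset are hypothetical names used only for this example.

```python
def find_failing_indices(dataset, limit=None):
    """Index items one by one in the current process and record failures.

    Returns a list of (index, exception) pairs.
    """
    count = len(dataset) if limit is None else min(limit, len(dataset))
    failures = []
    for index in range(count):
        try:
            dataset[index]
        except Exception as exc:  # keep scanning so every bad index is found
            failures.append((index, exc))
    return failures


class BrokenDataset:
    """Hypothetical stand-in for a dataset with a bug in __getitem__."""

    def __init__(self, values):
        self.values = values

    def __len__(self):
        return len(self.values)

    def __getitem__(self, index):
        # Raises ZeroDivisionError for any zero entry.
        return 1.0 / self.values[index]


if __name__ == "__main__":
    for index, exc in find_failing_indices(BrokenDataset([1, 2, 0, 4])):
        print(index, repr(exc))
```

Because nothing runs in a worker process, the exception and its index are reported directly. The limit argument is there so that a large dataset can be spot-checked first.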

Qinsheng_Zhang · October 1, 2020, 4:44am

Same issue here. I don't know where the error is.

jpleet · October 21, 2020, 8:44pm

I just got this error. My data is good, and training with num_workers=0 is too slow. For whatever reason, I was able to fix it by replacing
from tqdm.auto import tqdm
with just
from tqdm import tqdm
Something seems to bug out with parallel dataloaders wrapped around the fancy notebook tqdm with my versions of nodejs and ipywidgets. Hope this helps others.

JamesLai · November 30, 2020, 2:42pm

This solved my problem.
Thanks a lot!

Hmrishav_Bandyopadhy · December 12, 2020, 8:59am

This is literally gold!

brianw0924 · March 26, 2021, 2:03pm

This solves my problem, too.

But why?

adrofa · April 1, 2021, 10:08am

This helped! Thank you!

DarQ · June 23, 2021, 8:21pm

I’m not even using tqdm and my code works fine with num_workers=0. What could be the problem?

akos.kiss · June 27, 2021, 6:23am

Same here: no tqdm, and the code worked with num_workers=0, 1, 2, but I saw a lot of these errors when num_workers>=3.
I ran the code inside Docker, and increasing the shared memory size (--shm-size 256M → 1G) solved the problem for me. It now works fine with num_workers=12.
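
To check whether shared memory might be the bottleneck before touching the container settings, something like the sketch below can help. It assumes a Linux system where shared memory is mounted at /dev/shm, and shared_memory_free_bytes is a hypothetical helper name, not part of PyTorch or Docker.

```python
import os
import shutil


def shared_memory_free_bytes(path="/dev/shm"):
    """Return the free space in bytes at `path`, or None if it does not exist."""
    if not os.path.isdir(path):
        return None
    return shutil.disk_usage(path).free


if __name__ == "__main__":
    free = shared_memory_free_bytes()
    if free is None:
        print("/dev/shm not found on this system")
    else:
        print(f"free shared memory: {free / 2**20:.0f} MiB")
```

If the number is small relative to the batches the workers produce, starting the container with a larger value, for example docker run --shm-size=1g, is the change described above.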

Fang_Nan · September 3, 2021, 2:27pm

That works for me. Thanks!

drscotthawley · November 2, 2021, 3:56am

Not using tqdm but changing num_workers from 1 to 0 caused this error to go away on my Colab run!

Noyii · December 20, 2021, 12:37pm

It really helped to replace
from tqdm.auto import tqdm
with just
from tqdm import tqdm

Thx a lot!

steve_ari · May 3, 2022, 7:31am

I just changed num_workers to num_workers=0, ran the training once, and then changed it back to num_workers=4. The error disappeared afterwards.

agg-shambhavi · May 29, 2022, 1:26pm

The warnings were annoying me a lot. Thanks :))