Do multiple DataLoader processes share resource reader instances?

mattrobin · November 8, 2017, 2:28am

I’m running into a problem with my data file reader, and I believe it may be because this specific file reader can’t be used by multiple processes at the same time. Do the DataLoader’s worker processes share file readers? That is, if my custom Dataset has a self.my_file_reader = MyFileReader(file_path), which is then used in __getitem__, is the same reader being used by all of the DataLoader processes? Or does each process open its own instance of MyFileReader? If they are all sharing the one instance, is there an easy way to change that so that each creates its own instance? Thank you!

SimonW · November 8, 2017, 2:36am

Different processes do not share memory. So once the loader is running, each worker process has its own copy of the reader object.
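
On the last part of the question (how to make each process create its own instance), one common pattern is to open the reader lazily, the first time __getitem__ runs in a given process, instead of in __init__. The sketch below is only an illustration under assumptions, not something posted in this thread: LazyReaderDataset, open_reader and fake_open_reader are hypothetical names, the list-backed reader is a stand-in so the code runs without any video file, and a real version would subclass torch.utils.data.Dataset and pass the real reader constructor (MyFileReader in the question) as open_reader.

```python
import os


class LazyReaderDataset:
    """Map-style dataset sketch that opens its reader once per process.

    Hypothetical example: in real use this would subclass
    torch.utils.data.Dataset, and open_reader would be the real reader
    constructor (MyFileReader in the question).
    """

    def __init__(self, file_path, length, open_reader):
        self.file_path = file_path
        self.length = length
        self.open_reader = open_reader
        # Nothing is opened here, so no handle exists before workers start.
        self._reader = None
        self._reader_pid = None

    def _get_reader(self):
        # Open a reader if this process has none yet, or if the cached one
        # was created in a different process (e.g. inherited through fork).
        pid = os.getpid()
        if self._reader is None or self._reader_pid != pid:
            self._reader = self.open_reader(self.file_path)
            self._reader_pid = pid
        return self._reader

    def __len__(self):
        return self.length

    def __getitem__(self, index):
        return self._get_reader()[index]


def fake_open_reader(file_path):
    # Stand-in for a real reader: a list, so the sketch runs anywhere.
    return [f"{file_path}:frame{i}" for i in range(3)]


dataset = LazyReaderDataset("video.mp4", length=3, open_reader=fake_open_reader)
first_frame = dataset[0]
```

With this layout, the first __getitem__ call inside each worker is what triggers the open, so every worker should end up with a reader it created itself.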

mattrobin · November 8, 2017, 2:39am

Thank you! Now, I just need to figure out what else is causing the problem…

SimonW · November 8, 2017, 2:40am

What is the issue specifically?

mattrobin · November 8, 2017, 2:57am

I’m using imageio's video reader to read frames from a video. I’m getting an error (seemingly consistently between 10,000 and 20,000 steps of training, never mind specifically how much that is): imageio.core.format.CannotReadFrameError: Could not read frame -1: Frame is 0 bytes, but expected 1244160. This is not due to me passing in the wrong index (doing that on purpose causes different errors). It seems to be internal to imageio, and I thought it might be due to the multiprocessing in some way (it still might be, but not due to a shared reader).

I had suspected the multiprocessing due to a similar issue someone else mentioned, but they had resolved it by creating separate instances of the reader.

SimonW · November 8, 2017, 2:59am

It can be related indeed. Can you try num_workers = 0 and see if the issue still happens?

mattrobin · November 8, 2017, 3:01am

Yep, I’m currently running two machines with num_workers = 0 and 1 to see, but it will take an hour or so before they reach the number of steps at which it’s been consistently happening.

Update: Got past 20,000 steps without issue. So it seems likely it is due to multiple readers reading from the same file. Thanks for your help!

Update 2: num_workers = 0 works. num_workers = 1 still results in the same error.

smth · November 9, 2017, 11:02pm

If you are using Python 3, use this option (at the top of your main script, before you import imageio):

import torch
if __name__ == "__main__":
    import torch.multiprocessing
    # Start worker processes with "spawn" instead of the default "fork".
    torch.multiprocessing.set_start_method("spawn")

I wonder if imageio is not fork-safe, and hence your error.
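
To illustrate what "not fork-safe" can mean in general (this is not a diagnosis of imageio, whose internals are not examined in this thread): on POSIX systems a forked child inherits the parent's open file descriptors, and parent and child share a single file offset for each of them. A read in one process therefore moves the position seen by the other. Below is a minimal sketch using only the standard library. shared_offset_demo is a hypothetical name, and the code relies on os.fork, so it is assumed to run on Linux or macOS, not Windows.

```python
import os
import tempfile


def shared_offset_demo():
    """Return what the parent reads after a forked child read the same fd."""
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, b"0123456789")
        os.lseek(fd, 0, os.SEEK_SET)  # rewind before forking

        pid = os.fork()
        if pid == 0:
            # Child: consume 4 bytes through the inherited descriptor,
            # then exit immediately without running the parent's cleanup.
            try:
                os.read(fd, 4)
            finally:
                os._exit(0)

        os.waitpid(pid, 0)
        # Parent: the shared offset was advanced by the child's read,
        # so this does not start from the beginning of the file.
        return os.read(fd, 4)
    finally:
        os.close(fd)
        os.remove(path)
```

With the "spawn" start method nothing is inherited this way: each worker starts from a fresh interpreter and receives a pickled copy of the Dataset instead.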

mattrobin · November 12, 2017, 8:55pm

Sorry for the delayed testing of your suggestion.

Adding this in directly results in a TypeError: can't pickle _thread.lock objects. I suspected this was from the imageio reader, so I temporarily tried removing it from my custom Dataset object. When doing so I get the following error:

RuntimeError:
    An attempt has been made to start a new process before the
    current process has finished its bootstrapping phase.

    This probably means that you are not using fork to start your
    child processes and you have forgotten to use the proper idiom
    in the main module:

        if __name__ == '__main__':
            freeze_support()
            ...

    The "freeze_support()" line can be omitted if the program
    is not going to be frozen to produce an executable.

This is surprising, as the call you recommended was in a __name__ == '__main__' block and is being run on Ubuntu (and from what I can find I should only expect this error on Windows).

BBCraker · December 27, 2020, 12:21pm

Any update on this issue? @smth @mattrobin Thanks in advance!

BBCraker · January 17, 2021, 5:57pm

I ran into a similar issue in DDP, where I loaded a tfrecord dataset with num_workers greater than 1 and got the same error: TypeError: can’t pickle _thread.lock objects.
Any update? Thanks in advance!
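
On the TypeError about _thread.lock objects: when worker processes are started with "spawn", the Dataset is pickled and sent to each worker, so any attribute that holds a thread lock (an open reader often does) makes pickling fail. One possible workaround, sketched here under assumptions, is to leave the reader out of the pickled state and re-create it on first use in the worker. PicklableDataset and FakeReader are hypothetical stand-ins, not code from this thread. FakeReader holds a threading.Lock only to reproduce the pickling failure.

```python
import pickle
import threading


class FakeReader:
    """Stand-in for a real reader; the lock makes instances unpicklable."""

    def __init__(self, file_path):
        self.file_path = file_path
        self._lock = threading.Lock()

    def read(self, index):
        with self._lock:
            return (self.file_path, index)


class PicklableDataset:
    """Dataset sketch that drops its reader when pickled."""

    def __init__(self, file_path):
        self.file_path = file_path
        self._reader = FakeReader(file_path)  # opened eagerly in the parent

    def __getstate__(self):
        # Copy the attributes but leave the reader out of the pickled state.
        state = self.__dict__.copy()
        state["_reader"] = None
        return state

    def __getitem__(self, index):
        if self._reader is None:  # first access after unpickling
            self._reader = FakeReader(self.file_path)
        return self._reader.read(index)


dataset = PicklableDataset("video.mp4")
# Without __getstate__, pickle.dumps would raise TypeError because of the lock.
clone = pickle.loads(pickle.dumps(dataset))
item = clone[5]
```

The unpickled copy starts with no reader and opens one on its first __getitem__ call, which also means each worker ends up with a reader of its own.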

wangjf8090 · May 28, 2022, 7:55am

I get the same problem when running two PyTorch processes for the same task, each loading a different dataloader (distinguished by different names). Errors occurred during training; for example, one dataloader resulted in 0 accuracy. It also happened when I used two dataloaders on the same file: the second dataloader got 0 accuracy after the model test. Maybe it’s because the file reader is the same one even when different dataloaders are used? I can’t find the reason.

wangjf8090 · May 28, 2022, 8:22am

Update:
When I set num_workers to 0, the two dataloaders work correctly.