Please consider this simple fragment of code:
    [...]
    print("resetting metrics")
    for m in self.metrics:
        m.reset()
    print("done")
    for bi, (images, labels, text) in enumerate(dl):
        if bi == 0:
            print("IMAGES:", images.shape)
            print("LABELS:", labels.shape)
            print("TEXT:", text.shape)
        [...]
Normally, my code runs fine, but every once in a while it gets stuck. When that happens, the only output is:
    * EPOCH 1/1000, START
    calling concrete _epoch() method
    resetting metrics
    done
and nothing else. If I interrupt the process with CTRL+C, I get this:
    [...]
    ^CError in atexit._run_exitfuncs:
    Traceback (most recent call last):
      File "/opt/conda/envs/torch/lib/python3.8/multiprocessing/popen_fork.py", line 27, in poll
        pid, sts = os.waitpid(self.pid, flag)
    KeyboardInterrupt
which seems to indicate that it was stuck in popen_fork.py, waiting for some other process to finish.
I know the fragment is very small, but does anyone know what I could try in order to solve this problem? I cannot reproduce it reliably: the code normally runs fine, and only occasionally hangs.
I do not explicitly use any multiprocessing functions, just “plain” PyTorch code. However, the DataLoader is configured to use multiple workers.
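
In case it helps with suggestions, here is a minimal sketch of how I could find out where the main process is blocked the next time it hangs. It relies on the standard-library faulthandler module. The helper name iterate_with_watchdog and the 60-second default are my own assumptions for illustration, and the sketch only reports the hang, it does not fix it.

```python
import faulthandler
import sys


def iterate_with_watchdog(iterable, timeout_s=60.0, file=sys.stderr):
    """Yield items from `iterable`, dumping all thread stacks on a stall.

    Hypothetical diagnostic helper: if creating the iterator or fetching
    the next item takes longer than `timeout_s` seconds, faulthandler
    writes the traceback of every thread to `file`. The process is not
    killed; it keeps waiting as before.
    """
    # Creating the iterator is watched too: for a DataLoader with several
    # workers, this is the step that starts the worker processes.
    faulthandler.dump_traceback_later(timeout_s, file=file)
    try:
        it = iter(iterable)
    finally:
        faulthandler.cancel_dump_traceback_later()

    while True:
        # Arm the watchdog before waiting for the next item.
        faulthandler.dump_traceback_later(timeout_s, file=file)
        try:
            item = next(it)
        except StopIteration:
            return
        finally:
            # Disarm as soon as the item arrives or the iterator ends.
            faulthandler.cancel_dump_traceback_later()
        yield item
```

With this, the loop above would become `for bi, (images, labels, text) in enumerate(iterate_with_watchdog(dl, timeout_s=120)):`. If it stalls, the dump should show whether the main process is waiting for a batch from the workers or is stuck somewhere else.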