Please consider this simple fragment of code:
[...]
print("resetting metrics")
for m in self.metrics:
    m.reset()
print("done")
for bi, (images, labels, text) in enumerate(dl):
    if bi == 0:
        print("IMAGES:", images.shape)
        print("LABELS:", labels.shape)
        print("TEXT:", text.shape)
[...]
Normally my code runs fine, but every once in a while it gets stuck. When that happens, the only output is:
* EPOCH 1/1000, START
calling concrete _epoch() method
resetting metrics
done
and nothing else. If I interrupt the process with CTRL+C, I see this:
[...]
^CError in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/opt/conda/envs/torch/lib/python3.8/multiprocessing/popen_fork.py", line 27, in poll
    pid, sts = os.waitpid(self.pid, flag)
KeyboardInterrupt
which seems to indicate that it was stuck in popen_fork.py, waiting for some other process to finish.
I know the fragment is very small, but does anyone know what I could try in order to solve this problem? I cannot reproduce it reliably: it normally runs fine, but occasionally it hangs.
I do not explicitly use any multiprocessing functions, just “plain” PyTorch code. However, the DataLoader is configured to use multiple workers.
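Since the hang is intermittent, one thing that might help narrow it down is a watchdog that dumps the Python tracebacks of all threads in the main process when the loop stalls. The following is only a minimal sketch using the standard-library faulthandler module; the helper name hang_watchdog and the 600-second threshold are made up for illustration and are not part of my actual code. It only reports where the main process is blocked, not what the worker processes are doing.

```python
import faulthandler
import sys
from contextlib import contextmanager


@contextmanager
def hang_watchdog(seconds, file=None):
    """Dump the tracebacks of all threads every `seconds` seconds
    for as long as the body of the `with` block is still running.

    `hang_watchdog` is a hypothetical helper name; `file` must be a
    real file object with a file descriptor (defaults to sys.stderr).
    """
    target = sys.stderr if file is None else file
    # repeat=True keeps dumping at each interval until cancelled.
    faulthandler.dump_traceback_later(seconds, repeat=True, file=target)
    try:
        yield
    finally:
        # Always disarm the watchdog, even if the body raises.
        faulthandler.cancel_dump_traceback_later()


# Intended usage around the loop from the fragment above
# (600 seconds is an arbitrary, assumed threshold):
#
# with hang_watchdog(600):
#     for bi, (images, labels, text) in enumerate(dl):
#         ...
```

If the dump shows the main process waiting inside the DataLoader iterator, that would at least confirm the workers are the ones not delivering batches.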