I ran into a surprising situation: my training had been running for a while and then crashed with a syntax error, because I had edited the script after launching the run and accidentally introduced the error. I am using distributed data parallel and a torch.utils.data DataLoader, both of which I know spawn worker processes, but I thought they only did this at launch. Can someone explain how this could happen?
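For context, here is a stripped-down sketch of the kind of setup I mean (the dataset, model, and hyperparameters are placeholders, not my actual script); the relevant part is just that the DataLoader has num_workers > 0 and the loop runs for many epochs:

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset

def main():
    # Assumes one process per GPU has already been launched (e.g. via torchrun).
    dist.init_process_group("nccl")
    rank = dist.get_rank()
    torch.cuda.set_device(rank)

    dataset = TensorDataset(torch.randn(1024, 10), torch.randn(1024, 1))
    # num_workers > 0 makes the DataLoader start worker processes; with the
    # default persistent_workers=False they are torn down and re-created
    # every time a new iterator is made, i.e. at the start of every epoch.
    loader = DataLoader(dataset, batch_size=32, num_workers=4)

    model = DDP(torch.nn.Linear(10, 1).cuda(rank), device_ids=[rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = torch.nn.MSELoss()

    for epoch in range(10):
        for x, y in loader:  # fresh DataLoader workers start here each epoch
            optimizer.zero_grad()
            loss = loss_fn(model(x.cuda(rank)), y.cuda(rank))
            loss.backward()
            optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()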
Possibly related: I noticed that after each validation run I get warning messages like

/home/grant/miniconda3/lib/python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 3 leaked semaphores to clean up at shutdown len(cache))

which I would normally only expect to see when the whole script stops. (By the way, I have no idea what these warnings mean. Do they indicate that I am doing something slightly wrong, or are they just unavoidable?)