Can PyTorch load code from disk after running for a while?

I ran into a surprising situation: my training had been running for a while and then crashed with a syntax error, because I had edited the script after starting the run and introduced the error. I am using distributed data parallel and torch.utils.data, both of which I know spawn workers, but I thought they only did so at launch. Can someone explain how this could happen?

Possibly related: I noticed that after validation runs I get warning messages like

/home/grant/miniconda3/lib/python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 3 leaked semaphores to clean up at shutdown
  len(cache))

which I would normally only expect to see when the whole script stops. (By the way, I have no idea what these warnings mean – do they indicate that I am doing something a little wrong, or are they just unavoidable?)

Hey,

I would guess this is due to this issue: https://github.com/pytorch/pytorch/issues/15849
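Right now the DataLoader recreates its worker processes every time you start iterating over it (so typically once per epoch, and again for each validation pass), I’m afraid. A worker that is started later can therefore end up reading your script from disk again, including any edits made since launch.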
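To illustrate why a late edit can bite, here is a minimal sketch using only the standard library, not PyTorch itself. It assumes (I have not checked your exact setup) that the workers are started with the spawn start method, in which case each new worker is a fresh interpreter that reads the script from disk again. The file name train.py and the helper run_script are made up for the demo; subprocess stands in for the worker start.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_script(path):
    """Start a fresh interpreter on `path`, much like a newly started worker would re-read it."""
    return subprocess.run(
        [sys.executable, str(path)], capture_output=True, text=True
    )


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "train.py"  # hypothetical script name
        script.write_text("print('epoch ok')\n")
        before = run_script(script)  # process started before the edit

        # Simulate editing the file while the "training" is still running.
        script.write_text("print('epoch ok'\n")  # closing parenthesis removed
        after = run_script(script)  # process started after the edit
    return before, after


before, after = demo()
print("before edit: return code", before.returncode)
print("after edit: return code", after.returncode)
print("SyntaxError reported after edit:", "SyntaxError" in after.stderr)
```

The process started before the edit runs fine, while the one started afterwards fails at parse time, which matches a crash that only shows up once new workers are created. If your PyTorch version is recent enough, the DataLoader argument persistent_workers=True keeps the workers alive between epochs, which should avoid the repeated restarts.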

By the way, I have no idea what these warnings mean – do they indicate that I am doing something a little wrong, or are they just unavoidable?

Do you do any custom multiprocessing, or do you just use the DataLoader class?

Thanks! This behavior surprised me. I’m not doing any custom multiprocessing except for one call to torch.multiprocessing.spawn to use distributed data parallel.

I’m not sure then… If you don’t have any deadlock issues, I would ignore the warning :smiley: