I ran into a surprising situation: my training had been running for a while and then crashed with a syntax error, because I had edited the script after launching the run and accidentally introduced the error. I am using distributed data parallel and a torch.utils.data DataLoader, both of which I know spawn worker processes, but I thought they only did this at launch. Can someone explain how this could happen?
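For context, here is a stripped-down sketch of the kind of setup I mean (the dataset, model, and hyperparameters are placeholders, not my actual script); the relevant part is just that the DataLoader has num_workers > 0 and the loop runs for many epochs:

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset

def main():
    # Assumes one process per GPU has already been launched (e.g. via torchrun).
    dist.init_process_group("nccl")
    rank = dist.get_rank()
    torch.cuda.set_device(rank)

    dataset = TensorDataset(torch.randn(1024, 10), torch.randn(1024, 1))
    # num_workers > 0 makes the DataLoader start worker processes; with the
    # default persistent_workers=False they are torn down and re-created
    # every time a new iterator is made, i.e. at the start of every epoch.
    loader = DataLoader(dataset, batch_size=32, num_workers=4)

    model = DDP(torch.nn.Linear(10, 1).cuda(rank), device_ids=[rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = torch.nn.MSELoss()

    for epoch in range(10):
        for x, y in loader:  # fresh DataLoader workers start here each epoch
            optimizer.zero_grad()
            loss = loss_fn(model(x.cuda(rank)), y.cuda(rank))
            loss.backward()
            optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()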
Possibly related: I noticed that after each validation run I get warning messages like

/home/grant/miniconda3/lib/python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 3 leaked semaphores to clean up at shutdown len(cache))

which I would normally only expect to see when the whole script stops. (By the way, I have no idea what these warnings mean. Do they indicate that I am doing something slightly wrong, or are they just unavoidable?)