Hey @Lewis_Liu, which part of the program is distributed? Since torch.distributed does not support Windows yet, I assume the working version of the program on Windows does not use distributed training?
The training isn’t distributed and torch.distributed isn’t used.
By "distributed" I mean that the workers used to collect data are distributed, and the network params are sent from the trainer to these workers through an mp.Queue.
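For context, the pattern described looks roughly like this (a minimal sketch using plain `multiprocessing`, with a dict standing in for the network's `state_dict`; the names `worker` and `collect_rollouts` are illustrative, not from the original program):

```python
import multiprocessing as mp

def worker(param_queue, result_queue):
    # Each data-collection worker blocks until the trainer sends params.
    params = param_queue.get()
    # ... run data collection with these params and push results back.
    result_queue.put({"reward": 1.0, "params_version": params["version"]})

def collect_rollouts(num_workers=2):
    param_queue = mp.Queue()
    result_queue = mp.Queue()
    procs = [mp.Process(target=worker, args=(param_queue, result_queue))
             for _ in range(num_workers)]
    for p in procs:
        p.start()
    # Trainer broadcasts the current parameters to every worker.
    for _ in procs:
        param_queue.put({"version": 0, "weights": [0.1, 0.2]})
    # Drain results BEFORE joining, so the queue feeder threads can flush.
    results = [result_queue.get() for _ in procs]
    for p in procs:
        p.join()
    return results
```

Note the ordering: results are drained from the queue before `join()` is called, since joining a process that still has unflushed items in a `multiprocessing.Queue` is itself a classic deadlock.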
Once the data are collected and the trainer starts to train, the workers stop working, so I suppose there's no interaction between the workers and the trainer. What seems really strange to me is that the backward pass completes but the optimizer step does not.
I just added the line and the prints are the same.
FYI, after it hangs there, I killed the program and it showed the traceback below. I'm not sure if this is helpful:
```
Traceback (most recent call last):
  File "test.py", line 23, in
  File "/apps/software/Python/3.7.4-GCCcore-8.3.0/lib/python3.7/multiprocessing/process.py", line 140, in join
    res = self._popen.wait(timeout)
  File "/apps/software/Python/3.7.4-GCCcore-8.3.0/lib/python3.7/multiprocessing/popen_fork.py", line 48, in wait
    return self.poll(os.WNOHANG if timeout == 0.0 else 0)
  File "/apps/software/Python/3.7.4-GCCcore-8.3.0/lib/python3.7/multiprocessing/popen_fork.py", line 28, in poll
    pid, sts = os.waitpid(self.pid, flag)
```
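That traceback says the main process was blocked inside `Process.join()`, i.e. waiting forever for a child that never exits. One way to turn such a silent hang into a diagnosable failure is a bounded join (an illustrative sketch; `stuck_child` just simulates a worker that never returns):

```python
import multiprocessing as mp
import time

def stuck_child():
    time.sleep(60)  # stands in for a worker that never finishes

def join_with_timeout(proc, timeout=1.0):
    # Wait at most `timeout` seconds instead of blocking forever.
    proc.join(timeout)
    if proc.is_alive():
        # The child did not exit in time: report and clean up instead of hanging.
        proc.terminate()
        proc.join()
        return False
    return True
```

With this, a hung worker shows up as `join_with_timeout(...) == False` rather than an indefinitely blocked main process.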
Are you using torch.multiprocessing.SimpleQueue? If yes, when sharing CPU tensors, does the program guarantee that the process owning the shared data is still alive when the receiver uses it? And are you using spawn to create the processes?
For CUDA tensors, unlike CPU tensors, the sending process is required to keep the original tensor alive for as long as the receiving process retains a copy of it. The refcounting is implemented under the hood, but it requires users to follow certain best practices.
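One way to honor that requirement is to keep the sender alive until the receiver explicitly signals it is done. A minimal sketch with plain `multiprocessing` (a Python list stands in for the shared tensor; the function names are illustrative):

```python
import multiprocessing as mp

def producer(queue, consumed):
    data = [1.0, 2.0, 3.0]  # stands in for the shared tensor
    queue.put(data)
    # Best practice: the sender must outlive the receiver's use of the data,
    # so block here until the consumer signals that it is finished.
    consumed.wait(timeout=10)

def demo():
    queue = mp.Queue()
    consumed = mp.Event()
    p = mp.Process(target=producer, args=(queue, consumed))
    p.start()
    data = queue.get()   # receiver obtains the shared data
    total = sum(data)    # ... and uses it while the producer is still alive
    consumed.set()       # only now is the producer allowed to exit
    p.join()
    return total
```

The `Event` is what prevents the producer from exiting while the consumer might still be reading, which is the failure mode the refcounting rules are guarding against.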
Not completely solved, but I was able to find what the issue was and found a workaround. The issue is that the network was somehow shared with other processes. So my practical suggestion would be to check everything that might lead to your network being shared/accessed, e.g. a mistake in using copy.copy or copy.deepcopy to send the state_dict.
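The copy.copy vs copy.deepcopy distinction matters here: a shallow copy duplicates only the outer dict, so the underlying parameter containers remain shared between trainer and workers. A sketch with plain dicts standing in for a `state_dict`:

```python
import copy

# A toy stand-in for a network's state_dict: the values are mutable containers.
live = {"layer1.weight": [0.5, 0.5], "layer1.bias": [0.0]}

shallow = copy.copy(live)      # new outer dict, but the lists are still shared
deep = copy.deepcopy(live)     # fully independent copy

# The trainer later updates its parameters in place:
live["layer1.weight"][0] = 999.0

# The shallow copy sees the mutation; the deep copy does not.
```

Sending the shallow copy to workers therefore still leaves the network's parameters shared across processes, which can produce exactly the kind of hidden interaction described above.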