PyTorch MP manager dict hangs on Linux

I’m trying to share network weights between processes using a multiprocessing manager.dict.
The code is as follows:

    for name in critic_state_dict:
        self.shared_networks.critic[name] = T.tensor(critic_state_dict[name].clone().cpu().detach().numpy())

This works fine on Windows, but when I run it on a Linux cluster, it hangs in the middle of the for loop.

How do I fix this? Or, if I want to periodically share the weights among processes, how do I do it properly?
Thanks

Hey @Lewis_Liu,

Did you use fork or spawn?

Or, if I want to periodically share the weights among processes, how do I do it properly?

One solution is to create a multiprocessing queue and pass it to the child processes. Then, in the loop, use that queue to pass shared tensors.
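A minimal sketch of that pattern, assuming `torch.multiprocessing` (the `consumer` function, tensor sizes, and step count here are only illustrative):

    import torch
    import torch.multiprocessing as mp


    def consumer(queue):
        # Child process: block on the queue until the parent sends a sentinel.
        while True:
            tensor = queue.get()
            if tensor is None:
                break
            print("received tensor, sum =", tensor.sum().item())


    if __name__ == "__main__":
        # spawn tends to be safer than fork when CUDA or background threads are involved
        mp.set_start_method("spawn", force=True)

        queue = mp.Queue()
        proc = mp.Process(target=consumer, args=(queue,))
        proc.start()

        for step in range(3):
            # Pretend these are the updated weights after a training step.
            weights = torch.randn(4, 4)
            queue.put(weights)

        queue.put(None)  # tell the consumer to exit
        proc.join()

Tensors put on a `torch.multiprocessing` queue are moved to shared memory, so the consumer receives a handle to the data rather than a pickled copy.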

Hi Li,

I switched to using a Queue. But I still can’t avoid first getting the network tensors from state_dict, right?

I believe it’s spawn on my Windows workstation and fork on the Linux cluster, if I’m correct.
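For reference, this is roughly how I’m checking it (the default is fork on Linux and spawn on Windows):

    import torch.multiprocessing as mp

    print(mp.get_start_method())               # 'fork' on Linux, 'spawn' on Windows by default
    # mp.set_start_method("spawn", force=True) # uncomment to force spawn before starting processes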

I don’t have the full context here. Can you let the process holding the state_dict be the writer to the queue?

Yep.

The network is trained and updated for a step. After that, the process has a single task: writing the state_dict into the queue. The other processes don’t have direct access to the network except through the queue.
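Roughly, the structure I have in mind is the sketch below (the `Linear` layer, the dummy loss, and the step count are placeholders for my actual critic and training loop):

    import torch
    import torch.multiprocessing as mp


    def trainer(weight_queue, num_steps):
        # Sole owner of the network: trains it, then publishes the weights.
        net = torch.nn.Linear(8, 2)                        # placeholder for the real critic
        opt = torch.optim.SGD(net.parameters(), lr=0.01)

        for step in range(num_steps):
            loss = net(torch.randn(16, 8)).pow(2).mean()   # dummy training step
            opt.zero_grad()
            loss.backward()
            opt.step()

            # Publish a detached CPU copy of the weights after each update.
            cpu_state = {k: v.detach().cpu().clone() for k, v in net.state_dict().items()}
            weight_queue.put(cpu_state)

        weight_queue.put(None)  # sentinel: no more updates


    def worker(weight_queue):
        # Never touches the trainer's network directly, only the queue.
        local_net = torch.nn.Linear(8, 2)
        while True:
            state = weight_queue.get()
            if state is None:
                break
            local_net.load_state_dict(state)


    if __name__ == "__main__":
        mp.set_start_method("spawn", force=True)
        q = mp.Queue()
        procs = [mp.Process(target=trainer, args=(q, 5)),
                 mp.Process(target=worker, args=(q,))]
        for p in procs:
            p.start()
        for p in procs:
            p.join()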