Synchronization for sharing/updating a shared model state dict across multiple processes

Hello, I’m working on an RL project and need a way to share a model’s state dict across multiple processes.

I found an A3C implementation that has this feature, but I have some questions about it.

  1. In the above repo, the author creates the model instance in CPU shared memory, and all of the subprocesses share it. Each subprocess also has its own model instance on the GPU, and it loads the state from the globally shared CPU model without any lock.
    So my question is: is it guaranteed to be safe to access the parameters of the model in CPU shared memory simultaneously from multiple processes without any lock or synchronization?
    I’m fairly sure it’s fine, since the model parameters are treated as read-only here, but I just want to double-check.

  2. The next question is my main interest. Here, the author simply copies the gradients of the sub-process model into the shared model without any lock, and the optimization step is also done without synchronization. How is this possible? Why doesn’t everything get messed up when multiple processes overwrite the gradients of the shared model without any locking? (A rough sketch of what I mean follows this list.)
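
For reference, here is a rough sketch of the pattern I’m describing, as I understand it. The names (`copy_grads_to_shared`, `worker`) and the toy `nn.Linear` model are mine, and I use a plain per-process Adam instead of the repo’s shared optimizer, so this only approximates the actual code:

```python
import torch
import torch.nn as nn
import torch.multiprocessing as mp


def copy_grads_to_shared(local_model, shared_model):
    # Copy gradients from the process-local model into the shared CPU model.
    # No lock is taken, so concurrent writers can interleave (Hogwild-style).
    for local_p, shared_p in zip(local_model.parameters(),
                                 shared_model.parameters()):
        grad = local_p.grad.detach().cpu()
        if shared_p.grad is None:
            shared_p.grad = grad.clone()
        else:
            shared_p.grad.copy_(grad)


def worker(shared_model, device):
    local_model = nn.Linear(4, 2).to(device)
    # The optimizer points at the *shared* parameters, so step() updates them
    # in place; the actual repo also shares the optimizer state, which this
    # sketch does not.
    optimizer = torch.optim.Adam(shared_model.parameters(), lr=1e-3)
    for _ in range(10):
        # Pull the latest shared weights without a lock (read-only access).
        local_model.load_state_dict(shared_model.state_dict())
        loss = local_model(torch.randn(8, 4, device=device)).pow(2).mean()
        local_model.zero_grad()
        loss.backward()
        copy_grads_to_shared(local_model, shared_model)
        optimizer.step()


if __name__ == "__main__":
    mp.set_start_method("spawn")
    shared_model = nn.Linear(4, 2)
    shared_model.share_memory()  # move the parameters into shared memory
    device = "cuda" if torch.cuda.is_available() else "cpu"
    workers = [mp.Process(target=worker, args=(shared_model, device))
               for _ in range(2)]
    for p in workers:
        p.start()
    for p in workers:
        p.join()
```

As far as I understand, `optimizer.step()` in each worker writes directly into the shared-memory storage of the parameters, which is why the other workers see the updates.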

If I’ve missed something, please let me know :frowning:


Okay, I found that there is already a well-documented answer at https://pytorch.org/docs/stable/notes/multiprocessing.html.
And in the Hogwild! spirit, it’s fine for one process to overwrite gradients written by another.
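
For anyone else reading, the structure described in those notes is roughly the following (I’ve filled in a toy model, data, and optimizer myself, so treat this as a paraphrase rather than the exact example from the docs):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.multiprocessing as mp


def train(model):
    # Every process trains directly on the shared model. optimizer.step()
    # writes the shared parameters in place, with no lock (Hogwild-style).
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
    for _ in range(100):
        data, target = torch.randn(16, 4), torch.randn(16, 2)
        optimizer.zero_grad()
        F.mse_loss(model(data), target).backward()
        optimizer.step()


if __name__ == "__main__":
    model = nn.Linear(4, 2)
    model.share_memory()  # put the model parameters in shared memory
    processes = [mp.Process(target=train, args=(model,)) for _ in range(4)]
    for p in processes:
        p.start()
    for p in processes:
        p.join()
```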

So the only thing I wonder about now is how updating the shared model parameters can be done without any lock mechanism. Are add_ or addcdiv_ atomic operations?
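
To illustrate what I’m asking, here is a minimal test sketch (the process and iteration counts are arbitrary): several processes call add_ on the same shared tensor. If add_ were atomic, the result would always equal the total number of increments; if it is an ordinary read-modify-write, some increments can be lost:

```python
import torch
import torch.multiprocessing as mp


def bump(shared, n_steps):
    # Plain in-place add on the shared tensor: one read-modify-write per call,
    # with no locking between processes.
    for _ in range(n_steps):
        shared.add_(1.0)


if __name__ == "__main__":
    shared = torch.zeros(1)
    shared.share_memory_()  # place the tensor in shared memory

    n_procs, n_steps = 4, 100_000
    procs = [mp.Process(target=bump, args=(shared, n_steps))
             for _ in range(n_procs)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()

    # With atomic increments this would always print 400000.0; lost updates
    # from racing processes would make it smaller.
    print(shared.item(), "expected", n_procs * n_steps)
```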


I have the same question, and I think it’s not an atomic operation.

I found a possible answer. @Yanzzzzz @wwiiiii

And I found another related answer.