Torch multiprocessing update semantics for CPU and GPU

I can’t find any documentation summarizing the update and memory-model semantics of tensors shared through torch.multiprocessing.

Let’s consider the Hogwild mini-example from the docs.

Suppose that the model resides in shared CPU memory. When the child processes invoke optimizer.step, they asynchronously update the shared parameter values using gradients computed within each process. If the optimizer is plain SGD, this boils down to an in-place sub_ (or something similar) on the parameter data.
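To make "boils down to an in-place update" concrete, here is a minimal single-process sketch. It assumes plain SGD (no momentum, no weight decay) and compares optimizer.step against a hand-written in-place `add_` with a negative step size. The model shape, learning rate, and variable names are placeholders for illustration, and the sketch does not exercise shared memory or multiple processes.

```python
import torch

torch.manual_seed(0)
lr = 0.1  # illustrative learning rate

# Two identical copies of a tiny model: one updated by the optimizer, one by hand.
model_a = torch.nn.Linear(4, 1)
model_b = torch.nn.Linear(4, 1)
model_b.load_state_dict(model_a.state_dict())

x = torch.randn(8, 4)
y = torch.randn(8, 1)

# Compute the same gradients for both copies.
for model in (model_a, model_b):
    loss = torch.nn.functional.mse_loss(model(x), y)
    loss.backward()

# Path 1: let plain SGD apply the update.
torch.optim.SGD(model_a.parameters(), lr=lr).step()

# Path 2: apply the update by hand as an in-place op on each parameter.
with torch.no_grad():
    for p in model_b.parameters():
        p.add_(p.grad, alpha=-lr)  # p <- p - lr * grad

params_match = all(
    torch.allclose(pa, pb)
    for pa, pb in zip(model_a.parameters(), model_b.parameters())
)
print(params_match)
```

If the two paths agree, the question reduces to what guarantees that one in-place op has when several processes run it on the same storage.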

Is this subtraction atomic? That is, could we lose writes when child processes contend for the shared parameter memory?
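To be clear about what I mean by a lost write, here is a deterministic toy simulation in plain Python. It does not use torch and makes no claim about what PyTorch actually does; the function names and numbers are made up for illustration. It only contrasts updates applied one after another with updates where every worker reads the parameter before any worker writes it back.

```python
def serialized_updates(w, lr, grads):
    """Each worker reads the latest value, then writes: no update is lost."""
    for g in grads:
        w = w - lr * g
    return w


def interleaved_updates(w, lr, grads):
    """Every worker reads the same stale value before any write lands.

    Each write overwrites the previous one, so only the last update survives.
    """
    stale_reads = [w for _ in grads]  # all reads happen first
    for seen, g in zip(stale_reads, grads):
        w = seen - lr * g  # write based on the stale read
    return w


w0, lr, grads = 8.0, 0.5, [2.0, 4.0]
print(serialized_updates(w0, lr, grads))   # 8 - 1 - 2 = 5.0
print(interleaved_updates(w0, lr, grads))  # 8 - 2 = 6.0, first update lost
```

The second result is what I would expect to see occasionally if the read-modify-write on a shared parameter is not atomic.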

More pressingly for my use case, suppose the shared parameters are all CUDA tensors on the same device. Do we still get atomicity?