I want to train an ensemble of networks on a single GPU. The dataset and the model are small enough for this. When computing the gradients, I get wrong results (i.e. different numerical values than if I worked with a single process).
This is my setup:
I use torch.multiprocessing to spawn new processes. The main nn.Module is created in the main process and passed on to the subprocesses. Since all parameters are then still pointing to the same Cuda.{...}Tensors, I assign copied versions of the Parameters as follows in each process:
from torch.nn import Parameter

# Copy parameters so that independent processes have independent parameters
for mod in model.modules():
    for name, parameter in mod._parameters.items():
        if parameter is None:
            continue
        var_clone = parameter.clone()
        mod.register_parameter(name, Parameter(var_clone.data))
If I didn’t do that, the parameters would be shared between the processes. However, I want an ensemble of networks with independent parameters (not hogwild).
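For reference, the overall setup looks roughly like the sketch below. MyModel, the worker function train() and the ensemble size are placeholders of mine, not my actual code; the parameter-copy loop above runs inside each worker, and the "spawn" start method is needed because CUDA tensors cross the process boundary:

import torch.multiprocessing as mp

def train(rank, model):
    # The parameter-copy loop shown above runs here, so this process
    # ends up with its own, independent Parameters.
    ...

if __name__ == "__main__":
    mp.set_start_method("spawn")   # required when passing CUDA tensors to subprocesses
    model = MyModel().cuda()       # MyModel stands in for my actual architecture
    processes = []
    for rank in range(4):          # e.g. an ensemble of four networks
        p = mp.Process(target=train, args=(rank, model))
        p.start()
        processes.append(p)
    for p in processes:
        p.join()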
When computing gradients, the parameters still seem to share their .grad information across processes, which leads to a race condition: sometimes the gradient information mixes between processes and wrong gradients are computed.
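To make the symptom concrete: a comparison helper like the one below (my own sketch, assuming a PyTorch version where torch.allclose is available) reports mismatches when the same batch is pushed through a worker's model and through a single-process reference copy:

import torch

def grads_match(model_a, model_b, atol=1e-6):
    # Compare gradients parameter by parameter; the name is only used for reporting.
    for (name, p_a), (_, p_b) in zip(model_a.named_parameters(),
                                     model_b.named_parameters()):
        if p_a.grad is None or p_b.grad is None:
            continue
        if not torch.allclose(p_a.grad, p_b.grad, atol=atol):
            print("Gradient mismatch in", name)
            return False
    return True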
According to the documentation at http://pytorch.org/docs/master/notes/multiprocessing.html, only Cuda.{...}Tensors are shared between processes when using multiprocessing. Here, however, I seem to encounter shared autograd.Variables. Am I missing something?
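A quick way to sanity-check the copy itself inside a subprocess would be to compare storage pointers before and after re-registering; this is only a diagnostic sketch of mine, and the same comparison could be repeated for the .grad buffers after a backward pass:

# Record pointers before the copy loop, then verify afterwards that every
# Parameter lives in freshly allocated storage.
old_ptrs = {name: p.data.data_ptr() for name, p in model.named_parameters()}

# ... run the parameter-copy loop from above ...

for name, p in model.named_parameters():
    assert p.data.data_ptr() != old_ptrs[name], name + " still aliases the old storage"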
(Sidenote: There are ways to circumvent this, e.g. by re-creating the main nn.Module in each process, so the parameters can never be shared. However, this would require a major restructuring of my code.)
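For completeness, that workaround would look roughly like this; ModelClass and the worker body are placeholders for my actual code, and only a CPU state_dict is passed so that no CUDA storage is ever shared:

import torch.multiprocessing as mp

def train(rank, state_dict):
    # Build a fresh module in this process; nothing is shared with the parent.
    model = ModelClass()                  # placeholder for my architecture
    model.load_state_dict(state_dict)     # all ensemble members start from the same weights
    model.cuda()
    # ... independent training loop here ...

if __name__ == "__main__":
    mp.set_start_method("spawn")
    cpu_state = {k: v.cpu() for k, v in ModelClass().state_dict().items()}
    workers = [mp.Process(target=train, args=(rank, cpu_state)) for rank in range(4)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()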