I want to train an ensemble of networks on a single GPU. The dataset and the model are small enough for this. When computing the gradients, I get wrong results (i.e. different numerical values than if I worked with a single process).
This is my setup:
I use torch.multiprocessing to spawn new processes. The main nn.Module is created in the main process and passed on to the subprocesses. Since all parameters are then still pointing to the same Cuda.{...}Tensors, I assign copied versions of the Parameters as follows in each process:
from torch.nn import Parameter

# Copy parameters so that independent processes have independent parameters
for mod in model.modules():
    for name, parameter in mod._parameters.items():
        if parameter is None:
            continue
        var_clone = parameter.clone()
        mod.register_parameter(name, Parameter(var_clone.data))
If I didn’t do that, the parameters would be shared between the processes. However, I want an ensemble of networks with independent parameters (not hogwild).
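For reference, the overall setup looks roughly like the sketch below. MyModel, the worker function train() and the ensemble size are placeholders of mine, not my actual code; the parameter-copy loop above runs inside each worker, and the "spawn" start method is needed because CUDA tensors cross the process boundary:

import torch.multiprocessing as mp

def train(rank, model):
    # The parameter-copy loop shown above runs here, so this process
    # ends up with its own, independent Parameters.
    ...

if __name__ == "__main__":
    mp.set_start_method("spawn")   # required when passing CUDA tensors to subprocesses
    model = MyModel().cuda()       # MyModel stands in for my actual architecture
    processes = []
    for rank in range(4):          # e.g. an ensemble of four networks
        p = mp.Process(target=train, args=(rank, model))
        p.start()
        processes.append(p)
    for p in processes:
        p.join()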
When computing gradients, the parameters still seem to share their .grad information across processes, which leads to a race condition: sometimes the gradient information mixes between processes and wrong gradients are computed.
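To make the symptom concrete: a comparison helper like the one below (my own sketch, assuming a PyTorch version where torch.allclose is available) reports mismatches when the same batch is pushed through a worker's model and through a single-process reference copy:

import torch

def grads_match(model_a, model_b, atol=1e-6):
    # Compare gradients parameter by parameter; the name is only used for reporting.
    for (name, p_a), (_, p_b) in zip(model_a.named_parameters(),
                                     model_b.named_parameters()):
        if p_a.grad is None or p_b.grad is None:
            continue
        if not torch.allclose(p_a.grad, p_b.grad, atol=atol):
            print("Gradient mismatch in", name)
            return False
    return True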
According to the documentation at http://pytorch.org/docs/master/notes/multiprocessing.html, only Cuda.{...}Tensors are shared between processes when using multiprocessing. Here, however, I seem to encounter shared autograd.Variables. Am I missing something?
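A quick way to sanity-check the copy itself inside a subprocess would be to compare storage pointers before and after re-registering; this is only a diagnostic sketch of mine, and the same comparison could be repeated for the .grad buffers after a backward pass:

# Record pointers before the copy loop, then verify afterwards that every
# Parameter lives in freshly allocated storage.
old_ptrs = {name: p.data.data_ptr() for name, p in model.named_parameters()}

# ... run the parameter-copy loop from above ...

for name, p in model.named_parameters():
    assert p.data.data_ptr() != old_ptrs[name], name + " still aliases the old storage"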
(Sidenote: There are ways to circumvent this, e.g. by re-creating the main nn.Module in each process, so the parameters can never be shared. However, this would require a major restructuring of my code.)
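For completeness, that workaround would look roughly like this; ModelClass and the worker body are placeholders for my actual code, and only a CPU state_dict is passed so that no CUDA storage is ever shared:

import torch.multiprocessing as mp

def train(rank, state_dict):
    # Build a fresh module in this process; nothing is shared with the parent.
    model = ModelClass()                  # placeholder for my architecture
    model.load_state_dict(state_dict)     # all ensemble members start from the same weights
    model.cuda()
    # ... independent training loop here ...

if __name__ == "__main__":
    mp.set_start_method("spawn")
    cpu_state = {k: v.cpu() for k, v in ModelClass().state_dict().items()}
    workers = [mp.Process(target=train, args=(rank, cpu_state)) for rank in range(4)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()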