I only update the parameters of my model after several iterative forward and backward passes, which means that DataParallel does not need to replicate self.module on every forward pass: it only needs to accumulate the gradients on the other GPUs when I am about to update the parameters with a call to the optimizer's step().
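A minimal sketch of that training pattern, with a toy model and random data standing in for the actual model in the repository below: gradients accumulate across several backward calls, and step() is invoked only once.

```python
import torch
import torch.nn as nn

# Toy placeholders for illustration only (not the actual transformer model).
model = nn.Linear(4, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

optimizer.zero_grad()
for _ in range(4):              # several iterative forward/backward passes
    x = torch.randn(8, 4)
    loss = model(x).sum()
    loss.backward()             # gradients accumulate into each .grad
optimizer.step()                # parameters updated once, after accumulation
```

The replicas only need fresh parameter values at (or after) that single step() call, which is why re-replicating on every forward pass looks wasteful here.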
So, is there any way to separate the replication step from the gradient collection in DataParallel? Replicating on every forward pass greatly hurts performance in my implementation (https://github.com/anoidgit/transformer).
I tried to achieve this with the following code:
class DataParallelModel(DataParallel):

    def __init__(self, module, device_ids=None, output_device=None, dim=0, host_control=True):
        super(DataParallelModel, self).__init__(module, device_ids, output_device, dim)
        if host_control:
            # build the replicas once up front instead of on every forward pass
            self.nets = self.update_replicates()
        else:
            self.nets = None

    def forward(self, *inputs, **kwargs):
        if not self.device_ids:
            return self.module(*inputs, **kwargs)
        inputs, kwargs = self.scatter(inputs, kwargs, self.device_ids)
        if len(self.device_ids) == 1:
            return self.module(*inputs[0], **kwargs[0])
        if self.nets is None:
            # stock DataParallel behaviour: replicate on every forward pass
            replicas = self.replicate(self.module, self.device_ids[:len(inputs)])
        else:
            # reuse the cached replicas built in __init__
            replicas = self.nets[:len(inputs)]
        outputs = self.parallel_apply(replicas, inputs, kwargs)
        return self.gather(outputs, self.output_device)

    def update_replicates(self):
        return self.replicate(self.module, self.device_ids)
But an assertion failure was thrown by autograd.