Hi,
There is a shared tensor (say A) in my model, which is computed at the beginning of each training epoch.
I do not want A to be computed in parallel, but the rest of my model should run in parallel.
To do this, I compute A and register it on the model instance every epoch, so my training loop looks like the code below:

def run_epoch(self):
    self.model.A = self.compute_A()
    result = self.model(input)  # self.model is an nn.DataParallel instance and uses A in this call

However, I get the error below:

RuntimeError: arguments are located on different GPUs at /pytorch/aten/src/THC/generic/THCTensorMathBlas.cu:253

I think computing A separately on each GPU would solve this problem, but that repeats the same computation on every device, and I do not want to waste computing time. Is there a more elegant way to solve it?

If I understood correctly, you compute a single tensor once per training and then reuse it in each iteration. Why don’t you compute it outside the model and use DataParallel over the rest?

Hmm, so if I understood correctly, you have a tensor A that is computed once per epoch. You need A inside your model, but you don’t want it to be computed in parallel.

I found a simple way of doing it.
Depending on the role A plays in your network, you can wrap your network and keep A external, like this:

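class WrappedNet(nn.Module):
    def __init__(self, A, model):
        super().__init__()
        self.A = A
        self.my_real_model = nn.DataParallel(model)

    def forward(self, inputs):
        # A is passed to the parallel model as an extra argument
        return self.my_real_model(inputs, self.A)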

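The drawback of this approach is that A has to be copied to every GPU. Also note that nn.DataParallel splits tensor arguments along dim 0, so an A passed this way is chunked across the GPUs just like inputs unless it is replicated some other way.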
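One way to get a full copy of A onto every replica while still computing it only once is to register A as a buffer on the wrapped module, since nn.DataParallel broadcasts a module's buffers to each device on every forward call. The following is only a minimal sketch under some assumptions: ToyModel and compute_A are hypothetical stand-ins for the real model and the once-per-epoch computation, and A is assumed not to need gradients.

```python
import torch
import torch.nn as nn


class ToyModel(nn.Module):
    """Hypothetical stand-in for the real model; it multiplies inputs by A."""

    def __init__(self, dim):
        super().__init__()
        # Buffers are replicated to every device by nn.DataParallel,
        # so each replica sees a full copy of A on its own GPU.
        self.register_buffer("A", torch.eye(dim))
        self.linear = nn.Linear(dim, dim)

    def forward(self, inputs):
        return self.linear(inputs @ self.A)


def compute_A(dim):
    # Placeholder for the once-per-epoch computation from the question.
    return 2.0 * torch.eye(dim)


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = nn.DataParallel(ToyModel(4).to(device))

for epoch in range(2):
    # Compute A once, on one device, then overwrite the buffer in place.
    with torch.no_grad():
        model.module.A.copy_(compute_A(4).to(device))
    result = model(torch.randn(8, 4, device=device))
```

The buffer lives on model.module (the wrapped network), not on the nn.DataParallel wrapper itself, so the replicas created during the forward pass actually receive it.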
If you only use A in a very simple way, you can move that operation outside the parallel part and apply it at the beginning:

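class WrappedNet(nn.Module):
    def __init__(self, A, model):
        super().__init__()
        self.A = A
        self.my_real_model = nn.DataParallel(model)

    def forward(self, inputs):
        # do_something_with_A stands for whatever operation uses A; it runs once, outside DataParallel
        do_something_with_A_output = do_something_with_A(inputs, self.A)
        return self.my_real_model(do_something_with_A_output)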
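To make the second wrapper concrete, here is a small runnable sketch. It assumes the operation involving A is a single matrix product and that the parallel part is a plain nn.Linear; both are hypothetical choices for illustration, not details from the original model.

```python
import torch
import torch.nn as nn


def do_something_with_A(inputs, A):
    # Hypothetical operation: a single matrix product with A.
    return inputs @ A


class WrappedNet(nn.Module):
    def __init__(self, A, model):
        super().__init__()
        self.A = A
        self.my_real_model = nn.DataParallel(model)

    def forward(self, inputs):
        # Runs once, on the device that holds both inputs and A.
        mixed = do_something_with_A(inputs, self.A)
        # Only this call is split across the available GPUs.
        return self.my_real_model(mixed)


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
A = torch.eye(4, device=device)
net = WrappedNet(A, nn.Linear(4, 2).to(device))
out = net(torch.randn(8, 4, device=device))
```

Because A never enters the nn.DataParallel call, it stays on a single device and is neither split nor copied; the trade-off is that the operation on A is not parallelized.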