Multi-GPU training errors

I have a two-stream CNN. One stream is on cuda:0 and outputs feature f0; the other is on cuda:1 and outputs feature f1. Both features are moved to the CPU with f0 = f0.cpu() and f1 = f1.cpu(), and the loss = L2(f0, f1) is then computed on the CPU.
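
Roughly, the setup is equivalent to the sketch below (the layers, shapes, and names are placeholders, not my actual model):

```python
import torch
import torch.nn as nn
from torch.autograd import Variable

# Two independent streams, one per GPU (architectures are placeholders)
stream0 = nn.Sequential(nn.Linear(128, 64), nn.ReLU()).cuda(0)
stream1 = nn.Sequential(nn.Linear(128, 64), nn.ReLU()).cuda(1)

x0 = Variable(torch.randn(8, 128).cuda(0))
x1 = Variable(torch.randn(8, 128).cuda(1))

f0 = stream0(x0).cpu()          # feature from cuda:0, moved to CPU
f1 = stream1(x1).cpu()          # feature from cuda:1, moved to CPU

loss = ((f0 - f1) ** 2).mean()  # L2 loss computed on the CPU
loss.backward()                 # the error below is raised at this point
```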

In the backward pass, the following error is reported:

File "bilinear.py", line 585, in <module>
  train(i)
File "bilinear.py", line 517, in train
  loss.backward()
File "/home/zhengyun.zy/anaconda/lib/python2.7/site-packages/torch/autograd/variable.py", line 146, in backward
  self._execution_engine.run_backward((self,), (gradient,), retain_variables)
File "/home/zhengyun.zy/anaconda/lib/python2.7/site-packages/torch/nn/_functions/thnn/auto.py", line 175, in backward
  update_grad_input_fn(self.backend.library_state, input, grad_output, grad_input, *gi_args)
RuntimeError: Assertion `THCTensor_(checkGPU)(state, 3, input, gradInput, gradOutput)' failed. Some of weight/gradient/input tensors are located on different GPUs. Please move them to a single one. at /py/conda-bld/pytorch_1493676237139/work/torch/lib/THCUNN/generic/Threshold.cu:49

I think you should use nn.DataParallel instead.
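
For reference, here is a minimal nn.DataParallel sketch (the module, shapes, and device ids are only illustrative): a single module is replicated on the listed GPUs, the input batch is scattered across them, and the outputs are gathered back on one device, so no tensor in the backward pass is split across GPUs.

```python
import torch
import torch.nn as nn
from torch.autograd import Variable

# Single module wrapped in DataParallel; replicas run on cuda:0 and cuda:1,
# inputs are scattered across them, outputs are gathered on output_device.
model = nn.Sequential(nn.Linear(128, 64), nn.ReLU())
model = nn.DataParallel(model, device_ids=[0, 1], output_device=0).cuda()

x = Variable(torch.randn(8, 128).cuda())
out = model(x)                   # output lives on cuda:0
loss = (out ** 2).mean()
loss.backward()                  # gradients accumulate on the cuda:0 parameters
```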