DataParallel does not work in RL algorithms

I wanted to test multi-GPU training with a simple DQN algorithm on the CartPole environment, and it seems that either this is not possible or I am missing something. I run the code on a machine with two GPUs. My code is here :

The problem is that in each training step I need to obtain the target value, which comes from a DataParallel module, multiply it with some elements of the batch (a tensor), and then use the result to compute the loss. When I multiply the target value with the tensor, I get this error:
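For context, here is a minimal sketch of the kind of target computation I mean. The network outputs are replaced by random stand-in tensors, and the names (`next_q`, `rewards`, `dones`) are hypothetical placeholders for the real DQN quantities; on a single device every tensor in this product has the full batch size of 128, so it works:

```python
import torch
import torch.nn.functional as F

batch = 128
gamma = 0.99

# Stand-ins for target_net(next_states).max(1) and replay-buffer samples.
next_q = torch.randn(batch, 2).max(1).values   # shape (128,)
rewards = torch.randn(batch)                   # shape (128,)
dones = torch.zeros(batch)                     # shape (128,)

# Standard DQN target: r + gamma * max_a' Q(s', a') for non-terminal states.
# This elementwise multiply is where the shape mismatch appears once
# DataParallel hands back a (2, 64) target instead of (128,).
target = rewards + gamma * next_q * (1 - dones)

q = torch.randn(batch, 2).gather(1, torch.zeros(batch, 1, dtype=torch.long)).squeeze(1)
loss = F.mse_loss(q, target)
```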

RuntimeError: The size of tensor a (64) must match the size of tensor b (128) at non-singleton dimension 1

Since DataParallel has split the batch into two parts, my target has shape 2×64 instead of 128. I can reshape the other tensor to 2×64 and make it work, but that will break if I use more GPUs, and the reshaping would become hard and messy. I think there should be a better way to do this.
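One workaround I have considered (a sketch, assuming the target really comes back stacked as `(num_gpus, per_gpu_batch)`) is to flatten the leading dimensions with `view(-1)`, which gives the same line of code regardless of how many GPUs the batch was split across:

```python
import torch

# Hypothetical shapes: with 2 GPUs the per-GPU targets come back
# stacked as (2, 64) instead of the flat batch of 128.
num_gpus, per_gpu_batch = 2, 64
target = torch.randn(num_gpus, per_gpu_batch)

# view(-1) flattens all leading dimensions in one step, so the same
# line works for any GPU count (e.g. 4 GPUs -> (4, 32) -> (128,)).
target = target.view(-1)
print(target.shape)  # torch.Size([128])
```

But this feels like it is papering over the real issue, which is presumably that DataParallel is gathering my outputs along the wrong dimension in the first place.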
I would appreciate any help or comments.