Arguments are located on different GPUs; register_buffer did not solve it

position: cuda:0
mask    : cuda:0
input   : cuda:0
position: cuda:0
mask    : cuda:1
input   : cuda:1

It turns out that after I revised the code, position and mask end up on different CUDA devices, even though I launch the script with
CUDA_VISIBLE_DEVICE=1 python3.6 train.py.
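
One thing that may be worth checking first: the variable CUDA reads is spelled CUDA_VISIBLE_DEVICES (plural), so the singular spelling in the command above would be silently ignored and both GPUs would stay visible. Below is a minimal sketch of a sanity check for this, using only the standard library. The helper names parse_visible_devices and check_env are hypothetical and not part of PyTorch or CUDA.

```python
import os


def parse_visible_devices(env):
    """Return the device ids listed in CUDA_VISIBLE_DEVICES, or None if it is unset."""
    # Only the plural name is consulted; a singular spelling has no effect.
    if "CUDA_VISIBLE_DEVICES" not in env:
        return None
    raw = env["CUDA_VISIBLE_DEVICES"]
    return [d.strip() for d in raw.split(",") if d.strip()]


def check_env(env):
    """Hypothetical helper: describe what the environment actually restricts."""
    devices = parse_visible_devices(env)
    if devices is None and "CUDA_VISIBLE_DEVICE" in env:
        return "misspelled: CUDA_VISIBLE_DEVICE is set but ignored; all GPUs stay visible"
    if devices is None:
        return "unset: all GPUs are visible"
    return "restricted to: " + ",".join(devices)


if __name__ == "__main__":
    # Call this at the top of train.py, before any CUDA work starts.
    print(check_env(os.environ))
```

If the restriction takes effect, the single visible GPU is numbered cuda:0 inside the process, so a cuda:1 in the printout above suggests the process still sees more than one device. Independently of that, moving position with position.to(input.device) inside forward is a common way to keep it on the same device as the input.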