I have tested a few cases and found that if you use DataParallel to wrap a model, the input tensor can live on the CPU or on any GPU, but the model's parameters must be on dev0 (the first entry of device_ids). You can call to('cuda:x') either before or after DataParallel, as long as x equals dev0. Here is another example, which uses set_device to change the current device and then calls model.cuda() after DataParallel.
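A minimal sketch of that pattern (the model, sizes, and device ids are illustrative; the CPU branch is only there so the snippet also runs on machines without two GPUs):

```python
import torch
import torch.nn as nn

# A tiny illustrative model.
model = nn.Linear(10, 2)

if torch.cuda.device_count() >= 2:
    # Make cuda:1 the current device, wrap, then call .cuda() with no
    # argument -- it moves the parameters to the *current* device, which
    # must equal device_ids[0] (cuda:1 here).
    torch.cuda.set_device(1)
    model = nn.DataParallel(model, device_ids=[1, 0])
    model.cuda()
    x = torch.randn(4, 10)  # input may stay on the CPU; DataParallel scatters it
else:
    # Without GPUs, DataParallel simply runs the wrapped module as-is.
    model = nn.DataParallel(model)
    x = torch.randn(4, 10)

out = model(x)
print(out.shape)  # torch.Size([4, 2])
```

The key point is that a bare model.cuda() targets whatever the current device is, so set_device must be called first to keep it consistent with device_ids[0].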