How do PyTorch's parallel and distributed methods work?

You can actually use a different device as the default (output) device for data parallelism. However, it must be one of the GPUs that DataParallel is already using. You can set it via the output_device parameter.
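A minimal sketch of what this looks like (assuming a machine with at least three GPUs; the model and tensor shapes are placeholders):

```python
import torch
import torch.nn as nn

# DataParallel scatters each batch across device_ids, runs the replicas in
# parallel, and gathers the outputs onto output_device -- which must be one
# of the GPUs listed in device_ids.
model = nn.Linear(128, 10).to("cuda:0")   # parameters must live on device_ids[0]
dp_model = nn.DataParallel(
    model,
    device_ids=[0, 1, 2],  # GPUs used for the parallel replicas
    output_device=1,       # gathered outputs land on cuda:1 instead of cuda:0
)

x = torch.randn(64, 128, device="cuda:0")  # inputs start on the first listed device
out = dp_model(x)                          # out.device is cuda:1
loss = out.sum()                           # so the loss is computed on cuda:1
loss.backward()                            # gradients are reduced back onto cuda:0
```

Note that output_device only controls where the gathered outputs (and anything you compute from them, such as the loss) end up; the model's parameters and gradients still live on device_ids[0].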

As for using a GPU that is not wrapped by DataParallel as the output device, that is currently not possible (see the discussion here: Uneven GPU utilization during training backpropagation).