You actually can use a different device as the default device for data parallelism. However, it must be one of the GPUs that DataParallel itself uses. You can set it via the output_device parameter of nn.DataParallel, which controls where the outputs of the replicas are gathered (and thus where the loss is typically computed).
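
A minimal sketch of what that could look like (the toy nn.Linear model, tensor shapes, and GPU indices are just placeholders):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# DataParallel requires the module's parameters to live on device_ids[0].
model = nn.Linear(10, 2).to("cuda:0")

# Replicate across GPUs 0 and 1, but gather the outputs on cuda:1
# instead of the default (device_ids[0], i.e. cuda:0).
model = nn.DataParallel(model, device_ids=[0, 1], output_device=1)

x = torch.randn(8, 10, device="cuda:0")
out = model(x)  # out now lives on cuda:1

# The loss is therefore computed on cuda:1, shifting some of the
# memory/compute load away from GPU 0.
target = torch.randint(0, 2, (8,), device="cuda:1")
loss = F.cross_entropy(out, target)
loss.backward()
```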
Regarding using a GPU that is not wrapped by DataParallel: that's currently not possible (see the discussion here: Uneven GPU utilization during training backpropagation).