I’m no expert in distributed systems or CUDA, but there is one really interesting feature that PyTorch supports: nn.DataParallel and nn.DistributedDataParallel. How are they actually implemented? How do they separate common embeddings and synchronize data?
Here is a basic example of DataParallel.
import torch
import torch.nn as nn
import numpy as np

class Model(nn.Module):
    def __init__(self):
        super().__init__()
        self.embedding = nn.Embedding(1000, 10)
        self.rnn = nn.Linear(10, 10)

    def forward(self, x):
        x = self.embedding(x)
        x = self.rnn(x)
        return x

model = nn.DataParallel(Model().cuda())
model(torch.from_numpy(np.array([1, 2, 3, 4, 5, 6], dtype=np.int64)).cuda()).cpu()
PyTorch can split the input, send the chunks to multiple GPUs, and merge the results back.
How does it manage embeddings and synchronization for a parallel model or a distributed model?
I wandered around PyTorch’s code but it’s very hard to know how the fundamentals work.
I am not sure about DistributedDataParallel, but in DataParallel each GPU gets a copy of the model, so the parallelization is done by splitting the minibatches, not the layers/weights.
Here’s a sketch of how DataParallel works, assuming 4 GPUs where GPU:0 is the default GPU:
1. Split the mini-batch into chunks and scatter one chunk to each GPU.
2. Copy (replicate) the model to each GPU.
3. Run the forward pass on each GPU for its chunk of the mini-batch.
4. Gather the outputs from all GPUs onto the default GPU.
5. Compute the loss with respect to the network outputs on the default GPU and scatter the losses back to the individual GPUs to compute the gradients with respect to the leaf nodes.
6. Reduce the gradients on the default GPU and update the model parameters.
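For intuition, these building blocks are exposed as functions in torch.nn.parallel (scatter, replicate, parallel_apply, gather). Below is a minimal sketch of one forward pass, not the actual DataParallel source; it assumes model already lives on GPU:0 and batch is a single input tensor:

from torch.nn.parallel import scatter, replicate, parallel_apply, gather

device_ids = [0, 1, 2, 3]
output_device = 0

inputs = scatter(batch, device_ids)         # step 1: split the mini-batch across the GPUs
replicas = replicate(model, device_ids)     # step 2: copy the model to each GPU
outputs = parallel_apply(replicas, inputs)  # step 3: run the forward passes in parallel
result = gather(outputs, output_device)     # step 4: collect all outputs on the default GPU
# steps 5/6: the loss, backward pass, and update then proceed from `result` on GPU:0 as usual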
Why gather all the outputs onto GPU:0 for step 5, when it merely computes the distance between output_i and target_i? Why not compute it within each GPU separately?
One possible reason: all the targets (ground truth) are stored only on GPU:0. However, the targets could also be scattered along with the mini-batch, just like the data chunks in step 1.
It’s more like an implementation side-effect. In fact, you can do what you proposed, but you would have to rewrite your code then to compute the loss inside the model – usually, the loss is computed in the training loop based on the outputs of the model.
I mean, it’s not a problem at all to do what you propose, but it’s a bit less convenient, because when you decide to use DataParallel (occasionally), you would have to modify your model code as well.
Also, computing the loss is super cheap, so there is not much speed gain in distributing it across the devices, I suppose.
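To make that concrete, here is a minimal sketch of the proposed alternative, computing the loss inside the model so that each GPU handles its own chunk; ModelWithLoss and base_model are made-up names for this illustration:

import torch.nn as nn

class ModelWithLoss(nn.Module):
    # Hypothetical wrapper: the loss is computed inside forward(), so under
    # nn.DataParallel each GPU computes the loss for its own mini-batch chunk.
    def __init__(self, base_model):
        super().__init__()
        self.base_model = base_model
        self.criterion = nn.CrossEntropyLoss()

    def forward(self, features, targets):
        logits = self.base_model(features)
        return self.criterion(logits, targets)

# Usage sketch: DataParallel scatters both features and targets, each GPU
# returns its own scalar loss, and the gathered losses are averaged on GPU:0.
# parallel_model = nn.DataParallel(ModelWithLoss(model).cuda())
# loss = parallel_model(features, targets).mean()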
Sorry, I can’t agree with this explanation about saving the distributed cluster’s compute. In my opinion, moving data around is not cheaper than computing it within the chip.
Is there a link or chart that backs this up?
I don’t have a chart handy, but when I ran the code on my workstation (only 4x 1080 Tis), I didn’t notice any difference between letting the default GPU handle the loss computation versus computing the losses on the individual GPUs and then gathering them. Either way, I get the same ~3x speedup when using 4 GPUs instead of 1. This is a small hardware setup, and the scenario is probably different for large datacenters, like you said… Also, the way I implemented it,
model = MyModel(num_features=num_features, num_classes=num_classes)

if torch.cuda.device_count() > 1:
    print("Using", torch.cuda.device_count(), "GPUs")
    model = nn.DataParallel(model)
and then
for epoch in range(NUM_EPOCHS):
    model.train()
    for batch_idx, (features, targets) in enumerate(train_loader):
        features = features.to(DEVICE)
        targets = targets.to(DEVICE)

        ### FORWARD AND BACK PROP
        logits, probas = model(features)
        cost = cost_fn(logits, targets)
        optimizer.zero_grad()
        cost.backward()

        ### UPDATE MODEL PARAMETERS
        optimizer.step()
is basically such that I don’t have to rewrite my model when I decide to use DataParallel, which is most convenient for my use cases since I do not always run with DataParallel.
Thanks for the detailed explanation here.
How can we put the loss computation on another GPU (GPU:4) rather than GPU:0, which is used by default?
It would be great to have such a method, since many networks are designed to fill up ~11 GB of memory, and this holds you back when parallelizing such a network.
That doesn’t make sense to me. If the loss aggregation is the bottleneck, what’s the benefit of computing it on a different GPU? The addition shouldn’t get faster just because it runs on another device.
You actually can use a different device as the default device for data parallelism. However, it must be one of the GPUs that DataParallel also uses. You can set it via the output_device parameter.
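For example, a sketch with made-up device indices:

# Gather the outputs (and hence compute the loss) on cuda:1 instead of cuda:0.
model = nn.DataParallel(model, device_ids=[0, 1, 2, 3], output_device=1)
# The targets then need to live on the same device as the gathered outputs:
# targets = targets.to('cuda:1')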
As in the steps you posted earlier, step 5 says that the loss is computed with respect to the network outputs on the default GPU. My question is: are the losses computed for each replica eventually averaged or summed? I am not sure what exactly my loss function is printing.