Single-Process Multi-GPU is not the recommended mode for DDP


Hi everyone. I have run into a problem with PyTorch DDP on a single node with multiple GPUs.
My setup is as follows:

os.environ["MASTER_PORT"] = "9999" 
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3"
distributed_sampler =
torch_dataloader =,

model = torch.nn.parallel.DistributedDataParallel(model)

But this setup is slower than DataParallel, and I get the following message.

Warning Message

UserWarning: Single-Process Multi-GPU is not the recommended mode for DDP. 
In this mode, each DDP instance operates on multiple devices and creates multiple module replicas within one process. 
The overhead of scatter/gather and GIL contention in every forward pass can slow down training.
Please consider using one DDP instance per device or per module replica by explicitly setting device_ids or CUDA_VISIBLE_DEVICES. 


python: 3.7
pytorch: 1.7
GCP ml-engine image_uri:
gpu_type: complex_model_m_p100 (p100x4 on single node)

I hope someone can help with this problem. I would appreciate it.

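As the warning message says, it is better to use multiple processes for multi-GPU training, one process per GPU, even on a single node. You can use torch.multiprocessing.spawn(train_fn, args=(world_size,), nprocs=world_size) to launch the training in multiple processes. Each process then wraps its own model replica in DDP with device_ids set to its single GPU, which is what the warning asks for.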
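
For reference, here is a rough sketch of what the one-process-per-GPU setup could look like. This is not your actual training code: the Linear model, the random TensorDataset, the pick_backend helper, the MASTER_ADDR value, and the CPU fallback with two processes are all placeholders I made up for illustration. Only MASTER_PORT, the DistributedSampler/DataLoader/DDP pieces, and the mp.spawn call come from the posts above.

```python
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler


def pick_backend(use_cuda):
    # Hypothetical helper: NCCL is the usual choice for GPUs, Gloo works on CPU.
    return "nccl" if use_cuda else "gloo"


def train_fn(rank, world_size):
    # mp.spawn passes the process index as the first argument; it is used as the rank.
    use_cuda = torch.cuda.is_available()
    os.environ.setdefault("MASTER_ADDR", "localhost")  # assumption: single node
    os.environ.setdefault("MASTER_PORT", "9999")
    dist.init_process_group(pick_backend(use_cuda), rank=rank, world_size=world_size)

    if use_cuda:
        torch.cuda.set_device(rank)
        device = torch.device("cuda", rank)
    else:
        device = torch.device("cpu")

    # Placeholder model; substitute the real one.
    model = torch.nn.Linear(10, 1).to(device)
    # One device per DDP instance, so the single-process multi-GPU mode is not used.
    model = DDP(model, device_ids=[rank] if use_cuda else None)

    # Placeholder data; the sampler gives each process its own shard.
    dataset = TensorDataset(torch.randn(64, 10), torch.randn(64, 1))
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
    loader = DataLoader(dataset, batch_size=8, sampler=sampler)

    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = torch.nn.MSELoss()
    for epoch in range(2):
        sampler.set_epoch(epoch)  # vary the shuffling between epochs
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            optimizer.zero_grad()
            loss_fn(model(x), y).backward()  # gradients are all-reduced across processes
            optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    # One process per visible GPU; fall back to 2 CPU processes for a dry run.
    world_size = torch.cuda.device_count() if torch.cuda.is_available() else 2
    mp.spawn(train_fn, args=(world_size,), nprocs=world_size)
```

With CUDA_VISIBLE_DEVICES="0,1,2,3" as in the question, this should start four processes, each bound to one P100. Note that the batch size given to the DataLoader is per process, so the effective batch size is four times that value.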