Single-Process Multi-GPU is not the recommended mode for DDP

Problem

Hi, everyone. I have run into a problem with PyTorch DDP on a single node with multiple GPUs.
My setup is as follows:

os.environ["MASTER_PORT"] = "9999"
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3"
.....
distributed_sampler = torch.utils.data.distributed.DistributedSampler(dataset)
torch_dataloader = torch.utils.data.DataLoader(dataset,
                                               batch_size=64,
                                               pin_memory=True,
                                               num_workers=4,
                                               sampler=distributed_sampler)

model.cuda()
model = torch.nn.parallel.DistributedDataParallel(model)

But this setup is slower than DataParallel, and I get the following warning:

Warning Message

UserWarning: Single-Process Multi-GPU is not the recommended mode for DDP. 
In this mode, each DDP instance operates on multiple devices and creates multiple module replicas within one process. 
The overhead of scatter/gather and GIL contention in every forward pass can slow down training.
Please consider using one DDP instance per device or per module replica by explicitly setting device_ids or CUDA_VISIBLE_DEVICES. 

Environment

python: 3.7
pytorch: 1.7
GCP ml-engine image_uri: gcr.io/cloud-ml-public/training/pytorch-gpu.1-7
gpu_type: complex_model_m_p100 (p100x4 on single node)

I hope someone can answer my question. Any advice would be appreciated.

As the warning message says, it is better to use one process per GPU even on a single node. You can use torch.multiprocessing.spawn(train_fn, args=(world_size,), nprocs=world_size) to launch one training process per device, then wrap the model with DistributedDataParallel inside each process with device_ids set to that process's GPU.
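
Here is a minimal sketch of that pattern, assuming a single node; the toy Linear model is a placeholder, and the master address and port are arbitrary choices for local training:

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP

def train_fn(rank, world_size):
    # One process per GPU: the spawn rank doubles as the local GPU index.
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "9999"
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    # Move the model to this process's GPU and pass device_ids so each
    # DDP instance manages exactly one device (no scatter/gather replicas).
    model = torch.nn.Linear(10, 10).cuda(rank)  # placeholder model
    model = DDP(model, device_ids=[rank])

    # ... build the DistributedSampler/DataLoader and run the training loop here ...

    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = torch.cuda.device_count()  # e.g. 4 on a p100x4 node
    mp.spawn(train_fn, args=(world_size,), nprocs=world_size)

Because each process drives exactly one GPU, the per-iteration scatter/gather and the GIL contention mentioned in the warning go away, which is why this mode is typically faster than both single-process DDP and DataParallel.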