DataLoader: Rerunning with num_workers=0 may give better error trace

:question: Questions and Help

Hi everyone. I have run into a problem when using DDP: I want to train on a single node with 4 GPUs on GCP.
If I set num_workers=0 everything works, but training is slow and I would like to speed it up.
However, whenever I set num_workers>0 I always get the following error message.

Error message

RuntimeError: DataLoader worker exited unexpectedly with exit code 1. 
Details are lost due to multiprocessing. Rerunning with num_workers=0 may give better error trace
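
As the message itself suggests, one way to surface the real traceback is to temporarily iterate the data with num_workers=0 in the main process. Below is a minimal debugging sketch; the helper name debug_iterate and its max_batches argument are just for illustration, and processed_dataset is the same dataset object used in the code further down:

import torch

def debug_iterate(processed_dataset, max_batches=10):
    # Build a DataLoader without worker processes so any exception raised
    # while loading a batch surfaces directly in the main process, with a
    # full stack trace instead of the "worker exited unexpectedly" message.
    debug_loader = torch.utils.data.DataLoader(processed_dataset,
                                               batch_size=32,
                                               num_workers=0)
    for i, batch in enumerate(debug_loader):
        # Iterating a few batches is usually enough to reproduce the failure.
        if i >= max_batches:
            break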

code

import torch
from absl import app

def launch_training_job(local_rank, 
                       processed_dataset):
    ### ddp ###
    torch.distributed.init_process_group(backend='nccl',
                                         world_size=4,
                                         rank=local_rank)
    torch.cuda.set_device(local_rank)
    print('[INFO] Starting nccl for ddp.')
    
    distributed_sampler = torch.utils.data.distributed.DistributedSampler(processed_dataset)
    
    processed_sms_dataloader = torch.utils.data.DataLoader(processed_dataset,
                                                           batch_size=32,
                                                           pin_memory=True,
                                                           num_workers=2,
                                                           sampler=distributed_sampler)

def main(argv):
    .......
    num_gpus=4
    torch.multiprocessing.spawn(launch_training_job,
                                args=(processed_dataset,),
                                nprocs=num_gpus)
if __name__ == "__main__":
    app.run(main)

Environment

  • GCP ML Engine machine type: complex_model_m_p100
      CPUs: 16
      RAM: 60 GB
      GPU: NVIDIA Tesla P100 × 1
  • GCP image URI: gcr.io/cloud-ml-public/training/pytorch-gpu.1-7

I hope someone can help; I would really appreciate it.
@ptrblck I found that you have answered a similar issue.
Could you give me some advice? Thank you.

Can you confirm that there are 4 GPUs available on your machine? You can check that with torch.cuda.device_count(). Alternatively, you can determine the world size programmatically when starting DDP; you could try something like:

import torch
from absl import app

def launch_training_job(local_rank, world_size,
                        processed_dataset):
    ### ddp ###
    torch.distributed.init_process_group(backend='nccl',
                                         world_size=world_size,
                                         rank=local_rank)
    torch.cuda.set_device(local_rank)
    print('[INFO] Starting nccl for ddp.')
    
    distributed_sampler = torch.utils.data.distributed.DistributedSampler(processed_dataset)
    
    processed_sms_dataloader = torch.utils.data.DataLoader(processed_dataset,
                                                           batch_size=32,
                                                           pin_memory=True,
                                                           num_workers=world_size,
                                                           sampler=distributed_sampler)

def main(argv):
    .......
    # world_size == number of GPUs
    world_size=torch.cuda.device_count()
    torch.multiprocessing.spawn(launch_training_job,
                                args=(world_size, processed_dataset),
                                nprocs=world_size)
if __name__ == "__main__":
    app.run(main)
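
For the first suggestion above (confirming the GPU count), a quick standalone sanity check on the training machine could be as simple as:

import torch

# Print how many CUDA devices PyTorch can see on this node.
# If this prints 1 instead of 4, spawning 4 NCCL processes on this machine
# will not work as intended, and world_size should match this number.
print(torch.cuda.device_count())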

If this does not help, perhaps you could try posting this in the data loader section of the forums instead?