Torch.mp.spawn gets stuck when using DataLoader with num_workers > 0

I’m training models using DDP on 4 GPUs and 32 vCPUs.

I’m using DDP with torch.multiprocessing.spawn to do this. With num_workers=0 the code below runs fine and trains the three models one after the other.
But when I run the same code with num_workers=4, training of model1 is about 3.3x faster;
however, after the training of model1 completes (all ranks reach “training complete”), the program gets stuck inside mp.spawn(), so training of model2 never starts.

import os
import torch
import torch.nn as nn
import torch.multiprocessing as mp
from torch.distributed import init_process_group
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader

def multi_gpu_training(rank, model_class, mel_spec, world_size, name, path, metric_key, eval_mode, wandb_):
    torch.multiprocessing.set_sharing_strategy('file_system')  # avoids the "too many open files" error
    os.environ['MASTER_ADDR'] = "localhost"
    os.environ["MASTER_PORT"] = "12335"
    init_process_group(backend='nccl', rank=rank, world_size=world_size)

    device = torch.device(rank)
    print("rank", rank, device)
    model = load_model(model_class).to(device)
    model = nn.SyncBatchNorm.convert_sync_batchnorm(model)
    model = DDP(model, device_ids=[rank], find_unused_parameters=False)
    train_dataset = Duration_Dataset(train_file, config.f0_file, config.durations_file, config.xvectors_file, mel_spec=mel_spec)
    train_dataset = DataLoader(train_dataset, pin_memory=True, persistent_workers=True, batch_size=config.batch_size, shuffle=False, collate_fn=batch_processing, num_workers=config.num_workers)
    test_dataset = None
    if rank == 0:
        test_dataset = Duration_Dataset(val_file, config.f0_file, config.durations_file, config.xvectors_file, mel_spec=mel_spec)
        test_dataset = DataLoader(test_dataset, batch_size=config.batch_size, shuffle=False, collate_fn=batch_processing, sampler=None)
    train_model(train_dataset, model, device, name=name, path=path, metrics_key=metric_key, eval_mode=eval_mode, wandb_=wandb_)

    return "training complete"

def multi_gpu_process(model, mel_spec, name, path, metrics_key, eval_mode, wandb_):
    # torch.multiprocessing.set_start_method('spawn')
    world_size = torch.cuda.device_count()
    print('World Size:', world_size)
    mp.spawn(multi_gpu_training, args=(model, mel_spec, world_size, name, path, metrics_key, eval_mode, wandb_), nprocs=world_size)
    print('out of the spawning')

if __name__ == '__main__':

    multi_gpu_process(model1, mel_spec, name, path, metrics_key, eval_mode, wandb_)
    multi_gpu_process(model2, mel_spec, name, path, metrics_key, eval_mode, wandb_)
    multi_gpu_process(model3, mel_spec, name, path, metrics_key, eval_mode, wandb_)

cc @ejguan @nivek for dataloader. Any ideas why num_workers would impact this?

You could also try, instead of spawning your own processes in the training script, spawning them outside the script and using torch.distributed.run: torchrun (Elastic Launch) — PyTorch master documentation
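For example, here is a minimal sketch of the env-var based setup that torchrun expects; the script name, the --nproc_per_node value, and the training body are placeholders, not your actual code:

# launched externally, e.g.:  torchrun --nproc_per_node=4 train_script.py
import os
import torch
from torch.distributed import init_process_group, destroy_process_group

def main():
    # torchrun sets RANK, WORLD_SIZE, LOCAL_RANK, MASTER_ADDR and MASTER_PORT for you,
    # so the process group can be initialized without passing rank/world_size explicitly
    init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    device = torch.device(local_rank)
    # ... build the model / DataLoader and call train_model(...) as before ...
    destroy_process_group()

if __name__ == "__main__":
    main()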

Could you please try to explicitly clean up the training DataLoader, since you are using persistent_workers?

You can do:

it = iter(train_dataset)
it._shutdown_worker()
del train_dataset
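If that alone does not help, one more cleanup step worth checking (this is an assumption about your setup, since the code after train_model is not shown) is whether each spawned worker tears down its process group before returning, so that the next mp.spawn call starts from a clean state:

import torch.distributed as dist

def worker_teardown(train_loader):
    # hypothetical helper, called at the very end of multi_gpu_training
    # drop this function's reference so the DataLoader and its persistent
    # workers can be cleaned up once nothing else holds it
    del train_loader
    # release the NCCL communicator / rendezvous state before the process exits
    if dist.is_initialized():
        dist.destroy_process_group()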

Thanks for replying so quickly!
I had tried torchrun earlier, but I was getting a ‘broken pipe’ error after the training of each model, which is why I switched to mp.spawn.
Now that I’ve tried torchrun again, the hang is gone,
so I guess it is better to resolve that broken pipe error instead.

I tried this code earlier and it didn’t work:
#train_dataset._iterator._shutdown_workers()
#del train_dataset._iterator
#train_dataset,test_dataset,model=None,None,None
#gc.collect()
#torch.cuda.empty_cache()

For this code:
it = iter(train_dataset)
it._shutdown_worker()
del train_dataset
I get: AttributeError: '_MultiProcessingDataLoaderIter' object has no attribute '_shutdown_worker'
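For reference, on recent PyTorch versions the private iterator method is spelled _shutdown_workers (plural), so a variant like the one below may avoid the AttributeError; it relies on a private API that can change between releases, so treat it as a sketch rather than a supported solution:

it = iter(train_dataset)  # train_dataset is the DataLoader here
if hasattr(it, "_shutdown_workers"):
    it._shutdown_workers()  # plural: asks the persistent worker processes to exit
del it
del train_dataset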

I read the issues on GitHub; they are related to mp.spawn.

But I’ll be using torchrun from now on, so this is resolved for me.

Do you mind sharing which issue you have seen? I might take a look at the mp.spawn issue.

I meant to say they are related to the DataLoader worker processes (num_workers) and not primarily to mp.spawn;
apologies for the confusion.

See NeighborLoader with loader worker processes fails on GPU · Issue #5340 · pyg-team/pytorch_geometric · GitHub