Questions and Help
Hi, everyone. I have run into a problem when using DDP.
I want to train on a single node with 4 GPUs on GCP.
If I set num_workers=0, training works but is slow, and I want to speed it up.
But whenever I set num_workers>0, I always get the following error message.
Error message
RuntimeError: DataLoader worker exited unexpectedly with exit code 1.
Details are lost due to multiprocessing. Rerunning with num_workers=0 may give better error trace
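Following the hint in the error message, one way to surface the real exception might be to index the dataset directly in the main process, with no worker processes involved. This is only a minimal sketch: `find_first_bad_index` and `ToyDataset` are hypothetical names I made up for illustration, and `ToyDataset` is just a stand-in for my real `processed_dataset`.

```python
import traceback


def find_first_bad_index(dataset, limit=None):
    """Index the dataset in the current process.

    Returns (index, exception) for the first sample that raises,
    or None if every checked sample loads without an exception.
    """
    n = len(dataset) if limit is None else min(limit, len(dataset))
    for i in range(n):
        try:
            dataset[i]  # same call a DataLoader worker would make
        except Exception as exc:
            traceback.print_exc()  # full trace, not hidden by multiprocessing
            return i, exc
    return None


class ToyDataset:
    """Hypothetical stand-in for processed_dataset; None marks a broken sample."""

    def __init__(self, values):
        self.values = values

    def __len__(self):
        return len(self.values)

    def __getitem__(self, idx):
        if self.values[idx] is None:
            raise ValueError(f"bad sample at index {idx}")
        return self.values[idx]


result = find_first_bad_index(ToyDataset([1, 2, 3, None, 5]))
```

If this check passes on the real dataset, the failure is probably not in `__getitem__` itself, and the cause may lie in how the worker processes are started or in the resources available to them.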
Code
import torch
from absl import app


def launch_training_job(local_rank,
                        processed_dataset):
    ### ddp ###
    # With the default env:// init method, MASTER_ADDR and MASTER_PORT
    # must be set in the environment before this call.
    torch.distributed.init_process_group(backend='nccl',
                                         world_size=4,
                                         rank=local_rank)
    torch.cuda.set_device(local_rank)
    print('[INFO] Starting nccl for ddp.')
    distributed_sampler = torch.utils.data.distributed.DistributedSampler(processed_dataset)
    processed_sms_dataloader = torch.utils.data.DataLoader(processed_dataset,
                                                           batch_size=32,
                                                           pin_memory=True,
                                                           num_workers=2,
                                                           sampler=distributed_sampler)


def main(argv):
    .......
    num_gpus=4
    # args must be a tuple, hence the trailing comma
    torch.multiprocessing.spawn(launch_training_job,
                                args=(processed_dataset,),
                                nprocs=num_gpus)


if __name__ == "__main__":
    app.run(main)
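To make clear what I expect each of the 4 processes to load, here is a rough pure-Python sketch of how I understand `DistributedSampler` to split indices when shuffling is turned off (the code above uses the default, which shuffles). `shard_indices` is a hypothetical helper written only for illustration, not part of PyTorch, and it assumes the dataset has at least as many samples as the padding needs.

```python
import math


def shard_indices(dataset_len, num_replicas, rank):
    """Approximate the per-rank index split without shuffling.

    Indices are padded by repeating from the start so the total is
    divisible by num_replicas, then each rank takes every
    num_replicas-th index starting at its own rank.
    """
    num_samples = math.ceil(dataset_len / num_replicas)
    total_size = num_samples * num_replicas
    indices = list(range(dataset_len))
    # Pad so every rank gets the same number of samples.
    indices += indices[: total_size - dataset_len]
    return indices[rank:total_size:num_replicas]


# 10 samples split across 4 ranks, as in a single node with 4 GPUs.
shards = [shard_indices(10, 4, rank) for rank in range(4)]
```

With `num_workers=2` in each of the 4 spawned processes, this setup would start 8 DataLoader workers in total, each serving only its own rank's shard.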
Environment
- GCP ml-engine
complex_model_m_p100
CPUs: 16
RAM: 60 GB
GPU: NVIDIA Tesla P100 * 1
- GCP image URI: gcr.io/cloud-ml-public/training/pytorch-gpu.1-7
I hope someone can help; I would really appreciate it.
@ptrblck I found that you have answered a similar issue.
Could you give me some advice? Thank you.