Setting up DistributedDataParallel for single-node "multi-CPU" training

Hi Forum,

I am adapting my model training to run across multiple processes on CPU (multiple cores) on a single node with DistributedDataParallel.

When running the script with torchrun I receive the following error message:

NOTE: Redirects are currently not supported in Windows or MacOs.
master_addr is only used for static rdzv_backend and when rdzv_endpoint is not specified.
NOTE: Redirects are currently not supported in Windows or MacOs.

[W socket.cpp:601] [c10d] The IPv6 network addresses of (, 61266) cannot be retrieved (gai error: 8 - nodename nor servname provided, or not known).

WARNING:torch.distributed.elastic.agent.server.api:Received 2 death signal, shutting down workers
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 37667 closing signal SIGINT
  File "/Users/ms/ms_transformer/ms_transformer/src/embedding/", line 145, in run_node
  File "/Users/ms/ms_transformer/ms_transformer/src/embedding/", line 128, in init_process_group
  File "/Users/ms/envs/pytorch/lib/python3.10/site-packages/torch/distributed/", line 900, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/Users/ms/envs/pytorch/lib/python3.10/site-packages/torch/distributed/", line 245, in _env_rendezvous_handler
    store = _create_c10d_store(master_addr, master_port, rank, world_size, timeout)
  File "/Users/ms/envs/pytorch/lib/python3.10/site-packages/torch/distributed/", line 176, in _create_c10d_store
    return TCPStore(
RuntimeError: Interrupted system call

(I terminate the script with Control-C.)

I invoke:

torchrun --standalone --nnodes=1 --nproc_per_node=1

The error appears regardless of what I choose for --nproc_per_node.

It looks as though node discovery is not working correctly. My impression is that it ought to be possible to distribute training across multiple processes on one machine. What might be the problem?
PyTorch 2.0.1, Apple M1, macOS Ventura.

I set up the process group with:

def init_process_group():
    local_rank = int(os.environ["LOCAL_RANK"])
    dist.init_process_group("gloo", rank=local_rank)
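Since the gai error suggests the rendezvous hostname cannot be resolved, one variant I am considering pins the rendezvous to IPv4 loopback. This is a sketch of a possible workaround, not something from the docs; the MASTER_ADDR/MASTER_PORT values are my guess:

```python
import os
import torch.distributed as dist

def init_process_group():
    # Workaround guess: force the rendezvous onto IPv4 loopback so
    # getaddrinfo never attempts the IPv6 lookup that fails in the
    # [W socket.cpp:601] warning. This overrides whatever torchrun set.
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ.setdefault("MASTER_PORT", "29500")
    local_rank = int(os.environ["LOCAL_RANK"])
    world_size = int(os.environ["WORLD_SIZE"])
    dist.init_process_group("gloo", rank=local_rank, world_size=world_size)
```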

def cleanup_process_group():
    dist.destroy_process_group()

LOCAL_RANK is read correctly from the environment.

I wrap the model:

model = DDP(model, find_unused_parameters=True)

and leave device_ids and output_device as None, as per the documentation for CPU modules.

Any input would be greatly appreciated,


I’m not familiar with training on the M1 CPU, but I’m curious why you would need DDP on a single node for CPU training. My understanding is that typical numerical libraries are able to leverage multicore CPUs behind the scenes for operations such as matrix multiplies and many pointwise operations. From the stack trace I’m guessing that you’re training some kind of transformer model, which should be fairly matmul-heavy. Have you checked whether training without DDP yields acceptable utilization of your system?
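For example, you can inspect and tune PyTorch's intra-op thread pool directly, with no process group involved (a quick sketch; the thread count below is just an illustration, not a recommendation):

```python
import torch

# How many intra-op threads PyTorch currently uses for matmuls etc.
print(torch.get_num_threads())

# Tune explicitly, e.g. one thread per physical core (illustrative value):
torch.set_num_threads(8)

# A matmul like this is parallelized across those threads:
a = torch.randn(1024, 1024)
b = torch.randn(1024, 1024)
c = a @ b
```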

I am training an embedding with tensors too large to fit on my GPU.
PyTorch does indeed parallelize work across cores on my machine, but not as efficiently as I would like, even after tweaking.
The aim is to scale up training, so parallel efficiency matters. I don’t think multi-GPU training will be cost-effective for this application.
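One thing I plan to try is a file:// rendezvous, which sidesteps TCP hostname resolution entirely; whether that avoids the macOS issue is an assumption on my part. A minimal sanity-check sketch:

```python
import tempfile

import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank: int, world_size: int, init_file: str) -> None:
    # FileStore rendezvous: no hostname lookup is involved at all.
    dist.init_process_group(
        "gloo",
        init_method=f"file://{init_file}",
        rank=rank,
        world_size=world_size,
    )
    t = torch.ones(1) * (rank + 1)
    dist.all_reduce(t)  # sums (rank + 1) over all ranks
    assert t.item() == sum(range(1, world_size + 1))
    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = 2
    # Path that does not exist yet; the FileStore creates it.
    init_file = tempfile.mktemp(prefix="ddp_init_")
    mp.spawn(worker, args=(world_size, init_file), nprocs=world_size, join=True)
```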