Can dist.TCPStore store NamedTuple?

JuyiLin · November 8, 2022, 3:58pm

I want to share a NamedTuple (such as mytuple) between different ranks. Is the following code possible?

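    import torch.distributed as dist

    # The fourth argument is is_master: True for the server store, False for clients.
    tcpstore = dist.TCPStore(MASTER_ADDR, MASTER_PORT, world_size,
                             MASTER_ADDR == LOCAL_ADDR)
    dist.init_process_group('nccl', store=tcpstore, rank=rank, world_size=world_size)
    if rank == 0:
        tcpstore.set("my1", mytuple)
    else:
        # get() takes only the key and returns the stored value
        received = tcpstore.get("my1")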

I have read How to store embeddings from different ranks in DistributedDataParallel mode? - #4 by mrshenli, but if I have 8 GPUs, how would I initialize and pass 8 SimpleQueue objects?
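For context on the snippet above: the documented TCPStore.set takes a string value and get returns bytes, so a NamedTuple cannot be passed as-is and would first have to be serialized. A minimal sketch of one way to do that, assuming the tuple's fields are JSON-serializable (so not tensors); Sample, encode and decode are hypothetical names used only for illustration:

```python
import json
from typing import NamedTuple


class Sample(NamedTuple):  # hypothetical stand-in for `mytuple`
    node_id: int
    name: str


def encode(sample: Sample) -> str:
    """Turn the NamedTuple into a JSON string that a string-valued store can hold."""
    return json.dumps(sample._asdict())


def decode(payload) -> Sample:
    """Rebuild the NamedTuple; accepts str or bytes, since a store's get() returns bytes."""
    return Sample(**json.loads(payload))
```

With a store like the one in the question, rank 0 would call tcpstore.set("my1", encode(mytuple)) and the other ranks decode(tcpstore.get("my1")). JSON is used here instead of pickle so the payload stays a plain string.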

wanchaol · November 8, 2022, 8:13pm

@JuyiLin could you share more about your motivation? dist.Store is only intended to be used for process group initialization. It is not exposed as a public API for arbitrary usage; it might work out of the box in some cases, but that is not guaranteed.

Specifically, if you want to share a tuple of tensors, you can dist.broadcast each tensor to every rank.
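A minimal sketch of that per-tensor broadcast, assuming the process group is already initialized and that every rank passes tensors of the same shape and dtype. Batch and broadcast_batch are hypothetical names, not part of torch.distributed:

```python
from typing import NamedTuple

import torch
import torch.distributed as dist


class Batch(NamedTuple):  # hypothetical stand-in for the tuple being shared
    features: torch.Tensor
    labels: torch.Tensor


def broadcast_batch(batch: Batch, src: int = 0) -> Batch:
    """Broadcast each tensor field from rank `src` to all other ranks.

    dist.broadcast works in place: on the source rank the tensor is sent,
    on every other rank the tensor's contents are overwritten.
    With the nccl backend the tensors must already live on a CUDA device.
    """
    for tensor in batch:
        dist.broadcast(tensor, src=src)
    return batch
```

Rank 0 would pass the real tensors, while the other ranks pass placeholders (for example torch.empty with matching shapes and dtypes) that are filled by the call.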

JuyiLin · November 8, 2022, 8:27pm

Thank you for your time! I have tried to use dist.scatter_object_list, but it failed. Could you have a look?

github.com/pytorch/pytorch

distributed.scatter_object_list shows RuntimeError: Tensors must be CUDA and dense

opened 04:58PM - 08 Nov 22 UTC

LukeLIN-web

### 📚 The doc issue

I tried https://pytorch.org/docs/stable/distributed.html#t…orch.distributed.scatter_object_list codes

```python
def run(rank, world_size, data, dataset):
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '12355'
    dist.init_process_group('nccl', rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)
    if rank == 0 :
        objects = ["foo", 12, {1: 2}] # any picklable object
    else:
        objects = [None, None, None]
    objects =objects
    outputlist = [None]
    dist.scatter_object_list(outputlist, objects, src=0)
    print(outputlist)

if __name__ == '__main__':
    dataset = Reddit('/data/Reddit')
    world_size = torch.cuda.device_count()
    data = dataset[0]
    print('Let\'s use', world_size, 'GPUs!')
    data_split = (data.train_mask, data.val_mask, data.test_mask)
    mp.spawn(
        run,
        args=(world_size, data, dataset),
        nprocs=world_size,
        join=True
    )
```

But it shows:

    Traceback (most recent call last):
      File "smallcase_reddit_quiver_send.py", line 99, in <module>
      File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
        return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
      File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
        while not context.join():
      File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 150, in join
        raise ProcessRaisedException(msg, error_index, failed_process.pid)
    torch.multiprocessing.spawn.ProcessRaisedException:

    -- Process 0 terminated with the following error:
    Traceback (most recent call last):
      File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
        fn(i, *args)
      File "/root/share/gnnproject/quiver/smallcase_reddit_quiver_send.py", line 84, in run
        dist.scatter_object_list(outputlist, objects, src=0)
      File "/opt/conda/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 1919, in scatter_object_list
        broadcast(max_tensor_size, src=src, group=group)
      File "/opt/conda/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 1159, in broadcast
        work = default_pg.broadcast([tensor], opts)
    RuntimeError: Tensors must be CUDA and dense

### Suggest a potential alternative/fix

_No response_