Gloo timeout unbound buffer

I am running into a Gloo timeout when I try to gather an object from the other processes with all_gather_object:

mask_right = [None for _ in range(world_size)]
torch.distributed.all_gather_object(mask_right, all_predictions_tmp[rxn]['mask_right'])

Each all_predictions_tmp[rxn]['mask_right'] is quite large (440076 instances), so I am not sure whether this might be the cause of the timeout. The lines above work with smaller objects, so I am assuming the size is the issue. However, I also tried gathering the tensor in batches, and that still did not work. Below is the full error:

Traceback (most recent call last):
  File "/home/ubuntu/NinaNotebooks/forward-synthesis/stages/model/", line 393, in <module>
  File "/home/ubuntu/NinaNotebooks/forward-synthesis/.venv/lib/python3.9/site-packages/torch/multiprocessing/", line 240, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/ubuntu/NinaNotebooks/forward-synthesis/.venv/lib/python3.9/site-packages/torch/multiprocessing/", line 198, in start_processes
    while not context.join():
  File "/home/ubuntu/NinaNotebooks/forward-synthesis/.venv/lib/python3.9/site-packages/torch/multiprocessing/", line 160, in join
    raise ProcessRaisedException(msg, error_index,

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/home/ubuntu/NinaNotebooks/forward-synthesis/.venv/lib/python3.9/site-packages/torch/multiprocessing/", line 69, in _wrap
    fn(i, *args)
  File "/home/ubuntu/NinaNotebooks/forward-synthesis/stages/model/", line 268, in benchmark
    torch.distributed.all_gather_object(mask_right, all_predictions_tmp[rxn]['mask_right'][i*batch_len:(i+1)*batch_len])
  File "/home/ubuntu/NinaNotebooks/forward-synthesis/.venv/lib/python3.9/site-packages/torch/distributed/", line 1657, in all_gather_object
    all_gather(object_size_list, local_size, group=group)
  File "/home/ubuntu/NinaNotebooks/forward-synthesis/.venv/lib/python3.9/site-packages/torch/distributed/", line 2075, in all_gather
RuntimeError: [../third_party/gloo/gloo/transport/tcp/] Timed out waiting 1800000ms for send operation to complete
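
For reference, here is a minimal sketch of the kind of batched gather I mean. It is not my exact code: `chunk_bounds`, `gather_in_chunks`, and `num_chunks` are hypothetical names. It assumes the local data is a plain Python list, that the process group is already initialized, and that every rank passes the same `num_chunks`, so all ranks issue the same number of all_gather_object calls even if their local lengths differ.

```python
def chunk_bounds(total_len, num_chunks):
    """Split range(total_len) into num_chunks contiguous (start, stop) pairs.

    Chunks differ in size by at most one element; trailing chunks may be
    empty when total_len < num_chunks.
    """
    if num_chunks <= 0:
        raise ValueError("num_chunks must be positive")
    base, extra = divmod(total_len, num_chunks)
    bounds = []
    start = 0
    for i in range(num_chunks):
        # The first `extra` chunks take one additional element.
        stop = start + base + (1 if i < extra else 0)
        bounds.append((start, stop))
        start = stop
    return bounds


def gather_in_chunks(local_items, world_size, num_chunks):
    """Gather a list from every rank, one chunk per collective call.

    Assumption: num_chunks is identical on all ranks, so each rank makes
    exactly num_chunks calls to all_gather_object.
    """
    # Imported here so chunk_bounds can be used without torch installed.
    import torch.distributed as dist

    gathered = [[] for _ in range(world_size)]
    for start, stop in chunk_bounds(len(local_items), num_chunks):
        received = [None for _ in range(world_size)]
        dist.all_gather_object(received, local_items[start:stop])
        # received[rank] holds that rank's chunk for this round.
        for rank, part in enumerate(received):
            gathered[rank].extend(part)
    return gathered
```

In my case this would be called as something like gather_in_chunks(all_predictions_tmp[rxn]['mask_right'], world_size, num_chunks) on every rank.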

Any help would be great, thank you!