Using `torch.distributed.all_gather_object` raises an error with 1 GPU but works fine with multiple GPUs

seankala · May 3, 2023, 12:32am

I’m currently using HuggingFace Accelerate to run some distributed experiments and have the following code inside of my evaluation loop:

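import torch.distributed as dist

model.eval()
device = accelerator.device
intermediate_value = {}
# One slot per process to receive the gathered objects.
output = [None] * accelerator.num_processes

# Some evaluation code.

# Collect intermediate_value from every process into output.
dist.all_gather_object(output, intermediate_value)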

When I’m using multiple GPUs it’s fine, but when I’m using only one I get the following error:

RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.

What I’m wondering is this: I thought that if you wrap your model, optimizer, etc. with HuggingFace Accelerate, you don’t have to call torch.distributed.init_process_group yourself. If that’s the case, why doesn’t it work when I only have 1 GPU?

Thanks in advance.

ptrblck · May 3, 2023, 6:59am

This question seems to be HuggingFace-specific, so you might want to post it on their discussion board.
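
A possible PyTorch-side workaround, as a minimal sketch rather than an Accelerate-endorsed fix: the error indicates that no default process group exists in the single-GPU run, so the collective call can be guarded with `dist.is_available()` and `dist.is_initialized()`. The helper name `gather_objects` is hypothetical, and the fallback assumes that a single-process run should simply return a one-element list.

```python
import torch.distributed as dist


def gather_objects(obj):
    """Gather `obj` from every process; fall back to [obj] without a process group.

    `gather_objects` is a hypothetical helper name, not part of PyTorch or Accelerate.
    """
    if dist.is_available() and dist.is_initialized():
        # Distributed run: one output slot per rank in the default group.
        output = [None] * dist.get_world_size()
        dist.all_gather_object(output, obj)
        return output
    # No default process group (e.g. a single-GPU run): nothing to gather.
    return [obj]


intermediate_value = {}
output = gather_objects(intermediate_value)
```

In a run without an initialized process group this returns `[intermediate_value]`, so downstream code that iterates over `output` can stay the same in both the single-GPU and multi-GPU cases.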