We have 2 nodes, each with 4 A100 GPUs.
We are trying to use multiprocessing to create 8 processes, one for each GPU.
The issue is that torch.distributed only shows 4 in the device count, because it only reflects a single node.
How can we get access to all 8 GPUs across the 2 nodes?
The goal is to have all 8 of them (or a chosen subset of them) in a process group so that we can perform all_gather/all_reduce, and to assign each process to one GPU, which could be on different nodes.
We are dividing the tensors ourselves, so we are not using constructs like DDP or FSDP. We only need some mechanism that lets each process handle one of the 8 GPUs and gives us a process group spanning all 8 GPUs.
We want the ranks for all 8 GPUs to be 0, 1, 2, 3, 4, 5, 6, 7,
and then we compute on them, handling each through a separate torch process.
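Roughly, this is what we would like each of the 8 processes to end up running. This is just a sketch: the address is a placeholder, and how to initialize things so that world_size is 8 with global ranks 0..7 across both nodes is exactly the part we are unsure about.

```python
import torch
import torch.distributed as dist

def run_worker(global_rank, local_rank, world_size=8):
    # How do we get here with world_size=8 and global ranks 0..7
    # spanning both nodes? That is the open question.
    dist.init_process_group(
        backend="nccl",
        init_method="tcp://<node1-address>:29500",  # placeholder rendezvous point
        rank=global_rank,
        world_size=world_size,
    )

    # Each process drives exactly one local GPU.
    torch.cuda.set_device(local_rank)

    # We partition the tensors ourselves; this is just a dummy shard.
    shard = torch.full((1024,), float(global_rank), device=f"cuda:{local_rank}")

    # Collective over all 8 ranks across both nodes.
    dist.all_reduce(shard, op=dist.ReduceOp.SUM)

    dist.destroy_process_group()
```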
What should the approach be? From the PyTorch documentation, it is not clear how we can achieve this.
https://pytorch.org/docs/stable/elastic/run.html#environment-variables
RANK
- The rank of the worker within a worker group.
Can't you utilize the environment variable RANK?
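For example, if torchrun launches the workers, it exports RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR, and MASTER_PORT, and the worker only has to read them. A minimal sketch:

```python
import os
import torch
import torch.distributed as dist

# torchrun sets RANK / LOCAL_RANK / WORLD_SIZE / MASTER_ADDR / MASTER_PORT,
# so init_method="env://" can pick them up directly.
dist.init_process_group(backend="nccl", init_method="env://")

local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

print(f"global rank {dist.get_rank()} of {dist.get_world_size()}, local GPU {local_rank}")
```

Launched on each node with something like `torchrun --nnodes=2 --nproc_per_node=4 --node_rank=<0 or 1> --master_addr=<first node> --master_port=29500 worker.py` (worker.py is just a stand-in name), this gives ranks 0..3 on the first node and 4..7 on the second.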
Hello,
I am not using torchrun.
We discover the number of GPUs in the environment, ask the user to decide how many GPUs they want, and then manually launch the distributed run based on that input.
This works fine on a single-node system, where discovering the number of GPUs is easy.
The problem is that torch.distributed.get_device_count() only gives single-node information (I could be wrong).
I want something that discovers the environment itself, or at least assigns ranks across different nodes, like node1 (ranks 0, 1, 2, 3) and node2 (ranks 4, 5, 6, 7).
Is there a way to discover the whole multi-node structure? It would show node1 and node2, with GPU0, GPU1, GPU2, GPU3 on node 1, and similarly on node 2 we would discover GPU4, GPU5, GPU6, GPU7.
This should be done from inside a Python function so that I can launch multiple processes. The rank layout I have in mind is sketched below.
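To make the mapping concrete, this is the layout I am after. Just a sketch, and node_rank would have to be supplied per node, since as far as I can tell it cannot be discovered:

```python
import torch

def global_ranks_for_node(node_rank, gpus_per_node=None):
    # torch.cuda.device_count() only sees the GPUs on *this* node.
    if gpus_per_node is None:
        gpus_per_node = torch.cuda.device_count()

    # node 0 -> ranks 0..3, node 1 -> ranks 4..7
    return [node_rank * gpus_per_node + local_rank
            for local_rank in range(gpus_per_node)]

# global_ranks_for_node(0, 4) == [0, 1, 2, 3]
# global_ranks_for_node(1, 4) == [4, 5, 6, 7]
```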
Thank you, any suggestions would be really helpful.
If you don't know the standard usage of torch.distributed in a multi-node setting, the following web page might be helpful:
Do you really have torch.distributed.get_device_count()? I didn't find that function in the official PyTorch package. Maybe you mean torch.cuda.device_count()?
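Note that torch.cuda.device_count() only ever reports the GPUs visible on the local node; the count across nodes comes from the process group once it has been initialized. A minimal illustration:

```python
import torch
import torch.distributed as dist

# Only the GPUs visible to this node/process, i.e. 4 on each of your machines.
print(torch.cuda.device_count())

# The global count across nodes only exists once a process group is up.
if dist.is_initialized():
    print(dist.get_world_size())  # 8 in your 2 x 4 GPU setup
```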
Yes, sorry for the slip, you are right.
What I want is some mechanism to get all the nodes and their GPUs.
That means if I have 2 nodes, I get GPUs 0, 1, 2, 3, 4, 5, 6, 7,
and can launch multiprocessing processes on each of them, then all_reduce across all of them.
In a multi-node scenario, you have to specify the nodes to be used. PyTorch doesn't detect available nodes automatically.
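Concretely, you run one launcher per node and pass the node rank and the master address yourself. A minimal sketch, where the port, script layout, and command-line handling are just example choices:

```python
import sys
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

GPUS_PER_NODE = 4
WORLD_SIZE = 8  # 2 nodes x 4 GPUs


def worker(local_rank, node_rank, master_addr):
    global_rank = node_rank * GPUS_PER_NODE + local_rank  # node0: 0..3, node1: 4..7
    dist.init_process_group(
        backend="nccl",
        init_method=f"tcp://{master_addr}:29500",  # arbitrary free port on the first node
        rank=global_rank,
        world_size=WORLD_SIZE,
    )
    torch.cuda.set_device(local_rank)

    t = torch.full((4,), float(global_rank), device=f"cuda:{local_rank}")
    dist.all_reduce(t, op=dist.ReduceOp.SUM)  # sums contributions from all 8 ranks

    dist.destroy_process_group()


if __name__ == "__main__":
    node_rank = int(sys.argv[1])  # 0 on the first node, 1 on the second
    master_addr = sys.argv[2]     # reachable address of the first node
    mp.spawn(worker, args=(node_rank, master_addr), nprocs=GPUS_PER_NODE)
```

If that file is saved as, say, launch.py, you would run `python launch.py 0 <node0-address>` on the first node and `python launch.py 1 <node0-address>` on the second; nothing in PyTorch will find the second node for you.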
Thank you.
But what are the possible ways of using these nodes and GPUs apart from PyTorch?
MPI for Python (mpi4py) + CuPy is a possible solution. Please see Example Code.
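Roughly, that approach looks like the following. This is a minimal sketch, not the linked example, and it assumes a CUDA-aware MPI build plus mpi4py and CuPy installed on both nodes:

```python
from mpi4py import MPI
import cupy as cp

comm = MPI.COMM_WORLD
rank = comm.Get_rank()  # 0..7 across both nodes when launched with 8 ranks

# Bind each MPI rank to one of the GPUs visible on its node.
cp.cuda.Device(rank % cp.cuda.runtime.getDeviceCount()).use()

send = cp.full(1024, float(rank), dtype=cp.float32)
recv = cp.empty_like(send)

# With a CUDA-aware MPI, mpi4py can reduce directly on GPU buffers.
comm.Allreduce(send, recv, op=MPI.SUM)
```

Launched with something like `mpirun -np 8 -H node1:4,node2:4 python allreduce.py` (the host layout and script name are placeholders), which again means naming the nodes explicitly rather than discovering them.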