Discovering GPUs in a multi-node environment

We have 2 nodes, each with 4 A100 GPUs.
We are trying to use multiprocessing to create 8 processes, one per GPU.
The issue is that torch.distributed only reports a device count of 4, because it only sees the GPUs of a single node.
How can we get access to all 8 GPUs across the 2 nodes?
The goal would be to have all 8 of them (or a chosen subset) in a process group so that we can perform all_gather/all_reduce, and to assign each process to one GPU, which could be on a different node.
We are dividing the tensors ourselves, so we are not using constructs like DDP or FSDP. We only need some mechanism that lets each process handle one of the 8 GPUs and gives us a process group over all 8 GPUs.
We want the ranks for the 8 GPUs to be 0, 1, 2, 3, 4, 5, 6, 7,
and then we compute on them, handling each one through a different torch process.
What should be the approach? From the PyTorch documentation, it is not clear how we can achieve this.

https://pytorch.org/docs/stable/elastic/run.html#environment-variables

  1. RANK - The rank of the worker within a worker group.

Can't you utilize the RANK environment variable?
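
For example, a minimal sketch (assuming you export MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE, and LOCAL_RANK yourself for every process, following the torchrun naming conventions):

```python
import os
import torch
import torch.distributed as dist

def init_from_env():
    # Global rank across all nodes (0..7 for 2 nodes x 4 GPUs),
    # total number of processes, and the GPU index on this node.
    rank = int(os.environ["RANK"])
    world_size = int(os.environ["WORLD_SIZE"])
    local_rank = int(os.environ["LOCAL_RANK"])

    torch.cuda.set_device(local_rank)
    dist.init_process_group(
        backend="nccl",
        init_method="env://",   # reads MASTER_ADDR / MASTER_PORT
        rank=rank,
        world_size=world_size,
    )
    return rank, local_rank

if __name__ == "__main__":
    rank, local_rank = init_from_env()
    t = torch.full((1,), float(rank), device=f"cuda:{local_rank}")
    dist.all_reduce(t)          # default op is SUM over all 8 processes
    print(f"rank {rank}: {t.item()}")
    dist.destroy_process_group()
```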

Hello,
I am not using torchrun.
We discover the number of GPUs in the environment, then ask the users to decide how many GPUs they want, and we manually launch the distributed run based on their input.
We did this fine for a single-node system, where discovering the number of GPUs is easy.

The problem is that torch.distributed.get_device_count() only gives single-node information (I could be wrong).
I want something that discovers the environment itself, or at least assigns ranks across the different nodes, like node1 (ranks 0, 1, 2, 3) and node2 (ranks 4, 5, 6, 7).

Is there a way to discover the whole multi-node structure, so that it would show node1 and node2:
GPU0, GPU1, GPU2, GPU3 for node 1, and similarly for node 2 we would discover GPU4, GPU5, GPU6, GPU7?
This should be done from inside a Python function so that I can launch multiple processes.
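
Roughly, the rank assignment I have in mind would look like this (just a sketch of the intent; the function name is a placeholder):

```python
def global_ranks(node_index, gpus_per_node=4):
    # node 0 -> [0, 1, 2, 3], node 1 -> [4, 5, 6, 7]
    return [node_index * gpus_per_node + local_gpu
            for local_gpu in range(gpus_per_node)]

print(global_ranks(0))  # [0, 1, 2, 3]
print(global_ranks(1))  # [4, 5, 6, 7]
```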

Thank you, any suggestions would be really helpful.

If you don’t know the standard usage of torch.distributed in a multi-node setting, the following web page might be helpful:

Do you really have torch.distributed.get_device_count()? I didn’t find that function in the official PyTorch package. Maybe you mean torch.cuda.device_count().

Yes, sorry for the slip, you are right.
What I want is some mechanism to get all the nodes and their GPUs,
so that if I have 2 nodes, I get GPUs 0, 1, 2, 3, 4, 5, 6, 7
and can launch multiprocessing processes on each of them, then all_reduce across all of them.

In a multi-node scenario, you have to specify the nodes to be used. PyTorch doesn’t detect available nodes automatically.
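
One common pattern (just a sketch; the script name, flags, and helper names below are my own, not a PyTorch API) is to run the same small launcher once on every node, tell it its node index and the master address, and let it spawn one process per local GPU:

```python
# launch.py -- run once per node (names/flags are only illustrative), e.g.
#   node 0: python launch.py --node-rank 0 --master-addr 10.0.0.1
#   node 1: python launch.py --node-rank 1 --master-addr 10.0.0.1
import argparse
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(local_rank, node_rank, gpus_per_node, world_size, master_addr, master_port):
    # Global rank 0..7: node 0 owns ranks 0-3, node 1 owns ranks 4-7.
    global_rank = node_rank * gpus_per_node + local_rank
    os.environ["MASTER_ADDR"] = master_addr
    os.environ["MASTER_PORT"] = str(master_port)
    torch.cuda.set_device(local_rank)
    dist.init_process_group("nccl", rank=global_rank, world_size=world_size)

    t = torch.ones(1, device=f"cuda:{local_rank}") * global_rank
    dist.all_reduce(t)   # sums the ranks over all 8 processes
    print(f"global rank {global_rank}: {t.item()}")
    dist.destroy_process_group()

if __name__ == "__main__":
    p = argparse.ArgumentParser()
    p.add_argument("--node-rank", type=int, required=True)
    p.add_argument("--num-nodes", type=int, default=2)
    p.add_argument("--master-addr", required=True)
    p.add_argument("--master-port", type=int, default=29500)
    args = p.parse_args()

    gpus_per_node = torch.cuda.device_count()     # discovered locally (4 here)
    world_size = args.num_nodes * gpus_per_node   # 8 in total
    mp.spawn(
        worker,
        args=(args.node_rank, gpus_per_node, world_size,
              args.master_addr, args.master_port),
        nprocs=gpus_per_node,
    )
```

The same idea works without argparse if you read the node index and master address from your own configuration or from environment variables.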

Thank you.
But what are the possible ways of using these nodes and GPUs, apart from PyTorch?

MPI for Python + CuPy is a possible solution. Please see Example Code.
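
As a rough illustration (this assumes mpi4py and CuPy are installed and the underlying MPI build is CUDA-aware; the script and hostfile names are only examples):

```python
# allreduce_cupy.py -- run as, e.g.:
#   mpirun -np 8 --hostfile hosts python allreduce_cupy.py
from mpi4py import MPI
import cupy as cp

comm = MPI.COMM_WORLD
rank = comm.Get_rank()              # 0..7 across both nodes
size = comm.Get_size()

# One GPU per local rank on each node.
local_rank = rank % cp.cuda.runtime.getDeviceCount()
cp.cuda.Device(local_rank).use()

send = cp.full((4,), rank, dtype=cp.float32)
recv = cp.empty_like(send)
comm.Allreduce(send, recv, op=MPI.SUM)   # sums across all 8 GPUs
print(f"rank {rank}/{size} on GPU {local_rank}: {recv}")
```

If the MPI build is not CUDA-aware, you would have to stage the data through host (NumPy) buffers before calling Allreduce.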