"phantom" process with zero GPU usage

I am running a Distributed Data Parallel (DDP) job on a single node. The node has 10 GPUs, but I am using only 4 of them. The cluster is managed by Slurm, and I submit the job with sbatch exp.sh, where exp.sh is as follows:

#!/bin/bash
#SBATCH --partition=$MY_PARTITION
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --gres=gpu:4
#SBATCH --cpus-per-task=2

srun python main.py
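
As a sanity check, a throwaway script like the one below (check_env.py is just an illustrative name, not part of my job) can be launched with srun in place of main.py; with the settings above it should print 4 lines with SLURM_LOCALID values 0-3:

import os

# Print the Slurm-provided identifiers for this task. With the sbatch
# settings above, 4 lines are expected, one per task.
print(
    f"host={os.uname().nodename} "
    f"SLURM_PROCID={os.environ.get('SLURM_PROCID')} "
    f"SLURM_LOCALID={os.environ.get('SLURM_LOCALID')} "
    f"SLURM_NTASKS={os.environ.get('SLURM_NTASKS')} "
    f"CUDA_VISIBLE_DEVICES={os.environ.get('CUDA_VISIBLE_DEVICES')}"
)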

In main.py (and the other .py files it calls), I set the necessary environment variables and initialize communication between the workers:

import os
import torch.distributed

os.environ['LOCAL_RANK'] = os.environ['SLURM_LOCALID']
os.environ['RANK'] = os.environ['SLURM_PROCID']
os.environ['WORLD_SIZE'] = os.environ['SLURM_NTASKS']
os.environ['MASTER_ADDR'] = my_host_name  # hostname of the single node
os.environ['MASTER_PORT'] = '32767'
torch.distributed.init_process_group(backend='nccl')
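
For reference, my understanding is that each rank should pin itself to exactly one GPU before doing any CUDA work, roughly like the sketch below (it assumes the environment variables are set as above; torch.nn.Linear stands in for my actual model):

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

local_rank = int(os.environ['LOCAL_RANK'])
# Bind this rank to its own GPU before any CUDA work, so each process
# only ever opens a CUDA context on one device.
torch.cuda.set_device(local_rank)
dist.init_process_group(backend='nccl')

# torch.nn.Linear stands in for the real network.
model = torch.nn.Linear(10, 10).cuda(local_rank)
model = DDP(model, device_ids=[local_rank])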

The job seems to run smoothly. Then I log in to that node and run nvidia-smi. I get:

Please ignore GPUs 4-9, as the job is not using them at all. However, on GPUs 0-3 there are 16 processes, 4 on each GPU, and 3 out of those 4 show zero usage! I would expect 1 process per GPU, i.e. 4 processes in total.
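
To match the PIDs shown by nvidia-smi to specific workers, each rank can print its own PID, e.g. with a debugging line like the following (a sketch, not something main.py currently does):

import os

# Emit this worker's PID alongside its Slurm ranks so the extra
# processes listed by nvidia-smi can be traced back to a particular task.
print(f"pid={os.getpid()} "
      f"rank={os.environ['SLURM_PROCID']} "
      f"local_rank={os.environ['SLURM_LOCALID']}")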

I also tried this on other nodes, some with better GPUs and network connections, and the observation above is not always reproducible: sometimes I do get 1 process per GPU.

Can anyone give some insight into what's happening?