isend/recv exchanges messages despite mismatched tags

Hello,
I am encountering strange behavior: messages get exchanged even though their tags mismatch.

Question

Why does dist.recv() consume the first message even though its tag obviously mismatches?
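To be clear about what I expected: MPI-style tag matching, where a recv with a given tag skips queued messages carrying other tags. A torch-free toy sketch of those semantics (the Mailbox class below is purely illustrative, not part of any real API):

```python
from collections import deque


class Mailbox:
    """Toy model of MPI-style point-to-point tag matching:
    recv(tag) skips queued messages whose tag differs."""

    def __init__(self):
        self._queue = deque()

    def send(self, payload, tag):
        self._queue.append((tag, payload))

    def recv(self, tag):
        # Scan in arrival order for the first message with a matching tag.
        for i, (t, payload) in enumerate(self._queue):
            if t == tag:
                del self._queue[i]
                return payload
        raise LookupError(f"no message with tag={tag}")


box = Mailbox()
box.send("A", tag=8)
box.send("B", tag=1)
box.send("C", tag=3)
print(box.recv(tag=1))  # prints "B" (the tag-matched message), not "A" (the first queued one)
```

In my minimal example below, the receiver instead gets the payload of the first send, as if the tag argument were ignored.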

Minimal Example

"""SEND RECV TEST WITH MULTIPLE TAGS """

import os
import torch
import horovod.torch as hvd
import torch.distributed as dist


if __name__ == '__main__':
    
    print('CUDA AVAILABLE:',torch.cuda.is_available())
    
    print('CUDA device count:',torch.cuda.device_count())
    
    neighbors_left = [0,1]
    neighbors_right = [1,2]
    
    DIR = os.getcwd()
    init_file = f'file://{DIR}{os.sep}sharedfile'
    
    hvd.init()
    rank = hvd.local_rank()
    torch.cuda.set_device(rank)
    print('Current Device:',torch.cuda.current_device())
    
    device = torch.device('cuda')
    dist.init_process_group(backend="nccl", init_method=init_file,rank=rank, world_size=hvd.size())
    
    left = rank - 1
    right = rank + 1
    
    a = torch.ones((3,4)) 
    
    a[:,0] = a[:,0]*((rank+1) * 100)
    a[:,-1] = a[:,-1]*((rank+1) * 10)
    
    a = a.to(device, non_blocking=True)
    send_buff = a[:,0].contiguous()
    recv_buff = a[:,-1].contiguous()
    for i in range(hvd.size()):
        if rank == i:
             print(f'{rank=}: {a=}')  
        dist.barrier()
   
    
    dist.barrier()
    sends = []
    print(f'{hvd.local_rank()=} : {dist.get_rank()=}')
    if left >= 0:
        print(f'{hvd.local_rank()=} sends to {left=}')
        sends.append(dist.isend(send_buff+8, left, tag=8))
        sends.append(dist.isend(send_buff, left, tag=1))
        sends.append(dist.isend(send_buff+3, left, tag=3))
    
        for s in sends:
            s.wait()

    if right < hvd.size():
        print(f'{hvd.local_rank()=} receives from {right=}')
        dist.recv(recv_buff, right, tag=1)
        a[:,-1] = recv_buff
    
    dist.barrier()
    for i in range(hvd.size()):
        if rank == i:
             print(f'{rank=}: {a=}')  
        
        dist.barrier() 

The example is launched with: horovodrun -np 4 python ./torch_dist_sendleft.py

Expected Output

[1,0]<stdout>:RANK: 0 A' = 
[1,0]<stdout>: tensor([[100.,   1.,   1., 200.],
[1,0]<stdout>:        [100.,   1.,   1., 200.],
[1,0]<stdout>:        [100.,   1.,   1., 200.]], device='cuda:0')
[1,1]<stdout>:RANK: 1 A' = 
[1,1]<stdout>: tensor([[200.,   1.,   1., 300.],
[1,1]<stdout>:        [200.,   1.,   1., 300.],
[1,1]<stdout>:        [200.,   1.,   1., 300.]], device='cuda:1')
[1,2]<stdout>:RANK: 2 A' = 
[1,2]<stdout>: tensor([[300.,   1.,   1., 400.],
[1,2]<stdout>:        [300.,   1.,   1., 400.],
[1,2]<stdout>:        [300.,   1.,   1., 400.]], device='cuda:2')
[1,3]<stdout>:RANK: 3 A' = 
[1,3]<stdout>: tensor([[400.,   1.,   1.,  40.],
[1,3]<stdout>:        [400.,   1.,   1.,  40.],
[1,3]<stdout>:        [400.,   1.,   1.,  40.]], device='cuda:3')

Current Output (with logs)

Module GCC/10.3.0 and 3 dependencies loaded.
Module CUDA/11.3.1 loaded.
Module NCCL/2.9.9-CUDA-11.3.1 loaded.
Module OpenMPI/4.1.1 and 10 dependencies loaded.
Module Python/3.9.5-bare and 6 dependencies loaded.
 Running on multiple GPU devices on single node

 Run started at:- 
Sa 14. Mai 14:18:52 CEST 2022
[1,0]<stdout>:CUDA AVAILABLE: True
[1,0]<stdout>:CUDA device count: 4
[1,2]<stdout>:CUDA AVAILABLE: True
[1,2]<stdout>:CUDA device count: 4
[1,3]<stdout>:CUDA AVAILABLE: True
[1,3]<stdout>:CUDA device count: 4
[1,1]<stdout>:CUDA AVAILABLE: True
[1,1]<stdout>:CUDA device count: 4
[1,3]<stdout>:Current Device: 3
[1,1]<stdout>:Current Device: 1
[1,0]<stdout>:Current Device: 0
[1,2]<stdout>:Current Device: 2
[1,3]<stdout>:rank=3: a=tensor([[400.,   1.,   1.,  40.],
[1,3]<stdout>:        [400.,   1.,   1.,  40.],
[1,3]<stdout>:        [400.,   1.,   1.,  40.]], device='cuda:3')
[1,0]<stdout>:rank=0: a=tensor([[100.,   1.,   1.,  10.],
[1,0]<stdout>:        [100.,   1.,   1.,  10.],
[1,0]<stdout>:        [100.,   1.,   1.,  10.]], device='cuda:0')
[1,2]<stdout>:rank=2: a=tensor([[300.,   1.,   1.,  30.],
[1,2]<stdout>:        [300.,   1.,   1.,  30.],
[1,2]<stdout>:        [300.,   1.,   1.,  30.]], device='cuda:2')
[1,1]<stdout>:rank=1: a=tensor([[200.,   1.,   1.,  20.],
[1,1]<stdout>:        [200.,   1.,   1.,  20.],
[1,1]<stdout>:        [200.,   1.,   1.,  20.]], device='cuda:1')
[1,2]<stdout>:hvd.local_rank()=2 : dist.get_rank()=2
[1,2]<stdout>:hvd.local_rank()=2 sends to left=1
[1,3]<stdout>:hvd.local_rank()=3 : dist.get_rank()=3
[1,3]<stdout>:hvd.local_rank()=3 sends to left=2
[1,1]<stdout>:hvd.local_rank()=1 : dist.get_rank()=1
[1,1]<stdout>:hvd.local_rank()=1 sends to left=0
[1,0]<stdout>:hvd.local_rank()=0 : dist.get_rank()=0
[1,0]<stdout>:hvd.local_rank()=0 receives from right=1
[1,1]<stdout>:hvd.local_rank()=1 receives from right=2
[1,2]<stdout>:hvd.local_rank()=2 receives from right=3
[1,0]<stdout>:RANK: 0 A' = 
[1,0]<stdout>: tensor([[100.,   1.,   1., 208.],
[1,0]<stdout>:        [100.,   1.,   1., 208.],
[1,0]<stdout>:        [100.,   1.,   1., 208.]], device='cuda:0')
[1,1]<stdout>:RANK: 1 A' = 
[1,1]<stdout>: tensor([[200.,   1.,   1., 308.],
[1,1]<stdout>:        [200.,   1.,   1., 308.],
[1,1]<stdout>:        [200.,   1.,   1., 308.]], device='cuda:1')
[1,2]<stdout>:RANK: 2 A' = 
[1,2]<stdout>: tensor([[300.,   1.,   1., 408.],
[1,2]<stdout>:        [300.,   1.,   1., 408.],
[1,2]<stdout>:        [300.,   1.,   1., 408.]], device='cuda:2')
[1,3]<stdout>:RANK: 3 A' = 
[1,3]<stdout>: tensor([[400.,   1.,   1.,  40.],
[1,3]<stdout>:        [400.,   1.,   1.,  40.],
[1,3]<stdout>:        [400.,   1.,   1.,  40.]], device='cuda:3')
Run completed at:- 
Sa 14. Mai 14:19:12 CEST 2022