Data copy between GPUs failed (Tesla A100, CUDA 11.1, cuDNN 8.1.0, PyTorch 1.8)

Hi, everyone. The configuration I use is Tesla A100, CUDA 11.0, cuDNN 8.0.4, PyTorch 1.7.1. But when I run the following code:

import torch
a = torch.randn(2, 3, device='cuda:0')
print(a)
b = a.to('cuda:1')
print(b)

The output is:

tensor([[-2.0747, -0.5964, -0.3556],
        [-0.5845,  0.3389, -0.3725]], device='cuda:0')

tensor([[0., 0., 0.],
        [0., 0., 0.]], device='cuda:1')

I am now trying to confirm whether this problem is caused at the software level (PyTorch, CUDA, cuDNN) or whether there is something wrong with the GPU installation. Thanks!

Could you update to the latest stable release (1.8.0) and rerun the code?
Such an error might be software- or hardware-related, and it’s hard to tell just from these symptoms.

@ptrblck, thank you for your answer. I have now updated to the latest stable release (1.8.0) and run the following code:

import torch
import torch.utils
import torch.utils.cpp_extension

print(torch.__version__)
print(torch.cuda.is_available())
print(torch.cuda.device_count())
print(torch.cuda.get_device_name(0))
print(torch.utils.cpp_extension.CUDA_HOME)

a = torch.randn(2, 3, device='cuda:0')
print(a)
b = a.to('cuda:1')
print(b)

The output is:

1.8.0+cu111
True
8
A100-PCIE-40GB
/usr/local/cuda-11.1
tensor([[-0.9957, -0.9985,  1.1794],
        [-1.4586, -0.0102, -0.0106]], device='cuda:0')
tensor([[0., 0., 0.],
        [0., 0., 0.]], device='cuda:1')

Below is my driver configuration:

And now I use CUDA 11.1 and cuDNN 8.1.0.

Do you have any more suggestions to help me troubleshoot the cause (software and hardware) of this problem? Thank you very much.

I tried it on a system with 8x A100s and got a valid output:

1.8.0
True
8
A100-SXM4-40GB
/usr/local/cuda
tensor([[ 0.0440,  1.6352, -0.7515],
        [ 1.6543, -0.5374, -0.8127]], device='cuda:0')
tensor([[ 0.0440,  1.6352, -0.7515],
        [ 1.6543, -0.5374, -0.8127]], device='cuda:1')

Used driver: 460.32.03.
Could you check dmesg for any Xids and post them here?

@ptrblck, I don’t know how to check dmesg for any Xids :sweat_smile:. When I run dmesg, it outputs a lot of information, and Xid does not appear in it. Should I post all of it here? And what command should I run? Thanks!

You can use dmesg -T | grep xid and check for the most recent one (if there are any).
Don’t post the complete log here, as it should be quite long :wink:

@ptrblck, after running dmesg -T | grep xid, it outputs nothing :joy:. Will the difference between A100-PCIE-40GB and A100-SXM4-40GB lead to this problem?

And I found a new and interesting phenomenon. When I run the following code:

import torch

a = torch.randn(2, 3, device='cuda:0')
print(a)
b = torch.ones(2, 3, device='cuda:1')
print(b)
a1 = a.to('cuda:1')
print(a1)
b1 = b.to('cuda:0')
print(b1)
print(a1 is b)
print(b1 is a)

The output is:

tensor([[-1.7723, -0.0860, -1.6667],
        [ 3.1190, -0.2531, -0.7271]], device='cuda:0')
tensor([[1., 1., 1.],
        [1., 1., 1.]], device='cuda:1')
tensor([[1., 1., 1.],
        [1., 1., 1.]], device='cuda:1')
tensor([[-1.7723, -0.0860, -1.6667],
        [ 3.1190, -0.2531, -0.7271]], device='cuda:0')
False
False

The value of a1 equals b and the value of b1 equals a; each cross-device copy seems to end up with whatever data was already resident on the destination GPU instead of the source data. I wonder if this phenomenon can provide new clues.
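
A minimal diagnostic sketch for this phenomenon, assuming torch.cuda.can_device_access_peer is available in this PyTorch build: it compares the direct device-to-device copy against a copy staged through host memory, which bypasses the P2P path.

import torch

# Does CUDA report that cuda:0 can access cuda:1 via peer-to-peer?
print(torch.cuda.can_device_access_peer(0, 1))

a = torch.randn(2, 3, device='cuda:0')

# Direct device-to-device copy (may use the P2P path).
direct = a.to('cuda:1')

# Copy staged through host memory, which avoids the P2P path.
staged = a.cpu().to('cuda:1')

torch.cuda.synchronize('cuda:0')
torch.cuda.synchronize('cuda:1')

print(torch.equal(a.cpu(), direct.cpu()))   # False would implicate the direct path
print(torch.equal(a.cpu(), staged.cpu()))   # True would show the staged copy is fine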

@ptrblck, I just ran the CUDA sample p2pBandwidthLatencyTest, which demonstrates peer-to-peer (P2P) data transfers between pairs of GPUs and computes latency and bandwidth. It outputs:

[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, A100-PCIE-40GB, pciBusID: 1, pciDeviceID: 0, pciDomainID:0
Device: 1, A100-PCIE-40GB, pciBusID: 24, pciDeviceID: 0, pciDomainID:0
Device: 2, A100-PCIE-40GB, pciBusID: 41, pciDeviceID: 0, pciDomainID:0
Device: 3, A100-PCIE-40GB, pciBusID: 61, pciDeviceID: 0, pciDomainID:0
Device: 4, A100-PCIE-40GB, pciBusID: 81, pciDeviceID: 0, pciDomainID:0
Device: 5, A100-PCIE-40GB, pciBusID: a1, pciDeviceID: 0, pciDomainID:0
Device: 6, A100-PCIE-40GB, pciBusID: c1, pciDeviceID: 0, pciDomainID:0
Device: 7, A100-PCIE-40GB, pciBusID: e1, pciDeviceID: 0, pciDomainID:0
Device=0 CAN Access Peer Device=1
Device=0 CAN Access Peer Device=2
Device=0 CAN Access Peer Device=3
Device=0 CAN Access Peer Device=4
Device=0 CAN Access Peer Device=5
Device=0 CAN Access Peer Device=6
Device=0 CAN Access Peer Device=7
Device=1 CAN Access Peer Device=0
Device=1 CAN Access Peer Device=2
Device=1 CAN Access Peer Device=3
Device=1 CAN Access Peer Device=4
Device=1 CAN Access Peer Device=5
Device=1 CAN Access Peer Device=6
Device=1 CAN Access Peer Device=7
Device=2 CAN Access Peer Device=0
Device=2 CAN Access Peer Device=1
Device=2 CAN Access Peer Device=3
Device=2 CAN Access Peer Device=4
Device=2 CAN Access Peer Device=5
Device=2 CAN Access Peer Device=6
Device=2 CAN Access Peer Device=7
Device=3 CAN Access Peer Device=0
Device=3 CAN Access Peer Device=1
Device=3 CAN Access Peer Device=2
Device=3 CAN Access Peer Device=4
Device=3 CAN Access Peer Device=5
Device=3 CAN Access Peer Device=6
Device=3 CAN Access Peer Device=7
Device=4 CAN Access Peer Device=0
Device=4 CAN Access Peer Device=1
Device=4 CAN Access Peer Device=2
Device=4 CAN Access Peer Device=3
Device=4 CAN Access Peer Device=5
Device=4 CAN Access Peer Device=6
Device=4 CAN Access Peer Device=7
Device=5 CAN Access Peer Device=0
Device=5 CAN Access Peer Device=1
Device=5 CAN Access Peer Device=2
Device=5 CAN Access Peer Device=3
Device=5 CAN Access Peer Device=4
Device=5 CAN Access Peer Device=6
Device=5 CAN Access Peer Device=7
Device=6 CAN Access Peer Device=0
Device=6 CAN Access Peer Device=1
Device=6 CAN Access Peer Device=2
Device=6 CAN Access Peer Device=3
Device=6 CAN Access Peer Device=4
Device=6 CAN Access Peer Device=5
Device=6 CAN Access Peer Device=7
Device=7 CAN Access Peer Device=0
Device=7 CAN Access Peer Device=1
Device=7 CAN Access Peer Device=2
Device=7 CAN Access Peer Device=3
Device=7 CAN Access Peer Device=4
Device=7 CAN Access Peer Device=5
Device=7 CAN Access Peer Device=6

***NOTE: In case a device doesn't have P2P access to other one, it falls back to normal memcopy procedure.
So you can see lesser Bandwidth (GB/s) and unstable Latency (us) in those cases.

P2P Connectivity Matrix
     D\D     0     1     2     3     4     5     6     7
     0       1     1     1     1     1     1     1     1
     1       1     1     1     1     1     1     1     1
     2       1     1     1     1     1     1     1     1
     3       1     1     1     1     1     1     1     1
     4       1     1     1     1     1     1     1     1
     5       1     1     1     1     1     1     1     1
     6       1     1     1     1     1     1     1     1
     7       1     1     1     1     1     1     1     1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3      4      5      6      7
     0 1154.84  14.60  14.67  14.33  15.10  15.62  15.94  17.48
     1  21.10 1290.26  15.47  15.56  15.70  15.75  16.02  17.27
     2  15.24  15.12 1177.47  14.75  15.01  15.67  15.89  17.55
     3  15.13  14.92  15.05 1167.79  16.10  16.14  16.37  17.47
     4  15.11  15.22  15.12  21.26 1292.39  17.51  18.07  15.18
     5  15.15  15.27  15.11  18.93  21.74 1292.39  18.05  16.26
     6  15.02  15.60  19.99  21.53  21.61  21.34 1291.32  15.86
     7  14.74  17.48  21.25  21.47  21.58  21.34  21.59 1295.61
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
   D\D     0      1      2      3      4      5      6      7
     0 1156.55   2.11   2.78   2.78   2.78   2.78   2.78   2.78
     1   1.24 1291.32   1.62   1.63   1.63   1.33   1.62   1.62
     2   2.40   2.78 1293.46   2.78   2.78   2.78   2.78   2.78
     3   2.41   2.17   2.78 1293.46   2.78   2.78   2.78   2.78
     4   2.14   2.44   2.78   2.78 1293.46   2.78   2.78   2.78
     5   2.33   2.17   2.78   2.78   2.78 1291.32   2.78   2.78
     6   2.40   2.44   2.78   2.78   2.44   2.78 1293.46   2.78
     7   2.14   2.78   2.78   2.78   2.78   2.78   2.78 1293.46
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3      4      5      6      7
     0 1180.58  20.51  19.38  19.44  20.11  20.03  20.32  23.85
     1  23.52 526.72  19.13  19.57  19.82  19.82  20.03  29.41
     2  19.88  18.65 1204.24  18.97  19.56  19.41  19.60  29.57
     3  19.53  20.43  19.69 1291.32  20.39  20.29  20.53  23.41
     4  19.79  20.58  21.06  22.56 1306.98  21.44  21.89  20.29
     5  19.89  20.85  20.41  22.24  30.66 1308.08  21.96  21.02
     6  19.60  21.33  19.79  22.55  30.23  30.73 1308.63  20.59
     7  20.16  29.87  21.30  23.69  22.56  25.27  23.09 1306.98
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3      4      5      6      7
     0 1184.16   3.25   5.43   4.79   4.28   4.83   4.80   4.81
     1   3.25 487.44   2.66   3.24   3.25   3.25   1.87   3.24
     2   4.70   3.24 1309.17   4.82   4.28   4.32   4.80   4.83
     3   3.87   3.23   5.57 1304.26   4.28   4.83   4.29   3.87
     4   3.87   2.65   4.23   5.57 1308.63   4.28   5.45   4.81
     5   4.73   2.65   4.77   4.28   3.90 1308.63   4.80   4.24
     6   4.20   3.23   4.79   4.24   4.28   4.28 1309.72   4.79
     7   4.28   2.65   4.29   5.43   4.83   4.79   4.29 1304.80
P2P=Disabled Latency Matrix (us)
   GPU     0      1      2      3      4      5      6      7
     0   4.34  23.57  23.59  23.67  22.00  21.84  22.72  21.99
     1  23.53   2.31  21.76  21.65  15.04  18.91  19.76  18.93
     2  23.57  14.54   2.37  20.29  21.53  15.93  18.08  15.33
     3  23.58  21.56  21.56   2.33  16.20  20.58  21.47  20.71
     4  22.91  15.80  15.57  19.92   2.38  20.55  20.56  18.27
     5  22.50  17.00  19.01  17.91  20.55   2.38  20.58  20.54
     6  22.93  14.91  20.48  12.59  20.56  20.58   2.30  19.28
     7  22.30  21.55  13.68  21.55  21.55  21.55  21.55   2.50

   CPU     0      1      2      3      4      5      6      7
     0   3.79  11.84  12.26  11.93  10.96  10.78  10.57  11.15
     1  12.22   3.54  11.48  11.29  10.21  10.22   9.91  10.45
     2  12.05  11.31   3.59  11.17  10.09  10.12   9.83  10.39
     3  11.93  11.12  11.06   3.53  10.00  10.05   9.78  10.21
     4  11.18  10.38  10.46  10.22   3.24   9.45   9.19   9.56
     5  11.12  10.40  10.50  10.28   9.38   3.15   9.21   9.59
     6  10.95  10.18  10.29  10.10   9.18   9.25   3.08   9.46
     7  11.62  10.76  10.86  10.51   9.54   9.53   9.33   3.77
P2P=Enabled Latency (P2P Writes) Matrix (us)
   GPU     0      1      2      3      4      5      6      7
     0   4.35 49204.95 49204.80 49204.80 49204.78 49204.77 49204.79 49204.79
     1 49205.36   2.31 49204.98 49204.97 49205.17 49204.88 49205.20 49205.15
     2 49204.99 49204.85   2.34 49204.81 49204.84 49204.82 49204.76 49204.83
     3 49205.01 49204.90 49204.84   2.33 49204.84 49204.89 49204.86 49204.79
     4 49204.87 49204.77 49204.73 49204.75   2.39 49204.68 49204.75 49204.73
     5 49204.92 49204.78 49204.81 49204.81 49204.78   2.38 49204.80 49204.76
     6 49205.02 49204.82 49204.87 49204.86 49204.85 49204.83   2.29 49204.85
     7 49205.04 49204.88 49204.90 49204.86 49204.82 49204.83 49204.85   2.47

   CPU     0      1      2      3      4      5      6      7
     0   5.10   2.95   3.02   3.08   3.01   3.03   3.01   2.97
     1   2.98   3.78   2.75   2.85   2.87   2.86   2.86   2.91
     2   2.96   2.77   3.81   2.81   3.05   2.83   2.98   2.72
     3   3.02   2.84   2.77   3.72   2.83   2.86   2.83   2.80
     4   2.82   2.60   2.53   2.62   3.38   2.59   2.63   2.53
     5   2.53   2.56   2.53   2.57   2.43   3.46   2.49   2.38
     6   2.66   2.66   2.55   2.51   2.74   2.67   6.27   4.33
     7   2.70   2.62   2.56   2.58   2.67   2.66   2.57   4.59

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

The P2P=Enabled bandwidth (around 2 GB/s) and latency (around 49,000 us) look far worse than the P2P=Disabled numbers. I wonder if these results can provide new clues.

Thanks for the input.
I’ll follow up with internal teams on what the next debug steps are.
It seems that the communication is somehow failing, but I’m not sure what might be causing this.

Could you post more information about your system? I.e. is it a custom workstation or a specific pre-configured server?

EDIT:
Are you seeing the same issue using different devices?
E.g. cuda:0 to cuda:2 etc.?
Also, for the failing case, could you add torch.cuda.synchronize() before printing b?

I also cannot reproduce this issue on a node with 8x A100 PCIE.
Could you check if ACS is turned off via:

lspci -vvvv | grep ACSCtl:

It should return a lot of ACSCtl: SrcValid-, where SrcValid- shows that ACS is indeed turned off.
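
A quick way to tally that output, sketched here with Python’s subprocess module (assuming lspci is on the PATH and is run with enough privileges to read the ACS capability registers):

import subprocess

# Count PCI functions reporting ACS source validation enabled (SrcValid+)
# versus disabled (SrcValid-).
out = subprocess.run(['lspci', '-vvvv'], capture_output=True, text=True).stdout
acs_lines = [line for line in out.splitlines() if 'ACSCtl:' in line]
enabled = sum('SrcValid+' in line for line in acs_lines)
disabled = sum('SrcValid-' in line for line in acs_lines)
print(f'SrcValid+ (ACS on): {enabled}, SrcValid- (ACS off): {disabled}')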

@ptrblck, Thank you very much for your kind help :grinning:.

1. For the following question:

Could you post more information about your system? I.e. is it a custom workstation or a specific pre-configured server?

It's a custom workstation, which has 2 AMD CPUs (EPYC 7742) supporting PCIe 4.0 and 1 TB of memory. The Lnk and Sta info is below:

2. For the following question:

Are you seeing the same issue using different devices? E.g. cuda:0 to cuda:2 etc.?

When I run the following code:

import torch
a = torch.randn(2, 3, device='cuda:0')
print(a)
b = a.to('cuda:2')
print(b)

The output is:

tensor([[-0.2846,  0.5795,  1.2842],
        [ 0.3382, -0.4902, -0.8187]], device='cuda:0')
tensor([[0., 0., 0.],
        [0., 0., 0.]], device='cuda:2')

3. For the following question:

Also, for the failing case, could you add torch.cuda.synchronize() before printing b?

When I run the following code:

import torch
a = torch.randn(2, 3, device='cuda:0')
print(a)
b = a.to('cuda:2')
torch.cuda.synchronize()
print(b)

The output is:

tensor([[ 1.3674, -1.1252, -0.1123],
        [ 0.4165,  0.7612,  0.4003]], device='cuda:0')
tensor([[0., 0., 0.],
        [0., 0., 0.]], device='cuda:2')

4. When I run lspci -vvvv | grep ACSCtl:, the output is:

		ACSCtl:	SrcValid+ TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid+ TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid+ TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid+ TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid+ TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid+ TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid+ TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid+ TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid+ TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid+ TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid+ TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid+ TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid+ TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid+ TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid+ TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid+ TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid+ TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid+ TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid+ TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid+ TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid+ TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid+ TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
pcilib: sysfs_read_vpd: read failed: Input/output error
pcilib: sysfs_read_vpd: read failed: Input/output error
pcilib: sysfs_read_vpd: read failed: Input/output error
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid+ TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid+ TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
pcilib: sysfs_read_vpd: read failed: Input/output error
		ACSCtl:	SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-

There are some pcilib: sysfs_read_vpd: read failed: Input/output error messages.

Thank you! I’ll forward it to our internal team to discuss it further.

Thank you very much and look forward to your reply :relaxed:.

@ptrblck, any progress so far? I look forward to your reply.

No, we would have to wait until work starts again on Monday. :wink:

Sorry, it is Monday now in China :joy:. Have a great weekend.

Sorry for the late reply, but I just revisited this topic by chance and took another look at the last outputs.
Could you check the NCCL troubleshooting guide and disable ACS?
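
After ACS has been disabled (via the BIOS or the steps in the NCCL troubleshooting guide), a minimal re-check of the original repro might look like the sketch below; comparing CPU copies with torch.equal confirms the transfer actually carried the data.

import torch

# Re-run the original repro and verify the copy on the CPU side.
a = torch.randn(2, 3, device='cuda:0')
b = a.to('cuda:1')
torch.cuda.synchronize()
print(torch.equal(a.cpu(), b.cpu()))  # expected to be True once the P2P copy works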