Why is PyTorch 1.7 with CUDA 10.1 not compatible with the NVIDIA A100 (Ampere architecture), given the PTX compatibility principle?

According to NVIDIA's official documentation, a CUDA application built to include PTX is forward-compatible: PTX can run on any GPU whose compute capability is higher than the one assumed when that PTX was generated. So I tried to find out whether torch-1.7.0+cu101 was compiled with PTX. As far as I can tell from PyTorch's CMakeLists.txt, PyTorch is compiled with the nvcc flag “-gencode=arch=compute_xx,code=sm_xx”. I think this flag means that the compiled product contains PTX. However, whenever I try to use PyTorch 1.7 with CUDA 10.1 on an A100, I always get an error.

>>> import torch
>>> torch.zeros(1).cuda()
/data/miniconda3/lib/python3.7/site-packages/torch/cuda/__init__.py:104: UserWarning: 
A100-SXM4-40GB with CUDA capability sm_80 is not compatible with the current PyTorch installation.
The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_70 sm_75.
If you want to use the A100-SXM4-40GB GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/

warnings.warn(incompatible_device_warn.format(device_name, capability, " ".join(arch_list), device_name))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/data/miniconda3/lib/python3.7/site-packages/torch/tensor.py", line 179, in __repr__
  return torch._tensor_str._str(self)
File "/data/miniconda3/lib/python3.7/site-packages/torch/_tensor_str.py", line 372, in _str
  return _str_intern(self)
File "/data/miniconda3/lib/python3.7/site-packages/torch/_tensor_str.py", line 352, in _str_intern
  tensor_str = _tensor_str(self, indent)
File "/data/miniconda3/lib/python3.7/site-packages/torch/_tensor_str.py", line 241, in _tensor_str
  formatter = _Formatter(get_summarized_data(self) if summarize else self)
File "/data/miniconda3/lib/python3.7/site-packages/torch/_tensor_str.py", line 89, in __init__
  nonzero_finite_vals = torch.masked_select(tensor_view, torch.isfinite(tensor_view) & tensor_view.ne(0))
RuntimeError: CUDA error: no kernel image is available for execution on the device

So I really want to know why the “PTX compatibility principle” does not apply to PyTorch. Other answers only say to use CUDA 11 or higher, and I know that works. But they don't tell me the real reason why PyTorch built for CUDA 10.1 does not work on the A100. I tried the CUDA 10.1 samples from the toolkit, and these small demo applications actually work.

[Matrix Multiply Using CUDA] - Starting...
MapSMtoCores for SM 8.0 is undefined.  Default to use 64 Cores/SM
GPU Device 0: "A100-SXM4-40GB" with compute capability 8.0

MatrixA(320,320), MatrixB(640,320)
Computing result using CUDA Kernel...
Performance= 4286.91 GFlop/s, Time= 0.031 msec, Size= 131072000 Ops, WorkgroupSize= 1024 threads/block
Checking computed result for correctness: Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

If anyone could help me with an answer I would be very grateful.

If I’m not mistaken, the pip wheels do not use +PTX while the conda binaries might add it.
Nevertheless, if you are using the CUDA 10 binaries, they also ship with e.g. cuDNN 7, which isn't compatible with your A100, so the right approach is to use a supported toolkit, starting with CUDA 11.0.
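One way to see whether a given build relies on native kernels or on +PTX is to look at the architecture list it was compiled for. The sketch below is only an illustration of the general CUDA rule as I understand it: an sm_XY binary runs on devices with the same major version and an equal or higher minor version, while compute_XY (PTX) can be JIT-compiled for any equal or higher capability. The function name how_device_can_run is made up for this example, and special cases such as architecture-specific variants are ignored.

```python
import re
from typing import Iterable, Optional, Tuple

# Matches entries such as "sm_75" or "compute_75" (any trailing suffix is ignored).
_ARCH = re.compile(r"^(sm|compute)_(\d+)")


def how_device_can_run(arch_list: Iterable[str],
                       capability: Tuple[int, int]) -> Optional[str]:
    """Return "native", "ptx-jit", or None for a device capability.

    Hypothetical helper: applies the general binary/PTX compatibility
    rule to a list of arch strings; it is not part of PyTorch.
    """
    dev_major, dev_minor = capability
    ptx_ok = False
    for arch in arch_list:
        match = _ARCH.match(arch)
        if not match:
            continue
        kind = match.group(1)
        major, minor = divmod(int(match.group(2)), 10)
        # Native code: same major version, minor not newer than the device.
        if kind == "sm" and major == dev_major and minor <= dev_minor:
            return "native"
        # PTX: any capability at or below the device can be JIT-compiled.
        if kind == "compute" and (major, minor) <= (dev_major, dev_minor):
            ptx_ok = True
    return "ptx-jit" if ptx_ok else None


# Arch list taken from the warning in the question; A100 is capability (8, 0).
question_archs = ["sm_37", "sm_50", "sm_60", "sm_70", "sm_75"]
print(how_device_can_run(question_archs, (8, 0)))
print(how_device_can_run(question_archs + ["compute_75"], (8, 0)))

# With PyTorch installed and a GPU visible, the real values could be passed in:
# import torch
# print(how_device_can_run(torch.cuda.get_arch_list(),
#                          torch.cuda.get_device_capability(0)))
```

With the list from the warning there is no compute_ entry at all, which would be consistent with the “no kernel image is available” error on sm_80.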

Thank you for your reply.

After you mentioned that cuDNN may have compatibility issues with the compute capability, I read the official cuDNN documentation. It is true that cuDNN 7.6.5 does not support the NVIDIA Ampere architecture; Ampere is only supported by cuDNN 8 or higher.
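To check which cuDNN a given install ships with, the integer reported by torch.backends.cudnn.version() can be decoded into major, minor and patch. This is a rough sketch: decode_cudnn_version is a made-up helper, and the encodings (major*1000 + minor*100 + patch up to cuDNN 8, major*10000 + minor*100 + patch from cuDNN 9) are stated here as my assumption about how the number is built.

```python
from typing import Tuple


def decode_cudnn_version(version: int) -> Tuple[int, int, int]:
    """Split a cuDNN version integer into (major, minor, patch).

    Hypothetical helper; the two encodings below are assumptions.
    """
    if version >= 90000:
        # Assumed cuDNN 9+ encoding: major*10000 + minor*100 + patch.
        major, rest = divmod(version, 10000)
    else:
        # Assumed encoding up to cuDNN 8: major*1000 + minor*100 + patch.
        major, rest = divmod(version, 1000)
    minor, patch = divmod(rest, 100)
    return major, minor, patch


def cudnn_supports_ampere(version: int) -> bool:
    """Per the discussion above, Ampere needs cuDNN 8 or higher."""
    return decode_cudnn_version(version)[0] >= 8


print(decode_cudnn_version(7605), cudnn_supports_ampere(7605))

# With PyTorch installed (version() may return None if cuDNN is unavailable):
# import torch
# v = torch.backends.cudnn.version()
# if v is not None:
#     print(decode_cudnn_version(v), cudnn_supports_ampere(v))
```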

I feel that the compilation process and compatibility principles in PyTorch are really complex.

I also posted a question on Stack Overflow (cuda - Why Pytorch 1.7 with cuda10.1 cannot compatible with Nvidia A100 Ampere Architecture (according to PTX compatibilty pricinple) - Stack Overflow). However, one responder said that cuDNN has embedded PTX code and will run on newer hardware. So how can I check whether the conda binaries contain PTX?
I also tried compiling PyTorch from source with TORCH_CUDA_ARCH_LIST="6.0+PTX;7.0+PTX;7.5+PTX" USE_NCCL=0 USE_DISTRIBUTED=0 USE_CUDNN=10 python setup.py install, but after compiling, torch.zeros(1).cuda() still results in “RuntimeError: CUDA error: no kernel image is available for execution on the device”.
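For checking whether a binary actually contains PTX, one possible approach is to ask cuobjdump (from the CUDA toolkit) to list the embedded cubin and PTX entries of the shared library and collect the architectures from the file names. This is only a sketch: the helper names are made up, the listing format in the comment is what I assume cuobjdump prints, and the library path is a placeholder that depends on the install.

```python
import re
import shutil
import subprocess
from typing import Dict, Set

# Assumed listing lines, e.g.:
#   ELF file    1: libname.1.sm_70.cubin
#   PTX file    1: libname.1.sm_75.ptx
_ENTRY = re.compile(r"\.sm_(\d+)[a-z]?\.(ptx|cubin)\s*$")


def parse_cuobjdump_listing(text: str) -> Dict[str, Set[int]]:
    """Collect architectures per kind ("cubin" or "ptx") from listing text."""
    found: Dict[str, Set[int]] = {"ptx": set(), "cubin": set()}
    for line in text.splitlines():
        match = _ENTRY.search(line)
        if match:
            found[match.group(2)].add(int(match.group(1)))
    return found


def list_embedded_archs(library_path: str) -> Dict[str, Set[int]]:
    """Run cuobjdump on a library and parse its ELF and PTX listings."""
    if shutil.which("cuobjdump") is None:
        raise RuntimeError("cuobjdump not found on PATH")
    output = ""
    for flag in ("--list-elf", "--list-ptx"):
        # The return code is not checked: a library without PTX entries
        # should simply yield an empty set here.
        proc = subprocess.run(["cuobjdump", flag, library_path],
                              capture_output=True, text=True)
        output += proc.stdout
    return parse_cuobjdump_listing(output)


# Placeholder path; the actual location depends on the environment.
# print(list_embedded_archs("/path/to/site-packages/torch/lib/libtorch_cuda.so"))
```

If the "ptx" set comes back empty, the library has nothing the driver could JIT-compile for a newer major architecture such as sm_80.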

That is not true and I would refer to the official support matrix created by the cuDNN engineers as seen here.