Loading a model to CUDA is extremely slow on A100

Hi, I’m testing my code on a new A100 cluster; I was on a 3090 cluster before.

Everything is extremely slow on the A100. After staring at log files, I found that it takes about 5 minutes to load a model to CUDA on the A100, while the same step takes only a few seconds on the 3090.
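
For reference, the step I am timing is essentially the snippet below (a minimal sketch; `resnet50` is just a stand-in for my actual model, and any model shows the same slowdown):

```python
import time
import torch
import torchvision

# Any off-the-shelf model reproduces it; resnet50 is only a placeholder
# for the model I actually use.
model = torchvision.models.resnet50()

t0 = time.perf_counter()
model = model.to("cuda")   # this single call takes ~5 minutes on the A100
torch.cuda.synchronize()   # make sure all CUDA work has finished before timing
print(f"model.to('cuda') took {time.perf_counter() - t0:.1f} s")
```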

Is there something wrong with my environment?

Below is my environment:
PyTorch version: 2.0.0+cu117
CUDA used to build PyTorch: 11.7

OS: Ubuntu 22.04.2 LTS (x86_64)
GCC version: (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0
Libc version: glibc-2.35

Python version: 3.9.12 (main, Apr 5 2022, 06:56:58) [GCC 7.5.0] (64-bit runtime)
Python platform: Linux-5.15.0-60-generic-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 12.0.140
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA A100-SXM4-80GB

Nvidia driver version: 525.60.13
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Are you able to compile and run the CUDA samples in both the 3090 and A100 environments? As a quick sanity check, I would measure the host-to-device bandwidth on both setups (the bandwidthTest sample from the CUDA samples works for this) and see whether anything looks off there:
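
If you'd rather stay in Python, here is a rough PyTorch equivalent (a minimal sketch, assuming a single visible GPU; absolute numbers depend on the PCIe/NVLink topology, but the two machines should be in the same ballpark):

```python
import time
import torch

assert torch.cuda.is_available()

# The first CUDA call pays the context-creation / module-loading cost,
# so time it separately -- on a healthy setup it takes a few seconds at most.
t0 = time.perf_counter()
torch.zeros(1, device="cuda")
torch.cuda.synchronize()
print(f"CUDA init + first kernel: {time.perf_counter() - t0:.2f} s")

# Time a pinned-memory host-to-device copy of 1 GiB.
n_bytes = 1 << 30
host = torch.empty(n_bytes, dtype=torch.uint8, pin_memory=True)
dev = torch.empty(n_bytes, dtype=torch.uint8, device="cuda")

torch.cuda.synchronize()
t0 = time.perf_counter()
dev.copy_(host, non_blocking=True)
torch.cuda.synchronize()
elapsed = time.perf_counter() - t0
print(f"H2D bandwidth: {n_bytes / elapsed / 1e9:.1f} GB/s")
```

If either the init time or the bandwidth is wildly different between the two machines, that would point at the driver/runtime setup rather than your code.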