Recently our lab set up a new machine with an RTX 3090. I installed GPU driver 460.32 and CUDA 11.2, and installed PyTorch through

```
conda install pytorch torchvision torchaudio cudatoolkit=11.0 -c pytorch
```
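A quick sanity check of the install looks like this (the CUDA-specific lines only run when a GPU is visible; the values shown in comments are what I would expect on this machine, not verified output):

```python
import torch

# Versions PyTorch itself reports, independent of what nvidia-smi shows.
print(torch.__version__)    # e.g. 1.7.1
print(torch.version.cuda)   # CUDA version PyTorch was built against, e.g. 11.0

if torch.cuda.is_available():
    # Compute capability of GPU 0; an RTX 3090 reports (8, 6).
    print(torch.cuda.get_device_name(0))
    print(torch.cuda.get_device_capability(0))
```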
Then I tested it with a seq2seq model (LSTM encoder -> LSTM decoder) that I had used before; training was very fast and everything worked fine.
However, when I use this machine to train a TextCNN classification model, it is much slower than on my laptop with a GTX 1660 Ti (CUDA 10.2 + torch 1.7.0).
There is no error message during training, though. To find where the time goes, I added some timers to the program and found the difference:
```python
for step, (x_batch, y_batch) in enumerate(train_loader):
    t1 = time.clock()
    x_batch = x_batch.to(device)
    y_batch = y_batch.to(device)
    t2 = time.clock()
    print('transfer data needs %s ms' % ((t2 - t1) * 1000))
    output = model(x_batch)
    # training steps...
    t3 = time.clock()
    print('training step uses %s ms' % ((t3 - t2) * 1000))
```
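For what it's worth, CUDA kernel launches are asynchronous, so a host-side timer can charge still-running GPU work to whatever call happens to block next (often the `.to(device)` copy). A sketch of the same timing with explicit synchronization points, using `time.perf_counter` (`time.clock()` is deprecated and was removed in Python 3.8); the `synchronize()` calls are skipped when no GPU is visible:

```python
import time
import torch

def timed_step(model, x_batch, y_batch, device):
    """Time the host-to-device copy and the forward pass with explicit syncs."""
    if torch.cuda.is_available():
        torch.cuda.synchronize()   # drain any GPU work queued by earlier steps
    t1 = time.perf_counter()
    x_batch = x_batch.to(device)
    y_batch = y_batch.to(device)
    if torch.cuda.is_available():
        torch.cuda.synchronize()   # make sure the copy has actually finished
    t2 = time.perf_counter()
    output = model(x_batch)
    if torch.cuda.is_available():
        torch.cuda.synchronize()   # wait for the forward kernels to complete
    t3 = time.perf_counter()
    print('transfer: %.3f ms, forward: %.3f ms'
          % ((t2 - t1) * 1000, (t3 - t2) * 1000))
    return output
```

If the transfer time collapses once synchronization is added, the copy itself was never the bottleneck; the unsynchronized timer was absorbing unfinished kernels from the previous step.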
On my laptop, the data transfer takes just 0.06 ms and the subsequent training steps take about 3.5-4.5 ms.
On the 3090 machine, however, the data transfer takes about 102 ms, while the training steps take only 2.3 ms.
I also tried the same training on my lab's old server with an RTX 2080 (CUDA 11.0, torch 1.7.1): data transfer 0.03 ms, training steps 1.9 ms. (P.S. I also noticed that on the 2080 machine the 'GPU-Util' column of nvidia-smi stayed around 54% for the entire training run, while the 3090 machine sat at 100% from the start.)
The code on these 3 machines is exactly the same, so why is tensor.to(device) so slow on the 3090 machine?
(By the way, I also tried a 3-layer MLP GAN that works fine on my laptop and on the old 2080 server, too... but again, tensor.to(device) on the 3090 is slow...)
The data loading code for Seq2Seq and TextCNN is also almost identical:

```python
train = torch.utils.data.TensorDataset(x_train, y_train)
train_loader = DataLoader(train, batch_size=BATCH_SIZE, shuffle=True)
```
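In case it matters, the usual way to speed up host-to-device copies with this kind of loader is pinned memory plus non-blocking transfers. A minimal sketch, with placeholder shapes and batch size (not my real data):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

x_train = torch.randn(1024, 128)          # placeholder data
y_train = torch.randint(0, 2, (1024,))    # placeholder labels

train = TensorDataset(x_train, y_train)
train_loader = DataLoader(
    train,
    batch_size=64,
    shuffle=True,
    pin_memory=torch.cuda.is_available(),  # page-locked buffers -> faster H2D copies
    num_workers=0,                         # bump above 0 to load batches in background workers
)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
for x_batch, y_batch in train_loader:
    # non_blocking only has an effect when the source tensor is pinned
    x_batch = x_batch.to(device, non_blocking=True)
    y_batch = y_batch.to(device, non_blocking=True)
    break
```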
Here is the system info of the new server with the RTX 3090.
PyTorch version: 1.7.1
Is debug build: False
CUDA used to build PyTorch: 11.0
ROCM used to build PyTorch: N/A
OS: Ubuntu 18.04.5 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: Could not collect
CMake version: Could not collect
Python version: 3.7 (64-bit runtime)
Is CUDA available: True
CUDA runtime version: 11.2.142
GPU models and configuration: GPU 0: GeForce RTX 3090
Nvidia driver version: 460.32.03
cuDNN version: Probably one of the following:
HIP runtime version: N/A
MIOpen runtime version: N/A
Versions of relevant libraries:
[conda] blas 1.0 mkl
[conda] cudatoolkit 11.0.221 h6bb024c_0
[conda] mkl 2020.0 166
[conda] mkl-service 2.3.0 py37he904b0f_0
[conda] mkl_fft 1.0.15 py37ha843d7b_0
[conda] mkl_random 1.1.0 py37hd6b4f25_0
[conda] numpy 1.18.1 py37h4f9e942_0
[conda] numpy-base 1.18.1 py37hde5b4d6_1
[conda] numpydoc 0.9.2 py_0
[conda] pytorch 1.7.1 py3.7_cuda11.0.221_cudnn8.0.5_0 pytorch
[conda] torchaudio 0.7.2 pypi_0 pypi
[conda] torchvision 0.8.2+cu110 pypi_0 pypi