RTX 3070: AMP doesn't seem to be working

I have recently built a DL workstation based on a GeForce RTX 3070 card.

The problem I have so far is that I cannot get my graphics card to work with AMP.

I’m using PyTorch Lightning to enable AMP in my project, which in turn uses PyTorch's native AMP support. It works for me in Kaggle kernels, but not on my workstation. Whether I configure half or full precision, the memory consumption is the same (the batch size did not change during my checks).

Here is my PyTorch version:

PyTorch built with:
  - GCC 7.3
  - C++ Version: 201402
  - Intel(R) Math Kernel Library Version 2020.0.0 Product Build 20191122 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v1.7.0 (Git Hash 7aed236906b1f7a05c0917e5257a1af05e9ff683)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - NNPACK is enabled
  - CPU capability usage: AVX2
  - CUDA Runtime 11.1
  - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86
  - CuDNN 8.0.5
  - Magma 2.5.2
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.1, CUDNN_VERSION=8.0.5, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.8.1, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, 
  - PyTorch Version: 1.8.1+cu111

I have installed it via

pip install torch==1.8.1+cu111 torchvision==0.9.1+cu111 -f https://download.pytorch.org/whl/torch_stable.htm

Here is another piece of information from the PyTorch env command:

Collecting environment information...
PyTorch version: 1.8.1+cu111
Is debug build: False
CUDA used to build PyTorch: 11.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.10 (x86_64)
GCC version: (Ubuntu 10.2.0-13ubuntu1) 10.2.0
Clang version: Could not collect
CMake version: Could not collect

Python version: 3.8 (64-bit runtime)
Is CUDA available: True
CUDA runtime version: Could not collect
GPU models and configuration: GPU 0: GeForce RTX 3070
Nvidia driver version: 460.56
cuDNN version: Probably one of the following:
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.20.1
[pip3] pytorch-lightning==1.2.3
[pip3] torch==1.8.1+cu111
[pip3] torchvision==0.9.1+cu111
[conda] Could not collect

Do you have any ideas what could be wrong?

Thank you in advance!

nvidia-smi reports the device memory used by all processes, and for PyTorch that number covers both the allocated and the reserved (cached) memory.
The memory saving depends on the actual model as well as on the algorithms used, and you can check the detailed memory stats via torch.cuda.memory_summary().
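
Note that these allocator statistics are tracked per process, so they have to be read from inside the process that runs the training; a separate interpreter only sees its own allocator. A minimal sketch of how the numbers could be collected, assuming a hypothetical helper named `cuda_memory_report` (not a PyTorch API) that is called from the training script after a few steps:

```python
import torch


def cuda_memory_report(device: int = 0) -> dict:
    """Return allocator statistics (in MiB) for the current process.

    `cuda_memory_report` is a hypothetical helper name, not a PyTorch API.
    """
    if not torch.cuda.is_available():
        # Nothing to report on a CPU-only machine.
        return {"allocated": 0.0, "reserved": 0.0, "peak_allocated": 0.0}

    mib = 1024 ** 2
    return {
        # Memory occupied by live tensors.
        "allocated": torch.cuda.memory_allocated(device) / mib,
        # Memory held by PyTorch's caching allocator, including unused cache.
        "reserved": torch.cuda.memory_reserved(device) / mib,
        # High-water mark of tensor memory since the process started
        # (or since the last torch.cuda.reset_peak_memory_stats()).
        "peak_allocated": torch.cuda.max_memory_allocated(device) / mib,
    }


if __name__ == "__main__":
    print(cuda_memory_report())
    if torch.cuda.is_available():
        # Full table of allocator statistics for this process.
        print(torch.cuda.memory_summary())
```

Comparing `peak_allocated` between an FP32 run and an AMP run with the same batch size is usually more telling than the nvidia-smi figure, since it excludes the cache and the CUDA context.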


@ptrblck Hi, Piotr :wave: Glad you replied!

I could not get torch.cuda.memory_summary() working for some reason. I ran the training from one SSH connection and then ran python3 -c 'import torch; print(torch.cuda.memory_summary())' from another SSH connection. However, I got all zeroes, which is weird.

At the same time, nvtop and nvidia-smi showed that the GPU was being utilised.

In any case, I don't think this is just a matter of misleading nvidia-smi output.

Namely, I checked that the largest batch size that fits on the current dataset with FP32 is 8 samples; 9 gives OOM errors. When I enable FP16, the maximum batch size still seems to be 8, and 9 still gives me OOM errors.

My model is the vanilla torchvision implementation of RetinaNet (torchvision.models.detection.retinanet_resnet50_fpn + torchvision.models.detection.retinanet.RetinaNetHead), applied to a custom image dataset.

Is there another quick way to check that FP16 is really working in my environment (to rule out the case where my particular usage of RetinaNet is somehow faulty)?
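
One quick check that does not involve Lightning or RetinaNet at all is to run a single matrix multiplication under native autocast and look at the result dtype. This is only a sketch; `autocast_matmul_dtype` is a hypothetical helper name, and the 64x64 tensors are arbitrary:

```python
import torch


def autocast_matmul_dtype() -> torch.dtype:
    """Return the dtype of a matmul result computed under autocast."""
    use_cuda = torch.cuda.is_available()
    device = "cuda" if use_cuda else "cpu"
    a = torch.randn(64, 64, device=device)  # float32 inputs
    b = torch.randn(64, 64, device=device)
    # torch.cuda.amp.autocast is the native AMP context manager in PyTorch 1.8.
    # It is disabled here when no GPU is present, so the snippet still runs.
    with torch.cuda.amp.autocast(enabled=use_cuda):
        out = a @ b
    return out.dtype


if __name__ == "__main__":
    # torch.float16 on a working CUDA + AMP setup; torch.float32 without a GPU.
    print(autocast_matmul_dtype())
```

If this prints torch.float16 on the workstation, autocast itself works with the card and the installed PyTorch build, and the cause is more likely in how the model or the training framework is set up.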

Could it be that the data/gradients aren’t being cast to the half-precision variants? I don’t know how Lightning works or whether it’s supposed to handle this automatically, but if autocast isn’t applied then it won’t matter that AMP is imported.


Hi @cpeters :wave:
When I used FP16 with PyTorch Lightning in Kaggle kernels, the precision config was the only change I made to enable AMP. However, that dataset did consist of 8-bit images.

This time I’m running on a dataset that I created myself (using just a mobile camera). So possibly this is the reason for the issue.

Does it mean that I need to include bit-depth downscaling in my data processing routine?

…just in case, I have posted a thread on the PyTorch Lightning side about this issue:

Thank you,

It sounds like it could be a problem on Lightning’s end. There’s nothing about the 3000 series that doesn’t work with AMP (I’ve done it here), and when writing native PyTorch the casting is done explicitly.
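
For comparison, here is a minimal sketch of what the explicit native-PyTorch version looks like, using `torch.cuda.amp.autocast` and `torch.cuda.amp.GradScaler`. The linear model and the random tensors are placeholders, not the RetinaNet setup from the question:

```python
import torch
from torch import nn

use_amp = torch.cuda.is_available()  # mixed precision here needs a CUDA device
device = "cuda" if use_amp else "cpu"

# Toy model and data, standing in for the real network and dataset.
model = nn.Linear(16, 4).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)
inputs = torch.randn(8, 16, device=device)
targets = torch.randn(8, 4, device=device)

losses = []
for _ in range(3):
    optimizer.zero_grad()
    # Forward pass and loss run under autocast, where eligible ops use float16.
    with torch.cuda.amp.autocast(enabled=use_amp):
        outputs = model(inputs)
        loss = nn.functional.mse_loss(outputs, targets)
    # Scale the loss to limit float16 gradient underflow, then step via the scaler.
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    losses.append(loss.item())

# Parameters stay in float32; only ops inside the autocast region use float16.
print(outputs.dtype, next(model.parameters()).dtype, losses)
```

If a loop like this runs in mixed precision on the workstation, the GPU and the PyTorch build are fine, which points the investigation at the layers above.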


@cpeters, Most likely. It’s getting interesting.

I ran the project that I used to test AMP in Kaggle kernels on my workstation, and AMP seems to be working there. So yeah, it looks like an issue on the Lightning side that my combination of model and dataset uncovered :relieved:

In any case, thank you for the help!