vgoklani
(Vishal Goklani)
May 4, 2023, 3:39pm
1
Hey there,
Is there a PyTorch method for fetching the shared memory capacity for the sm_89 series of chips (specifically the NVIDIA A6000 Ada cards)?
It’s my understanding that sm_86 has a shared memory capacity of ~100 KB, which caps the head dimension at 64 for flash attention. The A100s (sm_80) have 164 KB, which allows a head dimension of 128. I’d like to check this value for sm_89 so I can figure out the max head dimension.
Is there another name for the shared memory capacity? I don’t think it’s the same as the L1 cache. What should I look for in the Ampere/Ada Lovelace/Hopper white papers?
reference: NVIDIA Ampere GPU Architecture Tuning Guide
One more thing: Is there a torch equivalent to this:
from numba.cuda.cudadrv import enums
from numba import cuda
device = cuda.get_current_device()
# Collect every driver attribute name, minus the common prefix.
attribs = [
    name.replace("CU_DEVICE_ATTRIBUTE_", "")
    for name in dir(enums)
    if name.startswith("CU_DEVICE_ATTRIBUTE_")
]
# Each attribute is resolved on access, so look them up one by one.
for attr in attribs:
    print(attr, "=", getattr(device, attr))
I tried:
torch.cuda.get_device_properties()
but it doesn’t return all the attributes. In the numba implementation, the device attributes are appended lazily, hence the iteration through the enum.
Thank you!
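For the question about a torch equivalent of the numba loop, one rough approach is to enumerate whatever public fields the object returned by `torch.cuda.get_device_properties()` happens to expose. This is a minimal sketch, not a full replacement: the helper names `public_attributes` and `dump_cuda_properties` are made up for illustration, and the set of fields depends on the installed PyTorch version, so it will generally be a subset of the `CU_DEVICE_ATTRIBUTE_*` list that numba walks.

```python
from types import SimpleNamespace


def public_attributes(obj):
    """Return {name: value} for every public, non-callable attribute of obj."""
    result = {}
    for name in sorted(dir(obj)):
        if name.startswith("_"):
            continue
        try:
            value = getattr(obj, name)
        except Exception:
            # Skip attributes that fail to resolve instead of aborting the dump.
            continue
        if not callable(value):
            result[name] = value
    return result


def dump_cuda_properties(device_index=0):
    """Hypothetical helper: public fields of the CUDA device properties.

    Returns an empty dict when torch is not installed or no GPU is visible.
    """
    try:
        import torch
    except ImportError:
        return {}
    if not torch.cuda.is_available():
        return {}
    return public_attributes(torch.cuda.get_device_properties(device_index))


if __name__ == "__main__":
    props = dump_cuda_properties()
    if not props:
        # Stand-in object so the loop still demonstrates the idea without a GPU.
        props = public_attributes(SimpleNamespace(name="no GPU found", major=0, minor=0))
    for name, value in props.items():
        print(name, "=", value)
```

Anything not listed by this loop is simply not exposed through that properties object in the installed build; for those values the numba loop above or the CUDA runtime itself remains the fallback.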
eqy
May 4, 2023, 7:57pm
2
sm89 should have the same restriction as sm86, barring future improvements to flash attention implementations. You can find the documented shared memory capacity in the Ada tuning guide similar to the one you linked for Ampere here: NVIDIA Ada GPU Architecture Tuning Guide
This limitation is enforced in upstream PyTorch’s flash attention here: pytorch/sdp_utils.cpp at 8994d9e6109c541a1d581c383e4de9ed68205d91 · pytorch/pytorch · GitHub
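To make the comparison concrete, here is a small sketch that records the maximum shared memory per SM as listed in NVIDIA's Ampere and Ada tuning guides and compares architectures by compute capability. The table and the helper names (`SMEM_PER_SM_KB`, `smem_per_sm_kb`, `has_at_least_smem_of`) are illustrative assumptions, not a PyTorch API; the values are typed in by hand from the guides rather than queried from the device.

```python
# Maximum shared memory per SM in KB, keyed by (major, minor) compute capability.
# Values are copied from NVIDIA's tuning guides; the usable per-block limit is
# slightly lower because some memory is reserved for system use.
SMEM_PER_SM_KB = {
    (8, 0): 164,  # A100
    (8, 6): 100,  # GA10x Ampere
    (8, 9): 100,  # Ada
}


def smem_per_sm_kb(major, minor):
    """Return the documented per-SM shared memory in KB, or None if unlisted."""
    return SMEM_PER_SM_KB.get((major, minor))


def has_at_least_smem_of(candidate, reference):
    """True if `candidate` has at least as much shared memory per SM as `reference`."""
    cand = smem_per_sm_kb(*candidate)
    ref = smem_per_sm_kb(*reference)
    if cand is None or ref is None:
        raise KeyError("compute capability not in the table")
    return cand >= ref


if __name__ == "__main__":
    for cc in sorted(SMEM_PER_SM_KB):
        print("sm_%d%d: %d KB per SM" % (cc[0], cc[1], SMEM_PER_SM_KB[cc]))
    # sm_89 matches sm_86 but falls short of sm_80.
    print(has_at_least_smem_of((8, 9), (8, 6)))
    print(has_at_least_smem_of((8, 9), (8, 0)))
```

Since sm_89 and sm_86 share the same documented capacity, a kernel tuned around the sm_80 figure would hit the same ceiling on both.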
vgoklani
(Vishal Goklani)
May 4, 2023, 8:10pm
3
Thanks @eqy - but then how are they able to support a head_dim of 96 here:
opened 01:10PM - 01 Apr 23 UTC
closed 10:13PM - 24 Apr 23 UTC
high priority
triage review
module: cuda
triaged
oncall: transformer/mha
### 🐛 Describe the bug
Similar to https://github.com/pytorch/pytorch/issues/9… 4883
I'm trying to run textual inversion training with stable-diffusion on PyTorch 2.0 using an RTX 4090, and I'm seeing `Expected is_sm80 to be true, but got false`, which I believe should not trigger here.
`git clone https://github.com/AUTOMATIC1111/stable-diffusion-webui.git && cd stable-diffusion-webui && git checkout a9fed7c364061ae6efb37f797b6b522cb3cf7aa2`
Modify `COMMANDLINE_ARGS` and `TORCH_COMMAND` in `webui-user.bat` to use 2.0.0:
```
set COMMANDLINE_ARGS= --opt-sdp-attention
set TORCH_COMMAND=pip install torch==2.0.0 torchvision --extra-index-url https://download.pytorch.org/whl/cu118
```
Run `webui-user.bat`
`Train -> Train -> Train embedding`
```
Traceback (most recent call last):
File "E:\git\stable-diffusion-webui\modules\textual_inversion\textual_inversion.py", line 503, in train_embedding
scaler.scale(loss).backward()
File "E:\git\stable-diffusion-webui\venv\lib\site-packages\torch\_tensor.py", line 487, in backward
torch.autograd.backward(
File "E:\git\stable-diffusion-webui\venv\lib\site-packages\torch\autograd\__init__.py", line 200, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
File "E:\git\stable-diffusion-webui\venv\lib\site-packages\torch\autograd\function.py", line 274, in apply
return user_fn(self, *args)
File "E:\git\stable-diffusion-webui\venv\lib\site-packages\torch\utils\checkpoint.py", line 157, in backward
torch.autograd.backward(outputs_with_grad, args_with_grad)
File "E:\git\stable-diffusion-webui\venv\lib\site-packages\torch\autograd\__init__.py", line 200, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
RuntimeError: Expected is_sm80 to be true, but got false. (Could this error message be improved? If so, please report an enhancement request to PyTorch.)
```
```
python -c "import torch;print(torch.__config__.show(), torch.cuda.get_device_properties(0))"
PyTorch built with:
- C++ Version: 199711
- MSVC 193431937
- Intel(R) Math Kernel Library Version 2020.0.2 Product Build 20200624 for Intel(R) 64 architecture applications
- Intel(R) MKL-DNN v2.7.3 (Git Hash 6dbeffbae1f23cbbeae17adb7b5b13f1f37c080e)
- OpenMP 2019
- LAPACK is enabled (usually provided by MKL)
- CPU capability usage: AVX2
- CUDA Runtime 11.8
- NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_90,code=sm_90;-gencode;arch=compute_37,code=compute_37
- CuDNN 8.7
- Magma 2.5.4
- Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.8, CUDNN_VERSION=8.7.0, CXX_COMPILER=C:/actions-runner/_work/pytorch/pytorch/builder/windows/tmp_bin/sccache-cl.exe, CXX_FLAGS=/DWIN32 /D_WINDOWS /GR /EHsc /w /bigobj /FS -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOCUPTI -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_DISABLE_GPU_ASSERTS=OFF, TORCH_VERSION=2.0.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=OFF, USE_NNPACK=OFF, USE_OPENMP=ON, USE_ROCM=OFF,
_CudaDeviceProperties(name='NVIDIA GeForce RTX 4090', major=8, minor=9, total_memory=24563MB, multi_processor_count=128)
```
I see that `CudaDeviceProperties` reports `major=8` and `minor=9`, and the NVCC architecture flags include `sm_80`, so I assume the check should not fail? Or am I misunderstanding something here?
### Versions
Collecting environment information...
PyTorch version: 2.0.0+cu118
Is debug build: False
CUDA used to build PyTorch: 11.8
ROCM used to build PyTorch: N/A
OS: Microsoft Windows 10 Pro
GCC version: Could not collect
Clang version: Could not collect
CMake version: Could not collect
Libc version: N/A
Python version: 3.10.6 (tags/v3.10.6:9c7b4bd, Aug 1 2022, 21:53:49) [MSC v.1932 64 bit (AMD64)] (64-bit runtime)
Python platform: Windows-10-10.0.19044-SP0
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA GeForce RTX 4090
Nvidia driver version: 531.41
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
CPU:
Architecture=9
CurrentClockSpeed=3900
DeviceID=CPU0
Family=107
L2CacheSize=4096
L2CacheSpeed=
Manufacturer=AuthenticAMD
MaxClockSpeed=3900
Name=AMD Ryzen 7 3800X 8-Core Processor
ProcessorType=3
Revision=28928
Versions of relevant libraries:
[pip3] numpy==1.23.3
[pip3] open-clip-torch==2.7.0
[pip3] pytorch-lightning==1.7.6
[pip3] torch==2.0.0+cu118
[pip3] torchdiffeq==0.2.3
[pip3] torchmetrics==0.11.4
[pip3] torchsde==0.2.5
[pip3] torchvision==0.15.1+cu118
[conda] Could not collect
cc @ezyang @gchanan @zou3519 @ngimel @jbschlosser @bhosmer @cpuhrsch @erichan1
eqy
May 4, 2023, 8:12pm
4
I believe that would qualify as an improvement to the implementation rather than relying on the current kernels.