Can't run forward pass of WaveRNN model due to unsuccessful GPU RAM allocation

j-silv · April 23, 2025, 3:28am

When running a minimal example of the WaveRNN model, my Google Colab session on a T4 GPU (16 GB RAM) is unable to complete a forward pass due to a cuDNN error: CUDNN_STATUS_NOT_SUPPORTED. I suspect this is due to the large size of the waveform and specgram. If I set the n_time variable below to anything > 331, the same error occurs. However, if I set n_time to anything < 330, the error does not occur and the forward pass completes successfully.

The code snippet is provided in this Google Colab notebook.

Steps to recreate:

1) Create model

import torch
from torch import nn
import torchaudio
from torchaudio.models.wavernn import WaveRNN

device = "cuda" if torch.cuda.is_available() else "cpu"

n_time = 2401 # choosing n_time < 331 then no error, >= 331 then error
kernel_size = 5
hop_length = 200
bits = 8
n_freq = 80

model = WaveRNN(
    upsample_scales=[5, 5, 8],  # product must equal hop_length (5 * 5 * 8 = 200)
    n_classes=2**bits,
    hop_length=hop_length,
    kernel_size=kernel_size,
    n_freq=n_freq
)

model = model.to(device)
model.train()

waveform = torch.zeros((1, 1, (n_time - kernel_size + 1)*hop_length))
specgram = torch.zeros((1, 1, n_freq, n_time))

waveform = waveform.to(device)
specgram = specgram.to(device)

print(waveform.device, specgram.device, waveform.shape, specgram.shape)
print(waveform.is_contiguous(), specgram.is_contiguous())

print(f"{torch.cuda.memory_allocated()/1024**2:.3f}")
print(f"{torch.cuda.memory_reserved()/1024**2:.3f}")

Output

cuda:0 cuda:0 torch.Size([1, 1, 479400]) torch.Size([1, 1, 80, 2401])
True True
19.187
44.000
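
As a sanity check on the shapes printed above, the waveform length can be computed directly from n_time with the same formula used to build the tensor. This is only a sketch: the helper name waveform_length is hypothetical, and it assumes (not confirmed here) that the GRU inside WaveRNN sees a time axis of the same length as the waveform.

```python
def waveform_length(n_time, kernel_size=5, hop_length=200):
    """Number of waveform samples paired with a spectrogram of n_time frames.

    Same formula as the torch.zeros(...) call for `waveform` in the snippet.
    """
    return (n_time - kernel_size + 1) * hop_length


for n_time in (331, 332, 2401):
    print(n_time, waveform_length(n_time))
# 331 65400
# 332 65600
# 2401 479400
```

The last line matches the torch.Size([1, 1, 479400]) printed for the waveform.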

2) Run forward pass

torch.cuda.memory._record_memory_history()
output = model(waveform, specgram)

Output

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-3-86c87abd6ab4> in <cell line: 0>()
      1 torch.cuda.memory._record_memory_history()
      2 
----> 3 output = model(waveform, specgram)

5 frames
/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py in _wrapped_call_impl(self, *args, **kwargs)
   1737             return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1738         else:
-> 1739             return self._call_impl(*args, **kwargs)
   1740 
   1741     # torchrec tests the code consistency with the following code

/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py in _call_impl(self, *args, **kwargs)
   1748                 or _global_backward_pre_hooks or _global_backward_hooks
   1749                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1750             return forward_call(*args, **kwargs)
   1751 
   1752         result = None

/usr/local/lib/python3.11/dist-packages/torchaudio/models/wavernn.py in forward(self, waveform, specgram)
    309         x = self.fc(x)
    310         res = x
--> 311         x, _ = self.rnn1(x, h1)
    312 
    313         x = x + res

/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py in _wrapped_call_impl(self, *args, **kwargs)
   1737             return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1738         else:
-> 1739             return self._call_impl(*args, **kwargs)
   1740 
   1741     # torchrec tests the code consistency with the following code

/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py in _call_impl(self, *args, **kwargs)
   1748                 or _global_backward_pre_hooks or _global_backward_hooks
   1749                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1750             return forward_call(*args, **kwargs)
   1751 
   1752         result = None

/usr/local/lib/python3.11/dist-packages/torch/nn/modules/rnn.py in forward(self, input, hx)
   1391         self.check_forward_args(input, hx, batch_sizes)
   1392         if batch_sizes is None:
-> 1393             result = _VF.gru(
   1394                 input,
   1395                 hx,

RuntimeError: cuDNN error: CUDNN_STATUS_NOT_SUPPORTED. This error may appear if you passed in a non-contiguous input.

3) Print allocated memory

torch.cuda.memory._dump_snapshot("cuda_memory.pickle")
print(f"{torch.cuda.memory_allocated()/1024**2:.3f}")
print(f"{torch.cuda.memory_reserved()/1024**2:.3f}")

Output

1771.498
2762.000

In the Google Colab notebook, you can visualize the memory allocation by loading the cuda_memory.pickle snapshot into https://pytorch.org/memory_viz

The machine environment information:

!wget https://raw.githubusercontent.com/pytorch/pytorch/main/torch/utils/collect_env.py
!python collect_env.py

Output

Collecting environment information...
PyTorch version: 2.6.0+cu124
Is debug build: False
CUDA used to build PyTorch: 12.4
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.4 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: 14.0.0-1ubuntu1.1
CMake version: version 3.31.6
Libc version: glibc-2.35

Python version: 3.11.12 (main, Apr  9 2025, 08:55:54) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-6.1.123+-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 12.5.82
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: Tesla T4
Nvidia driver version: 550.54.15
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.9.2.1
/usr/lib/x86_64-linux-gnu/libcudnn_adv.so.9.2.1
/usr/lib/x86_64-linux-gnu/libcudnn_cnn.so.9.2.1
/usr/lib/x86_64-linux-gnu/libcudnn_engines_precompiled.so.9.2.1
/usr/lib/x86_64-linux-gnu/libcudnn_engines_runtime_compiled.so.9.2.1
/usr/lib/x86_64-linux-gnu/libcudnn_graph.so.9.2.1
/usr/lib/x86_64-linux-gnu/libcudnn_heuristic.so.9.2.1
/usr/lib/x86_64-linux-gnu/libcudnn_ops.so.9.2.1
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                         x86_64
CPU op-mode(s):                       32-bit, 64-bit
Address sizes:                        46 bits physical, 48 bits virtual
Byte Order:                           Little Endian
CPU(s):                               2
On-line CPU(s) list:                  0,1
Vendor ID:                            GenuineIntel
Model name:                           Intel(R) Xeon(R) CPU @ 2.00GHz
CPU family:                           6
Model:                                85
Thread(s) per core:                   2
Core(s) per socket:                   1
Socket(s):                            1
Stepping:                             3
BogoMIPS:                             4000.28
Flags:                                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves arat md_clear arch_capabilities
Hypervisor vendor:                    KVM
Virtualization type:                  full
L1d cache:                            32 KiB (1 instance)
L1i cache:                            32 KiB (1 instance)
L2 cache:                             1 MiB (1 instance)
L3 cache:                             38.5 MiB (1 instance)
NUMA node(s):                         1
NUMA node0 CPU(s):                    0,1
Vulnerability Gather data sampling:   Not affected
Vulnerability Itlb multihit:          Not affected
Vulnerability L1tf:                   Mitigation; PTE Inversion
Vulnerability Mds:                    Vulnerable; SMT Host state unknown
Vulnerability Meltdown:               Vulnerable
Vulnerability Mmio stale data:        Vulnerable
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed:               Vulnerable
Vulnerability Spec rstack overflow:   Not affected
Vulnerability Spec store bypass:      Vulnerable
Vulnerability Spectre v1:             Vulnerable: __user pointer sanitization and usercopy barriers only; no swapgs barriers
Vulnerability Spectre v2:             Vulnerable; IBPB: disabled; STIBP: disabled; PBRSB-eIBRS: Not affected; BHI: Vulnerable
Vulnerability Srbds:                  Not affected
Vulnerability Tsx async abort:        Vulnerable

Versions of relevant libraries:
[pip3] numpy==2.0.2
[pip3] nvidia-cublas-cu12==12.5.3.2
[pip3] nvidia-cuda-cupti-cu12==12.5.82
[pip3] nvidia-cuda-nvrtc-cu12==12.5.82
[pip3] nvidia-cuda-runtime-cu12==12.5.82
[pip3] nvidia-cudnn-cu12==9.3.0.75
[pip3] nvidia-cufft-cu12==11.2.3.61
[pip3] nvidia-curand-cu12==10.3.6.82
[pip3] nvidia-cusolver-cu12==11.6.3.83
[pip3] nvidia-cusparse-cu12==12.5.1.3
[pip3] nvidia-cusparselt-cu12==0.6.2
[pip3] nvidia-nccl-cu12==2.21.5
[pip3] nvidia-nvjitlink-cu12==12.5.82
[pip3] nvidia-nvtx-cu12==12.4.127
[pip3] nvtx==0.2.11
[pip3] optree==0.15.0
[pip3] pynvjitlink-cu12==0.5.2
[pip3] torch==2.6.0+cu124
[pip3] torchaudio==2.6.0+cu124
[pip3] torchsummary==1.5.1
[pip3] torchvision==0.21.0+cu124
[pip3] triton==3.2.0
[conda] Could not collect

j-silv · May 24, 2025, 6:31pm

Hello,

I’ve investigated this in some more detail and have traced the issue down to a single GRU forward call in the WaveRNN module. I have recreated an even simpler bare-bones example which illustrates the cuDNN error.

import torch
from torch import nn

device = "cuda" if torch.cuda.is_available() else "cpu"
batch_size = 1
n_rnn = 512

h1 = torch.zeros(1, batch_size, n_rnn, dtype=torch.float32, device=device)
rnn1 = nn.GRU(n_rnn, n_rnn, batch_first=True, device=device)

x1 = torch.ones([batch_size, 65400, n_rnn], device=device)
x2 = torch.ones([batch_size, 65600, n_rnn], device=device)

# prints: True, True, True
print(h1.is_contiguous(), x1.is_contiguous(), x2.is_contiguous())

# runs fine
res, _ = rnn1(x1, h1) 

# fails with:
#  File "venv/lib/python3.10/site-packages/torch/nn/modules/rnn.py", line 1393, in forward
#   result = _VF.gru(
# RuntimeError: cuDNN error: CUDNN_STATUS_NOT_SUPPORTED.
# This error may appear if you passed in a non-contiguous input.
res, _ = rnn1(x2, h1)

It seems that when the sequence length passed to the GRU is 65400, the call succeeds, but with a length of 65600 it fails. This is really surprising, because I would expect either sequence to fit in the 16 GB of the T4 GPU just fine. I even verified this with jacobkimmel’s PyTorch size estimator:

from pytorch_modelsize import SizeEstimator

se = SizeEstimator(rnn1, input_size=x1.shape)
print(se.estimate_size()) # prints 127 MB
se = SizeEstimator(rnn1, input_size=x2.shape)
print(se.estimate_size()) # prints 128 MB
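
For a rough cross-check that does not depend on the estimator, the size of the float32 input tensor alone can be worked out by hand. This is a back-of-the-envelope sketch: tensor_mib is a hypothetical helper, and it counts only the input tensor, not weights, outputs, or any cuDNN workspace. The last column is only an observation that the two lengths sit on either side of 2**16; whether that is related to the error is a guess.

```python
def tensor_mib(shape, bytes_per_element=4):
    """Size in MiB of a dense tensor with the given shape (float32 by default)."""
    n_elements = 1
    for dim in shape:
        n_elements *= dim
    return n_elements * bytes_per_element / 1024**2


for seq_len in (65400, 65600):
    # columns: sequence length, input size, "is the length below 2**16?"
    print(seq_len, f"{tensor_mib((1, seq_len, 512)):.3f} MiB", seq_len < 2**16)
# 65400 127.734 MiB True
# 65600 128.125 MiB False
```

Both inputs differ by well under 1 MiB, so the input size by itself does not explain why one call runs and the other fails.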

I’m on a different machine than in the original post (a Google VM instance), and here is the output gathered from collect_env.py:

Collecting environment information...
PyTorch version: 2.7.0+cu126
Is debug build: False
CUDA used to build PyTorch: 12.6
ROCM used to build PyTorch: N/A

OS: Debian GNU/Linux 11 (bullseye) (x86_64)
GCC version: (Debian 10.2.1-6) 10.2.1 20210110
Clang version: Could not collect
CMake version: version 3.18.4
Libc version: glibc-2.31

Python version: 3.10.15 | packaged by conda-forge | (main, Oct 16 2024, 01:24:24) [GCC 13.3.0] (64-bit runtime)
Python platform: Linux-5.10.0-33-cloud-amd64-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: 11.8.89
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: Tesla T4
Nvidia driver version: 550.90.07
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                         x86_64
CPU op-mode(s):                       32-bit, 64-bit
Byte Order:                           Little Endian
Address sizes:                        46 bits physical, 48 bits virtual
CPU(s):                               4
On-line CPU(s) list:                  0-3
Thread(s) per core:                   2
Core(s) per socket:                   2
Socket(s):                            1
NUMA node(s):                         1
Vendor ID:                            GenuineIntel
CPU family:                           6
Model:                                63
Model name:                           Intel(R) Xeon(R) CPU @ 2.30GHz
Stepping:                             0
CPU MHz:                              2299.998
BogoMIPS:                             4599.99
Hypervisor vendor:                    KVM
Virtualization type:                  full
L1d cache:                            64 KiB
L1i cache:                            64 KiB
L2 cache:                             512 KiB
L3 cache:                             45 MiB
NUMA node0 CPU(s):                    0-3
Vulnerability Gather data sampling:   Not affected
Vulnerability Itlb multihit:          Not affected
Vulnerability L1tf:                   Mitigation; PTE Inversion
Vulnerability Mds:                    Mitigation; Clear CPU buffers; SMT Host state unknown
Vulnerability Meltdown:               Mitigation; PTI
Vulnerability Mmio stale data:        Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed:               Mitigation; IBRS
Vulnerability Spec rstack overflow:   Not affected
Vulnerability Spec store bypass:      Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:             Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:             Mitigation; IBRS, IBPB conditional, STIBP conditional, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds:                  Not affected
Vulnerability Tsx async abort:        Not affected
Flags:                                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm invpcid_single pti ssbd ibrs ibpb stibp fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid xsaveopt arat md_clear arch_capabilities

Versions of relevant libraries:
[pip3] numpy==2.2.6
[pip3] nvidia-cublas-cu11==11.11.3.6
[pip3] nvidia-cublas-cu12==12.6.4.1
[pip3] nvidia-cuda-cupti-cu11==11.8.87
[pip3] nvidia-cuda-cupti-cu12==12.6.80
[pip3] nvidia-cuda-nvrtc-cu11==11.8.89
[pip3] nvidia-cuda-nvrtc-cu12==12.6.77
[pip3] nvidia-cuda-runtime-cu11==11.8.89
[pip3] nvidia-cuda-runtime-cu12==12.6.77
[pip3] nvidia-cudnn-cu11==9.1.0.70
[pip3] nvidia-cudnn-cu12==9.5.1.17
[pip3] nvidia-cufft-cu11==10.9.0.58
[pip3] nvidia-cufft-cu12==11.3.0.4
[pip3] nvidia-curand-cu11==10.3.0.86
[pip3] nvidia-curand-cu12==10.3.7.77
[pip3] nvidia-cusolver-cu11==11.4.1.48
[pip3] nvidia-cusolver-cu12==11.7.1.2
[pip3] nvidia-cusparse-cu11==11.7.5.86
[pip3] nvidia-cusparse-cu12==12.5.4.2
[pip3] nvidia-cusparselt-cu12==0.6.3
[pip3] nvidia-nccl-cu11==2.21.5
[pip3] nvidia-nccl-cu12==2.26.2
[pip3] nvidia-nvjitlink-cu12==12.6.85
[pip3] nvidia-nvtx-cu11==11.8.86
[pip3] nvidia-nvtx-cu12==12.6.77
[pip3] pytorch-lightning==2.5.1.post0
[pip3] torch==2.7.0
[pip3] torchaudio==2.7.0
[pip3] torchmetrics==1.7.1
[pip3] triton==3.3.0
[conda] numpy                     1.25.2                   pypi_0    pypi

JuanFMontesinos · May 24, 2025, 10:00pm

Silly suggestion: when you use batch_first=True, the model internally permutes the input to sequence-first (i.e. the batch_first=False layout), and then it’s no longer contiguous.
Could you try sequence first?
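
To illustrate the contiguity point, here is a small sketch of what swapping the batch and time axes does to is_contiguous(). It uses tiny made-up shapes and runs on CPU; whether nn.GRU's cuDNN path performs exactly this transpose is an assumption, not something checked here.

```python
import torch

# batch_size > 1: swapping batch and time axes gives a non-contiguous view
x = torch.ones(2, 6, 4)  # (batch, time, features)
print(x.transpose(0, 1).is_contiguous())               # False
print(x.transpose(0, 1).contiguous().is_contiguous())  # True

# batch_size == 1: the contiguity check skips size-1 dimensions,
# so the transposed view still counts as contiguous
x1 = torch.ones(1, 6, 4)
print(x1.transpose(0, 1).is_contiguous())              # True
```

With batch_size = 1, as in the examples in this thread, the permuted tensor is still reported as contiguous, so this may not be the cause in that particular case.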

j-silv · May 24, 2025, 10:16pm

Hmm, I tried setting batch_first to False and rearranging the input tensors, but I still get the same error.

import torch
from torch import nn

device = "cuda" if torch.cuda.is_available() else "cpu"
batch_size = 1
n_rnn = 512

# h_0 is always (num_layers, batch, hidden), regardless of batch_first
h1 = torch.zeros(1, batch_size, n_rnn, dtype=torch.float32, device=device)
rnn1 = nn.GRU(n_rnn, n_rnn, batch_first=False, device=device)

x1 = torch.ones([65400, batch_size, n_rnn], device=device)
x2 = torch.ones([65600, batch_size, n_rnn], device=device)

# prints: True, True, True
print(h1.is_contiguous(), x1.is_contiguous(), x2.is_contiguous())

# runs fine
res, _ = rnn1(x1, h1)

# fails with:
#  File "venv/lib/python3.10/site-packages/torch/nn/modules/rnn.py", line 1393, in forward
#   result = _VF.gru(
# RuntimeError: cuDNN error: CUDNN_STATUS_NOT_SUPPORTED.
# This error may appear if you passed in a non-contiguous input.
res, _ = rnn1(x2, h1)
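
One possible workaround to try, assuming the failure depends on the sequence length of a single GRU call (which the two lengths above suggest but do not prove), is to feed the sequence in chunks and carry the hidden state between them. This is a minimal sketch, not a confirmed fix: gru_forward_chunked and its chunk_len default are hypothetical, and it only covers a unidirectional GRU with batch_first=True.

```python
import torch
from torch import nn


def gru_forward_chunked(rnn, x, h0, chunk_len=32768):
    """Run a unidirectional, batch_first GRU over x in chunks along the time axis.

    The hidden state returned for one chunk seeds the next chunk, so the
    result should match a single call up to floating-point error.
    """
    outputs = []
    h = h0
    for start in range(0, x.size(1), chunk_len):
        chunk = x[:, start:start + chunk_len].contiguous()
        out, h = rnn(chunk, h)
        outputs.append(out)
    return torch.cat(outputs, dim=1), h


# Small demo that runs on CPU or GPU; sizes are illustrative only.
device = "cuda" if torch.cuda.is_available() else "cpu"
torch.manual_seed(0)
rnn_demo = nn.GRU(16, 16, batch_first=True, device=device)
x_demo = torch.randn(1, 100, 16, device=device)
h_demo = torch.zeros(1, 1, 16, device=device)

with torch.no_grad():
    full, _ = rnn_demo(x_demo, h_demo)
    chunked, _ = gru_forward_chunked(rnn_demo, x_demo, h_demo, chunk_len=32)

# Largest absolute difference between the two results (expected to be tiny)
print((full - chunked).abs().max().item())
```

For the failing case this would be called as gru_forward_chunked(rnn1, x2, h1) with the batch_first=True layout. The concatenated output is as large as the output of a single call, so this changes the per-call sequence length, not the total memory used.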