I have an example where the GPU memory used by a GRU explodes on AMD ROCm but not on an NVIDIA GPU. The issue seems to be caused by autocast: with autocast disabled, the memory usage is comparable. Without autocast, the allocated memory on AMD is 0.4 GB; with autocast, AMD uses 5.1 GB. The NVIDIA GPU uses 0.2 GB in both cases. If possible, I would still like to use autocast on AMD as well.
```python
# -*- coding: utf-8 -*-
import torch
from torch import nn
from torch.cuda.amp import autocast
import sys

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

rnn = nn.GRU(input_size=768, hidden_size=512, batch_first=True,
             bidirectional=True, num_layers=3, dropout=0.5).to(device)
inputs = torch.randn(10, 231, 768).to(device)
# note: h0 is never passed to the GRU; a valid shape would be (6, 10, 512),
# i.e. (num_layers * num_directions, batch, hidden_size)
h0 = torch.randn(2, 3, 768).to(device)

print('Allocated after init:', round(torch.cuda.memory_allocated(0)/1024**3, 1), 'GB')

with autocast():
    output, hn = rnn(inputs)

print(output.dtype)
print('Allocated after rnn pass:', round(torch.cuda.memory_allocated(0)/1024**3, 1), 'GB')
print('Size of output:', sys.getsizeof(output.storage()))
print('Size of hn:', sys.getsizeof(hn.storage()))
```
Output on NVIDIA:

```
Allocated after init: 0.1 GB
torch.float32
Allocated after rnn pass: 0.2 GB
Size of output: 9461832
Size of hn: 122952
```
Output on AMD:
```
Allocated after init: 0.1 GB
torch.float16
Allocated after rnn pass: 5.1 GB
Size of output: 4730928
Size of hn: 61488
```