Autocast causes GRU memory to explode on AMD ROCm

I have an example where the GPU memory used by a GRU explodes on AMD ROCm but not on an NVIDIA GPU. The issue seems to be caused by autocast: with autocast disabled, memory usage on the two platforms is comparable. Without autocast, the allocated memory on AMD is 0.4 GB; with autocast, it is 5.1 GB. The NVIDIA GPU uses 0.2 GB in both cases. If possible, I would still like to use autocast on AMD as well.

# -*- coding: utf-8 -*-
import torch
from torch import nn
from torch.cuda.amp import autocast
import sys

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

rnn = nn.GRU(input_size=768, hidden_size=512, batch_first=True,
             bidirectional=True, num_layers=3, dropout=0.5).to(device)

inputs = torch.randn(10, 231, 768).to(device)
# Not passed to rnn below; a valid h0 has shape (num_layers * 2, batch, hidden_size).
h0 = torch.randn(6, 10, 512).to(device)
print('Allocated after init:', round(torch.cuda.memory_allocated(0) / 1024**3, 1), 'GB')

with autocast():
    output, hn = rnn(inputs)
    print(output.dtype)

print('Allocated after rnn pass:', round(torch.cuda.memory_allocated(0) / 1024**3, 1), 'GB')
print('Size of output:', sys.getsizeof(output.storage()))
print('Size of hn:', sys.getsizeof(hn.storage()))

Output on NVIDIA:

Allocated after init: 0.1 GB
torch.float32
Allocated after rnn pass: 0.2 GB
Size of output: 9461832
Size of hn: 122952

Output on AMD:

Allocated after init: 0.1 GB
torch.float16
Allocated after rnn pass: 5.1 GB
Size of output: 4730928
Size of hn: 61488
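
The script above only shows the autocast-enabled run. To compare the two cases side by side, something like the following sketch could be used. The helper name `gru_forward_memory` is hypothetical (it is not part of PyTorch). The sketch uses `torch.autocast(device_type=..., enabled=...)` instead of `torch.cuda.amp.autocast()` so the same code path can toggle autocast, and it assumes that the difference in `torch.cuda.memory_allocated` before and after the forward pass is a reasonable proxy for the memory held by the GRU call. I have not verified that it reproduces the numbers reported above.

```python
import torch
from torch import nn


def gru_forward_memory(use_autocast, device=None):
    """Return (output dtype, bytes newly allocated by one GRU forward pass).

    Hypothetical helper for this report; the GRU configuration and input
    shape are copied from the script above.
    """
    if device is None:
        device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    on_gpu = device.type == 'cuda'

    rnn = nn.GRU(input_size=768, hidden_size=512, batch_first=True,
                 bidirectional=True, num_layers=3, dropout=0.5).to(device)
    inputs = torch.randn(10, 231, 768, device=device)

    # Allocation counters are only meaningful on a GPU; report 0 on CPU.
    before = torch.cuda.memory_allocated(device) if on_gpu else 0
    # Autocast is only enabled on a GPU; on CPU the pass runs in float32.
    with torch.autocast(device_type=device.type,
                        enabled=use_autocast and on_gpu):
        output, hn = rnn(inputs)
    # output and hn are still alive here, so they are included in the delta.
    after = torch.cuda.memory_allocated(device) if on_gpu else 0
    return output.dtype, after - before


if __name__ == '__main__':
    for use_autocast in (False, True):
        dtype, used = gru_forward_memory(use_autocast)
        print('autocast =', use_autocast, '| dtype:', dtype,
              '| allocated by forward pass:', round(used / 1024**3, 2), 'GB')
```

Running this once on each machine should print one line per setting, which makes the with/without-autocast difference directly comparable between AMD and NVIDIA.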