Why does nn.LSTM still use float16 for the hidden state, even when autocast is set to bfloat16 or float32?

Here is the code:

import torch
import torch.nn as nn


rnn = nn.LSTM(10, 20, 2).cuda()
input = torch.randn(5, 3, 10).cuda()
h0 = torch.randn(2, 3, 20).cuda()
c0 = torch.randn(2, 3, 20).cuda()

with torch.amp.autocast("cuda", dtype=torch.bfloat16):  # request bfloat16 autocast
    output, (hn, cn) = rnn(input, (h0, c0))
    print(hn)  # the tensor repr shows dtype=torch.float16, not bfloat16

I find that the dtype of hn is float16, and I don't know why.
My torch version is 2.4.0.
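For comparison, here is a minimal sketch of the workaround I'd try: skip autocast for the LSTM and cast the module and tensors to bfloat16 explicitly, so the hidden-state dtype is under your control. This is an assumption on my part (it runs on CPU here just for illustration, and it presumes bfloat16 LSTM support on your build), not a confirmed fix:

```python
import torch
import torch.nn as nn

# Cast the module and all inputs to bfloat16 by hand instead of
# relying on autocast to pick the dtype for the RNN kernel.
rnn = nn.LSTM(10, 20, 2).to(torch.bfloat16)
x = torch.randn(5, 3, 10, dtype=torch.bfloat16)
h0 = torch.randn(2, 3, 20, dtype=torch.bfloat16)
c0 = torch.randn(2, 3, 20, dtype=torch.bfloat16)

output, (hn, cn) = rnn(x, (h0, c0))
print(hn.dtype)  # expected: torch.bfloat16
```

With the explicit cast, no autocast policy is consulted, so the hidden state should come back in the dtype you set.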