Autocast not working in Hugging Face NLP models

Hi, I am fine-tuning a roberta-base model from Hugging Face on the RTE dataset with fp16.
I am using torch.cuda.amp.autocast(), but the output of the model is still torch.float32 and there is no memory saving at all.
Here is the code:

import torch
from torch.cuda.amp import autocast
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("roberta-base")
model = model.to(rank)
model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[rank])

for epoch in range(epochs):
    model.train()
    for i, batch in enumerate(train_dataloader):
        with autocast():
            # labels are required for outputs.loss to be populated
            outputs = model(batch["input_ids"], batch["attention_mask"],
                            labels=batch["labels"])
            loss = outputs.loss
            ...

If I change the model to model.half(), the output is fp16, but I get NaN gradients. I tried to use GradScaler, but it does not support fp16 gradients.
How can I train Hugging Face NLP models in fp16?
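
For reference, the standard torch.cuda.amp recipe keeps the model weights in fp32 and combines autocast for the forward pass with GradScaler for the backward pass; the scaler multiplies the loss so small fp16 gradients do not underflow. A minimal sketch of such a loop, reusing the model and dataloader names from the snippet above (the optimizer is assumed to be defined):

import torch
from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()

for epoch in range(epochs):
    model.train()
    for i, batch in enumerate(train_dataloader):
        optimizer.zero_grad()
        with autocast():
            outputs = model(batch["input_ids"], batch["attention_mask"],
                            labels=batch["labels"])
            loss = outputs.loss
        # scale the loss so small fp16 gradients do not underflow to zero,
        # then unscale them before the optimizer step
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()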

I cannot reproduce the issue and see memory savings using amp with this code:

import torch
from transformers import AutoModelForSequenceClassification
from transformers import RobertaTokenizer


tokenizer = RobertaTokenizer.from_pretrained('roberta-base')

model = AutoModelForSequenceClassification.from_pretrained("roberta-base")
model = model.to('cuda:0')

text = ["Replace me by any text you'd like."]*1024
encoded_input = tokenizer(text, return_tensors='pt')
encoded_input  = {k: encoded_input[k].to('cuda') for k in encoded_input}

with torch.cuda.amp.autocast(enabled=True):
    outputs = model(**encoded_input)
    loss = outputs.loss

print(torch.cuda.memory_allocated()/1024**2)
# AMP: 5147.49072265625
# FP32: 7547.36669921875
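
To double-check that autocast is actually active, you can also inspect dtypes: inside the autocast region the logits should come back as fp16, while the parameters stay in fp32. A small sketch reusing the model and encoded_input from above:

with torch.cuda.amp.autocast(enabled=True):
    outputs = model(**encoded_input)
    print(outputs.logits.dtype)  # expected: torch.float16 inside autocast

print(next(model.parameters()).dtype)  # torch.float32: weights stay in fp32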

Thank you very much!
I saw the memory saving, too.
It was my mistake.