Hi, I am fine-tuning the roberta-base model from Hugging Face on the RTE dataset with fp16.
I am using torch.cuda.amp.autocast(), but the output of the model is still torch.float32
and there is no memory saving at all.
Here is the code:

import torch
from torch.cuda.amp import autocast
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("roberta-base")
model = model.to(rank)
model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[rank])

for epoch in range(epochs):
    model.train()
    for i, batch in enumerate(train_dataloader):
        with autocast():
            # labels are needed so that outputs.loss is populated
            outputs = model(batch["input_ids"], batch["attention_mask"], labels=batch["labels"])
            loss = outputs.loss
        ...
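In case it helps, this is roughly how I am observing the dtype (the print is just for illustration):

with autocast():
    outputs = model(batch["input_ids"], batch["attention_mask"])
    print(outputs.logits.dtype)  # prints torch.float32 for me, not torch.float16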
If I instead call model.half(), the output is fp16, but I get NaN gradients. I tried to use GradScaler, but the scaler does not support fp16 gradients.
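For reference, the scaler-based loop I tried follows the standard torch.cuda.amp.GradScaler recipe, roughly like this (just a sketch; the optimizer here stands in for mine):

scaler = torch.cuda.amp.GradScaler()

for epoch in range(epochs):
    model.train()
    for i, batch in enumerate(train_dataloader):
        optimizer.zero_grad()
        with autocast():
            outputs = model(batch["input_ids"], batch["attention_mask"], labels=batch["labels"])
            loss = outputs.loss
        scaler.scale(loss).backward()  # scale the loss to avoid fp16 gradient underflow
        scaler.step(optimizer)         # unscales gradients, then calls optimizer.step()
        scaler.update()                # adjusts the scale factor for the next iteration

With model.half() this fails, which seems consistent with GradScaler refusing to unscale fp16 gradients.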
How can I train Hugging Face NLP models in fp16?