Training Loss = 0.0, Validation Loss = nan

Hello, I am training a model, but the training loss is 0.0 and the validation loss is nan. This only started happening when I switched the pretrained checkpoint from T5 to mT5.

I don’t know what’s wrong because it was working with t5.

args = Seq2SeqTrainingArguments(
    model_dir,  # output_dir: where checkpoints are written
    evaluation_strategy="steps",
    eval_steps=100,
    logging_strategy="steps",
    logging_steps=100,
    save_strategy="steps",
    save_steps=200,
    learning_rate=4e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=6,
    predict_with_generate=True,
    fp16=True,
    load_best_model_at_end=True,
)

trainer = Seq2SeqTrainer(
    model_init=model_init,
    args=args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
)

[Screenshot from 2022-12-16: trainer log output]

These models are pretrained on different data. They also have different vocabularies: mT5's is about 8 times as large as T5's (roughly 250k tokens vs. 32k). How are you tokenizing? You might need to tokenize differently for each model and feed the data to the matching one … Maybe this is the issue?


Hey Andrei, thank you for getting back to me. I actually found the solution this morning: apparently mT5 has some problems with fp16. I had to set it to False, and now it is working.
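For anyone hitting the same thing, the change is just the precision flag in the training arguments. A sketch (same settings as above otherwise; the `bf16` alternative is an assumption on my part and needs an Ampere-or-newer GPU plus a recent transformers version):

```python
args = Seq2SeqTrainingArguments(
    model_dir,
    # ... same settings as in the original post ...
    fp16=False,   # fp16 overflows with mT5 / FLAN-T5 checkpoints
    # bf16=True,  # possible alternative: bfloat16 has fp32's dynamic range
    predict_with_generate=True,
    load_best_model_at_end=True,
)
```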

Oh, gotcha. This shouldn't be an issue in the latest version, no? I saw some threads on this, but they seem to be from 2021 …

Yes, I read that too. Apparently it is still an issue? I also had the problem with flan-t5, which was only released a few weeks ago, so that's really weird…
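That would fit with the root cause being numeric rather than a bug tied to any model's age: float16 tops out at 65504, and if a model's activations exceed that, they overflow to inf, and arithmetic on infinities then yields nan, which propagates into the loss. A minimal NumPy sketch of the mechanism (70000 is just an arbitrary value above the fp16 limit, not an actual mT5 activation):

```python
import numpy as np

# Largest finite value representable in IEEE float16
print(np.finfo(np.float16).max)  # 65504.0

# A value larger than that overflows to infinity in fp16 ...
x = np.float16(70000.0)
print(x)  # inf

# ... and arithmetic on infinities produces nan,
# which then propagates into the reported loss.
print(x - x)  # nan
```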