I am fine-tuning wav2vec2 on my own data on a K80, which does not support fp16. When I trained with the fp16 flag anyway, the loss scale dropped to 0.0001 and training aborted with:
FloatingPointError: Minimum loss scale reached (0.0001). Your loss is probably exploding. Try lowering the learning rate, using gradient clipping or increasing the batch size.
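For context on why the fp16 run aborts: a dynamic loss scaler halves the scale every time it sees inf/NaN gradients and gives up once the scale falls below a floor, so a sustained run of overflowing gradients produces exactly this error. Below is a rough, simplified sketch of that behaviour (not fairseq's actual implementation; the class name and defaults are illustrative):

```python
# Simplified sketch of dynamic loss scaling (illustrative, not fairseq's code).
class DynamicLossScaler:
    def __init__(self, init_scale=128.0, scale_factor=2.0, min_scale=1e-4):
        self.scale = init_scale          # current loss scale
        self.scale_factor = scale_factor # factor to shrink by on overflow
        self.min_scale = min_scale       # floor; below this, training aborts

    def update(self, overflow_detected):
        """Shrink the scale after an inf/NaN gradient; abort at the floor."""
        if overflow_detected:
            self.scale /= self.scale_factor
            if self.scale < self.min_scale:
                raise FloatingPointError(
                    f"Minimum loss scale reached ({self.min_scale}). "
                    "Your loss is probably exploding."
                )

# Repeated overflows keep halving the scale until the floor is hit.
scaler = DynamicLossScaler(init_scale=128.0)
for step in range(25):
    try:
        scaler.update(overflow_detected=True)
    except FloatingPointError as e:
        print(f"aborted at step {step}: {e}")
        break
```

The point is that the error is a symptom, not the cause: the gradients were already overflowing for many consecutive steps before the scaler gave up.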
I then switched to fp32, but this time the loss became NaN. Training log:
/data/fairseq/fairseq/utils.py:306: UserWarning: amp_C fused kernels unavailable, disabling multi_tensor_l2norm; you may get better performance by installing NVIDIA's apex library
"amp_C fused kernels unavailable, disabling multi_tensor_l2norm; "
/data/fairseq/fairseq/utils.py:306: UserWarning: amp_C fused kernels unavailable, disabling multi_tensor_l2norm; you may get better performance by installing NVIDIA's apex library
2020-10-01 10:25:50 | WARNING | root | NaN or Inf found in input tensor.
2020-10-01 10:25:50 | WARNING | root | NaN or Inf found in input tensor.
2020-10-01 10:25:50 | WARNING | root | NaN or Inf found in input tensor.
2020-10-01 10:25:50 | INFO | train | {"epoch": 73, "train_loss": "nan", "train_ntokens": "15558.8", "train_nsentences": "427.098", "train_nll_loss": "nan", "train_wps": "2443", "train_ups": "0.16", "train_wpb": "15558.8", "train_bsz": "427.1", "train_num_updates": "9291", "train_lr": "1e-08", "train_gnorm": "nan", "train_loss_scale": null, "train_train_wall": "867", "train_wall": "0"}
2020-10-01 10:25:50 | INFO | fairseq.trainer | begin training epoch 74
2020-10-01 10:40:41 | INFO | fairseq_cli.train | begin validation on "valid" subset
2020-10-01 10:42:40 | WARNING | root | NaN or Inf found in input tensor.
2020-10-01 10:42:40 | WARNING | root | NaN or Inf found in input tensor.
2020-10-01 10:42:40 | INFO | valid | {"epoch": 74, "valid_loss": "nan", "valid_ntokens": "2570.42", "valid_nsentences": "71.4286", "valid_nll_loss": "nan", "valid_uer": "100", "valid_wer": "100", "valid_raw_wer": "100", "valid_wps": "2780.3", "valid_wpb": "2570.4", "valid_bsz": "71.4", "valid_num_updates": "9454", "valid_best_wer": "100"}
Any suggestions on how to overcome this?