Gradient is None initially

Hi, I'm training VITS TTS. At some point my losses go to NaN.
I set torch.autograd.set_detect_anomaly(True) and it returned this error:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/torch/multiprocessing/", line 75, in _wrap
    fn(i, *args)
  File "/content/vits/vits/", line 133, in run
    train_and_evaluate(rank, epoch, hps, [net_g, net_d], [optim_g, optim_d], [scheduler_g, scheduler_d], scaler,
  File "/content/vits/vits/", line 234, in train_and_evaluate
  File "/usr/local/lib/python3.10/dist-packages/torch/", line 525, in backward
  File "/usr/local/lib/python3.10/dist-packages/torch/autograd/", line 267, in backward
  File "/usr/local/lib/python3.10/dist-packages/torch/autograd/", line 744, in _engine_run_backward
    return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: Function 'ConvolutionBackward0' returned nan values in its 0th output.

Is it okay that my model has NaN gradients initially? What should I do?

Anomaly detection stopped training at the 0th iteration.
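(For reference, the actual API for this is torch.autograd.set_detect_anomaly. A minimal sketch of how it surfaces a NaN gradient, using a deliberately bad sqrt(-1) as a stand-in for whatever your model is doing:)

```python
import torch

# With anomaly detection on, backward() raises a RuntimeError naming the
# forward op whose backward produced NaN (at a noticeable speed cost).
torch.autograd.set_detect_anomaly(True)

x = torch.tensor([-1.0], requires_grad=True)
y = torch.sqrt(x)  # forward already yields NaN, so its gradient is NaN too
try:
    y.backward()
except RuntimeError as e:
    print(e)  # e.g. "Function 'SqrtBackward0' returned nan values in its 0th output."
```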

Hi Tornike!

No. Once some of your gradients become nan, your optimization step will
cause the associated weights to become nan, causing more gradients to
become nan, and so on.

Get rid of the nans. First check that your input data doesn’t contain any
nans or infs (or other outlandish values).
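One cheap way to do that check is to run a few batches through a small audit helper before training. This is just an illustrative sketch (the function name `audit_tensor` is made up, not part of any library):

```python
import torch

def audit_tensor(t):
    """Return basic sanity stats for an input tensor: NaN/Inf flags and range."""
    return {
        "has_nan": bool(torch.isnan(t).any()),
        "has_inf": bool(torch.isinf(t).any()),
        "min": t.min().item(),
        "max": t.max().item(),
    }

# e.g., loop over a few batches of your DataLoader and print the stats:
# for wav, spec, text_ids in loader:
#     print("wav:", audit_tensor(wav.float()))
#     print("spec:", audit_tensor(spec))
```

Besides NaN/Inf, also eyeball the min/max for outlandish values (e.g. spectrogram frames that are orders of magnitude off).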

It’s unlikely, but also verify that your model’s weights aren’t somehow
being initialized with nans or infs.
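A one-off scan of the freshly built model rules this out quickly. A sketch (again, `first_bad_param` is just an illustrative helper name):

```python
import torch

def first_bad_param(model):
    """Return the name of the first parameter containing NaN/Inf, or None."""
    for name, p in model.named_parameters():
        if not torch.isfinite(p).all():
            return name
    return None

# right after constructing net_g / net_d:
# assert first_bad_param(net_g) is None, first_bad_param(net_g)
```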

Then try training with plain-vanilla SGD with a very low learning rate.
Sometimes training is unstable right at the beginning, after which you
can turn the learning rate up to a more practical value (or switch to a
“less stable” optimizer such as Adam). The same goes for momentum.
Even if you want to use a largish value for momentum such as 0.90 or
0.95, you sometimes have to start training with no (or small) momentum.
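A rough sketch of that warm-up pattern (the model and the learning rates here are placeholders, not values tuned for VITS):

```python
import torch

model = torch.nn.Linear(10, 1)  # stand-in for your generator

# Warm up with plain SGD: tiny learning rate, no momentum.
opt = torch.optim.SGD(model.parameters(), lr=1e-5, momentum=0.0)

# ... run the first few hundred / thousand steps with opt ...

# Once losses stay finite and stable, either raise lr / momentum on the
# existing param groups, or switch to a "less stable" optimizer like Adam:
opt = torch.optim.Adam(model.parameters(), lr=2e-4)
```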

(My intuition about why this happens is that the randomly-initialized weights
of your model can be out of kilter – imagine them sitting on the side of a
steep valley on the loss surface – leading to very large gradients that can
cause you to jump to an even worse location and finally to inf or nan
gradients. Training very slowly, i.e., small learning rate which means
small optimizer step size, allows your weights to move down the loss
surface, rather than, say, jump to the opposite side of the valley, to a
more “sane” location that has more moderate gradients, at which point
you can manage to increase the learning rate.)


K. Frank

Hi Frank,

Thank you for your valuable reply.

I decided to go with fp32 and it worked; it did not return any NaNs.

But it could not learn. After 48k steps my TTS model has a robotic voice.

The losses almost did not decrease after 5-10k steps, except the feature-matching loss.

I’m training VITS TTS.

any suggestions?