Hey, I am working on speech enhancement. Training ran fine for the first 9 epochs, but at the 10th epoch it failed with the error "Input waveforms contain NaN or Inf values". I used a learning rate of 0.0001 and tried changing it, but the error still occurs.
Epoch 10 | Train Loss: 0.001552 | SmoothL1: 0.001329 | ERB_spec_loss: 0.010675
[rank0]: Traceback (most recent call last):
[rank0]:   File “/data/aman/scripts/attenuate/attenuate_ddp_train.py”, line 283, in <module>
[rank0]:
[rank0]:   File “/data/aman/envs/attenuate/lib/python3.12/site-packages/torch/nn/modules/module.py”, line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File “/data/aman/envs/attenuate/lib/python3.12/site-packages/torch/nn/modules/module.py”, line 1541, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File “/data/aman/scripts/attenuate/loss.py”, line 344, in forward
[rank0]:     raise ValueError(“Input waveforms contain NaN or Inf values”)
[rank0]: ValueError: Input waveforms contain NaN or Inf values
[rank1]: Traceback (most recent call last):
[rank1]:   File “/data/aman/scripts/attenuate/attenuate_ddp_train.py”, line 283, in <module>
[rank1]:
[rank1]:   File “/data/aman/envs/attenuate/lib/python3.12/site-packages/torch/nn/modules/module.py”, line 1532, in _wrapped_call_impl
[rank1]:     return self._call_impl(*args, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File “/data/aman/envs/attenuate/lib/python3.12/site-packages/torch/nn/modules/module.py”, line 1541, in _call_impl
[rank1]:     return forward_call(*args, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File “/data/aman/scripts/attenuate/loss.py”, line 344, in forward
[rank1]:     raise ValueError(“Input waveforms contain NaN or Inf values”)
[rank1]: ValueError: Input waveforms contain NaN or Inf values
E1019 03:37:32.660000 124960817152128 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 0 (pid: 375841) of binary: /data/aman/envs/attenuate/bin/python
Traceback (most recent call last):
File “<frozen runpy>”, line 198, in _run_module_as_main
File “<frozen runpy>”, line 88, in _run_code
File “/data/aman/envs/attenuate/lib/python3.12/site-packages/torch/distributed/run.py”, line 883, in <module>
main()
File “/data/aman/envs/attenuate/lib/python3.12/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py”, line 347, in wrapper
return f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File “/data/aman/envs/attenuate/lib/python3.12/site-packages/torch/distributed/run.py”, line 879, in main
run(args)
File “/data/aman/envs/attenuate/lib/python3.12/site-packages/torch/distributed/run.py”, line 870, in run
elastic_launch(
File “/data/aman/envs/attenuate/lib/python3.12/site-packages/torch/distributed/launcher/api.py”, line 132, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File “/data/aman/envs/attenuate/lib/python3.12/site-packages/torch/distributed/launcher/api.py”, line 263, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: