When I train the RNN-T on my local machine without CUDA, everything works. However, when I use CUDA, I get this error while computing the loss:
NotImplementedError: Could not run 'torchaudio::rnnt_loss' with arguments from the 'CUDA' backend. This could be because the operator doesn't exist for this backend, or was omitted during the selective/custom build process (if using custom build). If you are a Facebook employee using PyTorch on mobile, please visit Internal Login for possible resolutions. 'torchaudio::rnnt_loss' is only available for these backends: [CPU, BackendSelect, Python, FuncTorchDynamicLayerBackMode, Functionalize, Named, Conjugate, Negative, ZeroTensor, ADInplaceOrView, AutogradOther, AutogradCPU, AutogradCUDA, AutogradHIP, AutogradXLA, AutogradMPS, AutogradIPU, AutogradXPU, AutogradHPU, AutogradVE, AutogradLazy, AutogradMeta, AutogradPrivateUse1, AutogradPrivateUse2, AutogradPrivateUse3, AutogradNestedTensor, Tracer, AutocastCPU, AutocastCUDA, FuncTorchBatched, FuncTorchVmapMode, Batched, VmapMode, FuncTorchGradWrapper, PythonTLSSnapshot, FuncTorchDynamicLayerFrontMode, PythonDispatcher].
I've seen that some people faced the same issue with torchvision, and the solution was to compute the loss on the CPU instead of the GPU. So instead of using:

rnnt_loss = torchaudio.functional.rnnt_loss(
    logits,
    transcript,
    output_len,
    transcript_len,
)
I'm doing it as:

rnnt_loss = torchaudio.functional.rnnt_loss(
    torch.from_numpy(logits.detach().cpu().numpy()),
    torch.from_numpy(transcript.detach().cpu().numpy()),
    torch.from_numpy(output_len.detach().cpu().numpy()),
    torch.from_numpy(transcript_len.detach().cpu().numpy()),
)
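One thing worth noting about this workaround: the `detach().cpu().numpy()` round-trip cuts the tensors out of the autograd graph, so no gradients can flow back to the model parameters, whereas a plain `.cpu()` keeps the graph intact. A minimal sketch of the difference (the `logits` tensor here is just an illustrative stand-in for the model output, not the actual model in the post):

```python
import torch

# Illustrative tensor standing in for the RNN-T output.
logits = torch.randn(4, 10, 5, requires_grad=True)

# The numpy round-trip detaches the tensor: the backward pass can no
# longer reach the parameters that produced it.
detached = torch.from_numpy(logits.detach().cpu().numpy())
print(detached.requires_grad)  # False -- the autograd graph is cut

# Moving with .cpu() keeps the tensor in the autograd graph, so
# gradients still flow back through the copy to the original device.
on_cpu = logits.cpu()
print(on_cpu.requires_grad)  # True -- gradients still flow
```

If that holds for your setup, passing `logits.cpu()`, `transcript.cpu()`, etc. directly to `rnnt_loss` should achieve the same CPU fallback while keeping backpropagation working.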
This solution worked on my local machine using CUDA with a single GPU. However, when I run it on an instance with multiple GPUs, I get this error:
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument find_unused_parameters=True to torch.nn.parallel.DistributedDataParallel, and by making sure all forward function outputs participate in calculating loss. If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's forward function. Please include the loss function and the structure of the return value of forward of your module when reporting this issue (e.g. list, dict, iterable).
To try to find the "unused parameters", I set find_unused_parameters=True as suggested in the error. However, that gave me no useful information; all I got was a warning saying that everything seems to be OK and that I don't need to set the flag:
[W reducer.cpp:1298] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
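For context, this is where the flag goes; a minimal runnable sketch, assuming a stand-in `Linear` model and a single-process "gloo" group (in a real multi-GPU run the launcher sets up the process group and you would pass `device_ids`):

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Single-process CPU "gloo" group just to make the sketch runnable;
# a real multi-GPU job gets rank/world_size from the launcher instead.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29531")
dist.init_process_group("gloo", rank=0, world_size=1)

model = torch.nn.Linear(8, 4)  # stand-in for the RNN-T model
ddp_model = DDP(model, find_unused_parameters=True)  # the flag from the error

out = ddp_model(torch.randn(2, 8))
out.sum().backward()  # every parameter participates, so DDP has nothing to flag

dist.destroy_process_group()
```

The warning above suggests the flag itself is not the problem here: DDP sees all parameters used in forward, yet the reducer still never finishes, which is consistent with the detached loss never sending gradients back to the model.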
The ideal solution would be to run the training on the distributed GPUs, but I don't know how. Any help or comments would be highly appreciated. Thank you.