`NotImplementedError` when using `torchaudio::rnnt_loss` on CUDA

When I train the RNN-T model on my local machine without CUDA, everything works. However, when I use CUDA, I get this error while computing the loss:

NotImplementedError: Could not run ‘torchaudio::rnnt_loss’ with arguments from the ‘CUDA’ backend. This could be because the operator doesn’t exist for this backend, or was omitted during the selective/custom build process (if using custom build). If you are a Facebook employee using PyTorch on mobile, please visit Internal Login for possible resolutions. ‘torchaudio::rnnt_loss’ is only available for these backends: [CPU, BackendSelect, Python, FuncTorchDynamicLayerBackMode, Functionalize, Named, Conjugate, Negative, ZeroTensor, ADInplaceOrView, AutogradOther, AutogradCPU, AutogradCUDA, AutogradHIP, AutogradXLA, AutogradMPS, AutogradIPU, AutogradXPU, AutogradHPU, AutogradVE, AutogradLazy, AutogradMeta, AutogradPrivateUse1, AutogradPrivateUse2, AutogradPrivateUse3, AutogradNestedTensor, Tracer, AutocastCPU, AutocastCUDA, FuncTorchBatched, FuncTorchVmapMode, Batched, VmapMode, FuncTorchGradWrapper, PythonTLSSnapshot, FuncTorchDynamicLayerFrontMode, PythonDispatcher].

I’ve seen that some people faced a similar issue with torchvision, and the solution there was to compute the loss on the CPU instead of the GPU. So instead of calling rnnt_loss as:

        rnnt_loss = torchaudio.functional.rnnt_loss(
            logits,
            transcript,
            output_len,
            transcript_len,
        )

I’m doing it as:

        rnnt_loss = torchaudio.functional.rnnt_loss(
            torch.from_numpy(logits.detach().cpu().numpy()),
            torch.from_numpy(transcript.detach().cpu().numpy()),
            torch.from_numpy(output_len.detach().cpu().numpy()),
            torch.from_numpy(transcript_len.detach().cpu().numpy()),
        ) 
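As a side note, I believe the same CPU fallback can be written without the NumPy round-trip; a minimal sketch (which I haven’t verified against the DDP error below) would be something like the following, where dropping detach() keeps the loss in the autograd graph:

        # same CPU fallback without going through NumPy; .cpu() keeps the
        # tensors in the autograd graph, whereas .detach() cuts gradient flow
        rnnt_loss = torchaudio.functional.rnnt_loss(
            logits.cpu(),
            transcript.cpu(),
            output_len.cpu(),
            transcript_len.cpu(),
        )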

This solution worked on my local machine using CUDA and just one GPU. However, when I run it on an instance with multiple GPUs, I get this RuntimeError:

RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument find_unused_parameters=True to torch.nn.parallel.DistributedDataParallel, and by making sure all forward function outputs participate in calculating loss. If you already have done the above, then the distributed data parallel module wasn’t able to locate the output tensors in the return value of your module’s forward function. Please include the loss function and the structure of the return value of forward of your module when reporting this issue (e.g. list, dict, iterable).

To try to find the “unused parameters”, I set find_unused_parameters=True as suggested in the error. However, this gave me no useful information; all I got was a warning message saying that everything seems to be fine and that I don’t need find_unused_parameters=True:

[W reducer.cpp:1298] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
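For reference, this is roughly how I wrap the model with DDP; the names model and local_rank are just placeholders from my own setup, shown only to illustrate where the flag goes:

    # wrap the model with DistributedDataParallel and enable
    # unused-parameter detection as suggested by the error message
    model = torch.nn.parallel.DistributedDataParallel(
        model,
        device_ids=[local_rank],
        find_unused_parameters=True,
    )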

The ideal solution would be to run rnnt_loss on distributed GPUs, but I don’t know how. Any help or comments would be highly appreciated. Thank you.

Just to let you know, the problem was solved after updating torchaudio from 1.0.1+cpu (a CPU-only build, which is why the CUDA backend for rnnt_loss was missing) to 1.0.2. Hope this helps.
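In case it’s useful for anyone else hitting this, a quick way to check whether the installed build is CPU-only is something like:

    import torch
    import torchaudio

    # a CPU-only wheel typically shows a "+cpu" suffix in the version string,
    # and torch.version.cuda is None when no CUDA support was built in
    print(torchaudio.__version__)
    print(torch.__version__, torch.version.cuda, torch.cuda.is_available())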