MPI Backend with GPU support

Did you hit any errors when using the CUDA-aware MPI backend? Based on past discussions, you might need to synchronize CUDA streams in your application code when using CUDA-aware MPI. BTW, is MPI the only option for you, or would the Gloo backend work?
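For reference, here is a rough sketch of what that synchronization could look like. It assumes PyTorch's `torch.distributed` package; the helper names `choose_backend` and `synced_all_reduce` are made up for illustration and are not part of PyTorch. The idea is to wait for the kernels queued on the tensor's current CUDA stream before handing the buffer to the collective, and to fall back to Gloo if your PyTorch build does not include MPI.

```python
import torch
import torch.distributed as dist


def choose_backend() -> str:
    """Hypothetical helper: use MPI if this PyTorch build has it, else Gloo."""
    return "mpi" if dist.is_mpi_available() else "gloo"


def synced_all_reduce(tensor: torch.Tensor) -> torch.Tensor:
    """Hypothetical helper: sum-all-reduce `tensor` in place across ranks.

    For GPU tensors, first block until the work queued on the current
    stream of the tensor's device has finished, so the buffer is fully
    written before the communication library reads it.
    """
    if tensor.is_cuda:
        torch.cuda.current_stream(tensor.device).synchronize()
    dist.all_reduce(tensor, op=dist.ReduceOp.SUM)
    return tensor
```

To try it, you would call `dist.init_process_group(backend=choose_backend())` once per process (launched with `mpirun` for MPI, or with `torchrun` for Gloo so that the rank and world size environment variables are set), and then call `synced_all_reduce` on your tensors. Whether the explicit synchronize is actually required probably depends on your MPI build, so treat it as something to test rather than a guaranteed fix.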