[CUDA RPC] Incorrect results when transferring a GPU tensor via RPC while other GPU programs run in parallel

Issue Summary

I transfer a GPU tensor between two GPUs using PyTorch RPC, and the received tensor is incorrect whenever another GPU program runs in parallel, for example a pure PyTorch computation task running in the background. Once the other GPU task is stopped, the results become correct again.
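A minimal sketch of the transfer path I am describing is below. This is illustrative, not my exact code: the worker names, tensor size, and port are placeholders, and it assumes two CUDA devices with the TensorPipe backend's device map feature; it degrades to a no-op message when the hardware or PyTorch is unavailable.

```python
import os

# Guarded import so the sketch is a no-op on machines without PyTorch.
try:
    import torch
    import torch.distributed.rpc as rpc
    import torch.multiprocessing as mp
    HAVE_TORCH = True
except ImportError:
    HAVE_TORCH = False


def remote_sum(t):
    # Runs on worker1; with the device map below, the tensor arrives
    # directly on cuda:1.
    return t.sum().cpu()


def run(rank, world_size):
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"  # placeholder port
    opts = rpc.TensorPipeRpcBackendOptions()
    if rank == 0:
        # Map cuda:0 on worker0 to cuda:1 on worker1 so the tensor is
        # sent device-to-device instead of staged through the CPU.
        opts.set_device_map("worker1", {0: 1})
    rpc.init_rpc(f"worker{rank}", rank=rank, world_size=world_size,
                 rpc_backend_options=opts)
    if rank == 0:
        x = torch.randn(1000, device="cuda:0")
        expected = x.sum().cpu()
        got = rpc.rpc_sync("worker1", remote_sum, args=(x,))
        # In my runs, this comparison fails intermittently while another
        # GPU program is active, and passes once it is stopped.
        print("match:", torch.allclose(expected, got))
    rpc.shutdown()


def main():
    if not HAVE_TORCH:
        print("skipping: torch not installed")
        return
    if not torch.cuda.is_available() or torch.cuda.device_count() < 2:
        print("skipping: need at least 2 CUDA devices")
        return
    mp.spawn(run, args=(2,), nprocs=2, join=True)


if __name__ == "__main__":
    main()
```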

This suggests that other GPU programs can interfere with RPC's CUDA support, corrupting the transferred tensors.

Could anyone provide insight or guidance on how to prevent other GPU programs from interfering with RPC's CUDA support and ensure correct message transfers? Any help or suggestions would be greatly appreciated.

More details to reproduce my results can be found in the Issue.