I found out that if I pin the memory after calling share_memory_() inside a process, the memory is no longer shared, because pin_memory() makes a copy. However, if I pin the memory in advance, it gets associated with only one specific CUDA device (device 0 by default). At the same time, you have to call share_memory_() before using mp.spawn, so it seems like a dead end? Here is a minimal sketch of what I mean (see below).
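To make the ordering issue concrete, this is roughly what I am doing (the worker name and tensor shape are just placeholders, and it assumes CUDA is available in each spawned process):

```python
import torch
import torch.multiprocessing as mp

def worker(rank, shared_t):
    # pin_memory() returns a page-locked *copy*, so the result is no longer
    # backed by the shared-memory storage created in the main process.
    pinned = shared_t.pin_memory()
    print(rank, shared_t.is_shared(), pinned.is_shared(), pinned.is_pinned())
    # -> e.g. "0 True False True": the pinned copy is not shared anymore

if __name__ == "__main__":
    t = torch.zeros(1000)
    t.share_memory_()                      # has to happen before mp.spawn
    mp.spawn(worker, args=(t,), nprocs=2)
```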
Is there a way to put a tensor in shared memory and pin it at the same time, so that it works across multiple CUDA devices?