I expected two CUDA streams to execute these two interaction calls,
but nsys shows that the two calls execute serially.
How is torch.jit.fork implemented, and under what conditions
does it create two CUDA streams to execute the kernels?
totally_local_future = torch.jit.fork(self.interaction, data_local)
data_ghost = self.interaction(data_ghost)
data_local = torch.jit.wait(totally_local_future)
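
For comparison, the overlap I am hoping to see can be sketched with explicit CUDA streams, which gives a baseline to check against in nsys. This is only a minimal sketch, assuming eager mode and a callable that returns a single tensor. `run_on_two_streams` is a hypothetical helper written for this illustration; it is not part of my model or of torch.jit.

```python
import torch


def run_on_two_streams(fn, x_local, x_ghost):
    """Apply fn to two inputs, each on its own CUDA stream when CUDA is available.

    Hypothetical helper: assumes fn returns a single tensor.
    """
    if not torch.cuda.is_available():
        # No GPU: just run the two calls one after the other.
        return fn(x_local), fn(x_ghost)

    current = torch.cuda.current_stream()
    s_local = torch.cuda.Stream()
    s_ghost = torch.cuda.Stream()

    # The side streams must not start before the work that produced the
    # inputs (queued on the current stream) has finished.
    s_local.wait_stream(current)
    s_ghost.wait_stream(current)

    with torch.cuda.stream(s_local):
        out_local = fn(x_local)
    with torch.cuda.stream(s_ghost):
        out_ghost = fn(x_ghost)

    # Later work on the current stream must wait for both side streams.
    current.wait_stream(s_local)
    current.wait_stream(s_ghost)

    # The outputs were allocated on the side streams but will be used on
    # the current stream, so tell the caching allocator about that.
    out_local.record_stream(current)
    out_ghost.record_stream(current)
    return out_local, out_ghost
```

In my case the call would look like `data_local, data_ghost = run_on_two_streams(self.interaction, data_local, data_ghost)`. Whether the kernels actually overlap still depends on whether each one leaves enough of the GPU free for the other.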