When does `torch.jit.fork` create a new CUDA stream?

I expected the two `interaction` calls below to run on two separate CUDA streams, but nsys shows them executing serially. How is `torch.jit.fork` implemented, and under what conditions does it create a second CUDA stream to execute kernels?

totally_local_future = torch.jit.fork(self.interaction, data_local)
data_ghost = self.interaction(data_ghost)
data_local = torch.jit.wait(totally_local_future)
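
For comparison, here is a minimal sketch of requesting the overlap explicitly with CUDA streams instead of relying on `torch.jit.fork`. The `interaction` function is a hypothetical stand-in for `self.interaction`, and `run_overlapped` is an illustrative helper name, not part of any API. Even with two streams, the kernels only overlap if the GPU has free resources, so this is not guaranteed to change what nsys shows.

```python
import torch


def interaction(x: torch.Tensor) -> torch.Tensor:
    # Hypothetical stand-in for self.interaction in the snippet above.
    return torch.relu(x @ x)


def run_overlapped(data_local: torch.Tensor, data_ghost: torch.Tensor):
    if not data_local.is_cuda:
        # No GPU available: fall back to plain serial execution.
        return interaction(data_local), interaction(data_ghost)

    main = torch.cuda.current_stream()
    side = torch.cuda.Stream()

    # The side stream must not start before the inputs, which were
    # produced on the main stream, are ready.
    side.wait_stream(main)
    with torch.cuda.stream(side):
        out_local = interaction(data_local)

    # This call is enqueued on the main stream and may overlap with
    # the work already queued on the side stream.
    out_ghost = interaction(data_ghost)

    # Make the main stream wait for the side stream before using out_local.
    main.wait_stream(side)
    # Tell the caching allocator that out_local is now used on the main stream.
    out_local.record_stream(main)
    return out_local, out_ghost


device = "cuda" if torch.cuda.is_available() else "cpu"
data_local = torch.randn(64, 64, device=device)
data_ghost = torch.randn(64, 64, device=device)
out_local, out_ghost = run_overlapped(data_local, data_ghost)
```

Profiling this version next to the `torch.jit.fork` version should make it easier to tell whether the serial execution comes from everything landing on one stream or from the kernels themselves saturating the device.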