Hi,
A) Does the GPU-to-CPU copy (with pinned memory) run on stream s2 or on the default stream (stream 0)?
line1 s2.wait_stream(torch.cuda.current_stream())  # issued before entering the context, where current_stream() would already be s2
line2 with torch.cuda.stream(s2):
line3     Tensor.to(device="cpu")  # blocked here
.......
B)
line1 s2.wait_stream(torch.cuda.current_stream())  # issued before entering the context, where current_stream() would already be s2
line2 with torch.cuda.stream(s2):
line3     Tensor = ...
line4     Tensor.to(device="cpu", non_blocking=True)  # <-
line5     TensorB = Tensor...
Does Tensor.to use stream s2 in this case? Should we insert a stream synchronization before or after line4?
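
To make case B concrete, here is a minimal sketch of the pattern I have in mind. It is not a confirmed answer. It assumes that a copy issued inside `torch.cuda.stream(s2)` is enqueued on s2, and that the host has to wait on s2 after the `non_blocking=True` copy before reading the CPU result. The helper name `copy_to_cpu_on_side_stream` is made up for illustration, and the CPU-only branch is there only so the snippet also runs on a machine without a GPU.

```python
import torch


def copy_to_cpu_on_side_stream(src):
    """Hypothetical helper: copy `src` to the CPU on a side stream."""
    if not src.is_cuda:
        # No GPU tensor involved, so there is nothing asynchronous to wait for.
        return src.clone()

    s2 = torch.cuda.Stream()
    # Make s2 wait for work already queued on the producing stream.
    # This is called before entering the context, because inside it
    # torch.cuda.current_stream() returns s2 itself.
    s2.wait_stream(torch.cuda.current_stream())

    with torch.cuda.stream(s2):
        # Assumption: s2 is the current stream here, so the copy is queued on s2.
        out = src.to(device="cpu", non_blocking=True)

    # Tell the caching allocator that src is also in use on s2.
    src.record_stream(s2)
    # Assumption: the host must wait here, after the copy was issued,
    # before it is safe to read `out`.
    s2.synchronize()
    return out


device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.arange(8, dtype=torch.float32, device=device) * 2
host = copy_to_cpu_on_side_stream(x)
```

If the synchronization were placed before the copy instead, it would only wait for earlier work on s2 and not for the copy itself, which is why the sketch puts it after.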