Why does the following give me a concurrency error?

    with torch.cuda.stream(s):
        w = sample_indices.cpu().numpy()
    q = action_probs.detach().cpu().numpy()
    s.synchronize()

Later, when I try to access w, I get an out-of-bounds error, which does not happen if I do both .cpu() transfers on the same stream. Looking at the values in w, they are clearly corrupted. Is the transfer to host not being synchronized for some reason?

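I forgot to make s wait on the default stream. sample_indices is computed on the default stream, so the copy issued on s can start before that work has finished and read incomplete data. Calling s.wait_stream(torch.cuda.default_stream()) before the transfer fixes it.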
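
For anyone hitting the same thing, here is a minimal sketch of the fixed pattern. The helper name to_numpy_on_side_stream and the example tensor are hypothetical, made up for illustration. It waits on torch.cuda.current_stream() rather than the default stream, on the assumption that the tensor was produced on whatever stream is current; swap in torch.cuda.default_stream() if that matches your setup better. The CPU branch is only there so the snippet also runs on a machine without a GPU.

```python
import torch


def to_numpy_on_side_stream(tensor, side_stream=None):
    """Copy `tensor` to host as a NumPy array via a side CUDA stream.

    Hypothetical helper for illustration, not part of PyTorch.
    """
    if not tensor.is_cuda:
        # No GPU involved: nothing to synchronize, just convert.
        return tensor.detach().cpu().numpy()

    if side_stream is None:
        side_stream = torch.cuda.Stream(device=tensor.device)

    # Assumption: `tensor` was produced on the current stream of its device.
    # Make the side stream wait for everything already queued there, so the
    # copy cannot start before the producing kernels have finished.
    side_stream.wait_stream(torch.cuda.current_stream(tensor.device))

    with torch.cuda.stream(side_stream):
        host = tensor.detach().cpu().numpy()

    # Ensure the copy issued on the side stream is complete before returning.
    side_stream.synchronize()
    return host


# Example usage with made-up data standing in for sample_indices.
device = "cuda" if torch.cuda.is_available() else "cpu"
sample_indices = torch.arange(1000, device=device) % 10
w = to_numpy_on_side_stream(sample_indices)
```

The important part is that wait_stream is called before the copy is enqueued on s. Calling s.synchronize() afterwards only waits for work on s and says nothing about whether the default stream had finished producing the data.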