Parallel grad on different CUDA streams

When I do this:

    streams = [torch.cuda.Stream() for _ in range(elemsize)]
    # Make every side stream wait for work already queued on the current stream.
    for ii in range(elemsize):
        streams[ii].wait_stream(torch.cuda.current_stream())
    for ii in range(elemsize):
        # Convert the flat index ii into a multi-dimensional index loc,
        # using accu as the per-dimension strides for shape shp.
        loc = [0 for jj in range(len(shp))]
        n = ii
        for jj in range(len(shp)):
            loc[jj] = n // accu[jj]
            n = n % accu[jj]
        # Queue each grad computation and the copy into NormMat on its own stream.
        with torch.cuda.stream(streams[ii]):
            print(ii)
            NormMat[(...,) + tuple(loc)] = torch.autograd.grad(
                Norm[(0, 0)][tuple(loc)], B_grad,
                create_graph=False, retain_graph=True)[0]
            print("end", ii)

I want to parallelize the slicing and copy operations on the GPU.
However, this code runs sequentially.
I'm not sure how to solve this; it seems like there is some synchronization inside autograd…
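
Here is a minimal, self-contained version of the pattern I'm trying (x and y below are just made-up placeholder tensors, not my actual Norm / B_grad):

    import torch

    # Placeholder problem: 8 scalar outputs, each differentiated on its own stream.
    x = torch.randn(8, 8, device="cuda", requires_grad=True)
    y = (x ** 3).sum(dim=1)

    streams = [torch.cuda.Stream() for _ in range(y.numel())]
    for s in streams:
        s.wait_stream(torch.cuda.current_stream())

    grads = [None] * y.numel()
    for i, s in enumerate(streams):
        with torch.cuda.stream(s):
            # retain_graph=True because the same graph is differentiated repeatedly.
            grads[i] = torch.autograd.grad(y[i], x, retain_graph=True)[0]

    torch.cuda.synchronize()  # wait for all streams before reading grads

Whether the per-stream grad calls actually overlap on the device presumably depends on the kernel sizes and on whatever synchronization autograd does internally.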

Thanks!

Actually, it seems like it works…