Should we set non_blocking to True?

I'm facing a similar issue.

It looks like setting non_blocking=True when going from GPU to CPU does not make much sense if you intend to use the data right away, because it is not safe.
In the other direction, the CUDA kernel will wait for the transfer to finish before it starts computing on the GPU.
But when going from GPU to CPU, it is the CPU that does the computation, and it does not seem to be aware of the transfer. The tensor is created on the CPU, probably zero-filled, while the transfer has not finished yet. As far as the CPU is concerned, the tensor is already there, so it starts computing... with the wrong values. The CPU only knows that the transfer is done when it explicitly asks CUDA, for instance with torch.cuda.synchronize().
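For what it's worth, a minimal sketch of that synchronization pattern (tensor names and sizes are made up, not from any particular code):

        import torch

        x = torch.randn(1000, device="cuda")  # some CUDA tensor (made-up example)

        y = x.to("cpu", non_blocking=True)    # copy is queued on the current CUDA stream
        torch.cuda.synchronize()              # block until all queued work, including the copy, is done
        print(y.sum())                        # safe: y now holds the transferred values

Of course, the synchronize() call blocks the CPU, which is exactly what non_blocking=True was trying to avoid.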

@ptrblck any insights on how to make a GPU-to-CPU transfer safe while keeping it fast, i.e. with non_blocking=True? Thanks.

Reading other posts, it seems that copying from GPU to CPU with non_blocking=True can be a huge risk, unless you plan to use the tensors long after the call to transfer, by which time the copy is expected to have finished. The same applies when going from CPU to GPU, except that in that case CUDA will block the GPU from using the data if it is not ready yet, as mentioned elsewhere in this thread. An asynchronous transfer is like a background thread: if you access the result before the thread has finished its job, you may read the wrong data. And this does not seem to be checked on the CPU side...
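To illustrate the safe CPU-to-GPU direction mentioned above, a small sketch, assuming the source tensor is in pinned memory so the copy can actually be asynchronous (names are made up):

        import torch

        cpu_t = torch.randn(1000).pin_memory()       # pinned host memory enables a truly async copy
        gpu_t = cpu_t.to("cuda", non_blocking=True)  # copy is queued on the current CUDA stream
        out = gpu_t * 2                              # safe: CUDA orders this kernel after the copy on the same stream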

Example (with an illustrative definition of x added so it runs):

        import time
        import torch

        # ....
        # x: cuda tensor; for illustration, assume something like:
        x = torch.arange(255., device="cuda")  # min 0, max 254, matching the output below

        min_x = x.min()
        max_x = x.max()

        # asynchronous copy to CPU: the CPU tensor may still hold its initial (zero) values
        t = (min_x - max_x).to(torch.device("cpu"), non_blocking=True)
        print(t)
        time.sleep(2.)
        print(t)

Output:

tensor(0.)
tensor(-254.)  # the right value: min_x = 0, max_x = 254, t = 0 - 254 = -254.

So: no to GPU-to-CPU transfers with non_blocking=True, unless you intend to use the transferred data much later on. And even then, you won't be sure whether the transfer has finished or not.
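One way to check, without blocking, whether such a transfer has finished (a sketch, not part of the original example) is to record a CUDA event right after queuing the copy and poll it:

        import torch

        x = torch.randn(1000, device="cuda")
        t = x.to("cpu", non_blocking=True)

        done = torch.cuda.Event()
        done.record()              # recorded on the current stream, after the queued copy

        # ... later ...
        if done.query():           # True once everything recorded before the event has finished
            print(t.sum())         # the copy is done, t is safe to read
        else:
            done.synchronize()     # or wait explicitly for the copy to finish
            print(t.sum())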

Note that printing a CUDA tensor normally creates a synchronization point, since the tensor has to be moved to the CPU before its content can be accessed. But because the lazy transfer has already created the tensor on the CPU, print just reads its (wrong) content.
