I have a question about `tensor.to(non_blocking=True)`.

I saw the topic [Should we set non_blocking to True? - PyTorch Forums] and read the answers there. I also read the post [How to Overlap Data Transfers in CUDA C/C++ | NVIDIA Technical Blog] (I'm a new user, so I can't add another link).

I wonder if my understanding is correct.

The following code is from ptrblck's answer:

```python
for data, target in loader:
    # Overlapping transfer if pinned memory
    data = data.to('cuda:0', non_blocking=True)
    target = target.to('cuda:0', non_blocking=True)

    # The following code will be called asynchronously,
    # such that the kernel will be launched and returns control
    # to the CPU thread before the kernel has actually begun executing
    output = model(data)  # has to wait for data to be pushed onto device (synch point)
    loss = criterion(output, target)
    loss.backward()
    optimizer.step()
```

(Figure: C2050Timeline from the NVIDIA blog, comparing the sequential transfer with asynchronous versions 1 and 2.)

I assume that asynchronous data transfer to the GPU is available. Also, I am using a single GPU.
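For reference, this is roughly how I set up the loader so that this assumption holds. A minimal sketch; the dataset here is just a stand-in:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in dataset just for illustration.
dataset = TensorDataset(torch.randn(1024, 3, 32, 32),
                        torch.randint(0, 10, (1024,)))

# pin_memory=True makes the loader return batches in page-locked (pinned)
# host memory, which is what lets .to(..., non_blocking=True) issue a truly
# asynchronous H2D copy; with pageable memory the copy is effectively
# synchronous, as far as I know.
loader = DataLoader(dataset, batch_size=64, num_workers=2, pin_memory=True)
```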

Q. Is the data split into small chunks and the operations executed like "asynchronous version 2" in the picture?

From what I understand, the batch data is split into small chunks (NVIDIA's async transfer). Then the kernels are launched: in this code, `output = model(data)`, then `loss = criterion(output, target)`, then `loss.backward()` and `optimizer.step()`. I think the backward pass has to be computed over the full batch if there is a batch normalization layer, so in my opinion `loss.backward()` should be a sync point.
Also, I think `optimizer.step()` must block the main process, because the model can only compute the next output after the parameters have been updated.
Is my understanding correct?
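To make the first question concrete: as I understand it, the `.to()` call itself returns before the copy completes, which could be checked with a sketch like this (the tensor shape is arbitrary):

```python
import time
import torch

# A pinned host tensor, so the copy below can actually run asynchronously.
x = torch.randn(64, 3, 224, 224).pin_memory()

torch.cuda.synchronize()
t0 = time.perf_counter()
y = x.to('cuda:0', non_blocking=True)  # should return before the copy finishes
t1 = time.perf_counter()
torch.cuda.synchronize()               # blocks until the copy is done
t2 = time.perf_counter()

print(f"call returned after {t1 - t0:.6f}s, copy finished after {t2 - t0:.6f}s")
```

If I understand correctly, the first timing should be much smaller than the second one.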

I have another question.
Q. While the GPU executes the output computation, loss computation, backward, and optimizer step, does the CPU transfer the next data to the GPU in parallel (`data = data.to('cuda:0', non_blocking=True)`; `target = target.to('cuda:0', non_blocking=True)`)?

In the picture above, there are a kernel engine and an H2D engine. From what I understand, the kernel engine and the H2D engine are separate: the H2D engine just transfers data to the device regardless of what the kernel engine is doing. The kernel engine waits until the optimization step finishes; then, if the data has already been transferred to the device, it computes the model outputs. In conclusion, the data-loading part (data transfer to the GPU) and the model-computation part (output, loss, backward, optimizer step) run in parallel, but the model output can only be computed once its data has already arrived on the device.
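To make this concrete, here is how I picture the overlap if it were written out explicitly with a side stream. This sketch is my own assumption (not from ptrblck's answer) and reuses `loader`, `model`, `criterion`, and `optimizer` from above:

```python
import torch

copy_stream = torch.cuda.Stream()  # side stream dedicated to H2D copies

def prefetch(it):
    batch = next(it, None)
    if batch is None:
        return None
    data, target = batch
    # Issue the copies on the side stream so they can run while the
    # default stream is still busy with the previous batch's compute.
    with torch.cuda.stream(copy_stream):
        data = data.to('cuda:0', non_blocking=True)
        target = target.to('cuda:0', non_blocking=True)
    return data, target

it = iter(loader)
batch = prefetch(it)
while batch is not None:
    # The compute (default) stream must wait for the copy to finish
    # before the model kernels read the new batch.
    torch.cuda.current_stream().wait_stream(copy_stream)
    data, target = batch
    # Tell the caching allocator these tensors are used on the default
    # stream, so their memory is not reused too early.
    data.record_stream(torch.cuda.current_stream())
    target.record_stream(torch.cuda.current_stream())

    batch = prefetch(it)  # next H2D copy overlaps with the compute below

    optimizer.zero_grad()
    output = model(data)
    loss = criterion(output, target)
    loss.backward()
    optimizer.step()
```

If something like this happens implicitly, then the H2D engine and the kernel engine would indeed run in parallel, subject to the wait before the kernels read the new batch.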

Thank you for reading my topic.
