Escaping if statement synchronization

sbelharbi · August 25, 2021, 3:42pm

**-- how to escape if synchronization. it is a huge problem because a simple if alone takes .33 sec while the entire forward pass in a large network takes only .0001s. the workaround shown below leads to wrong results because of the gpu-to-cpu transfer with non_blocking=True !!! WARNING: NEVER use gpu-to-cpu transfer with non_blocking=True. it is NOT safe. see the previous post. **

hi,
it seems that for cuda tensors a and b, even with 1 element each, the if statement will create a synchronization point, such as in this example:

if a == b:
    return 0

it is not the condition a == b that creates the synchronization, it is the if. the if seems to trigger a blocking transfer of the comparison result to the cpu, which makes it similar to calling torch.cuda.synchronize() before the if.
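
A small sketch of where the host read happens, assuming only standard PyTorch behavior: the comparison returns a tensor on the same device as its inputs, and Python's `if` then calls `bool()` on that tensor, which needs the value on the CPU. The tensor values are made up for illustration, and the code falls back to the CPU when no GPU is present.

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
a = torch.tensor([3.0], device=device)
b = torch.tensor([3.0], device=device)

cond = a == b  # queued like any other op; the result stays on `device`
same_device = cond.device == a.device

# `if cond:` is equivalent to `if bool(cond):`. bool() must read the value on
# the host, so on CUDA it waits for the pending work that produces `cond`.
flag = bool(cond)
print(same_device, flag)
```

The wait is paid at `bool(cond)`, which is why timing the `if` alone appears to bill it for all the GPU work queued before it.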

pointers: here and here.

we will see 2 examples here: one where the if evaluates the expression, and one where we prepare the boolean value ourselves.

the example below is taken from my ‘real’ code. a standalone snippet won’t show this issue.

i tried a solution using non_blocking=True but it GIVES THE WRONG RESULTS. this happens because of the non_blocking=True transfer: the tensor is created on the cpu filled with 0, and the cpu evaluates the condition before the transfer has finished. the doc does not seem to mention this possibility, nor the gpu-to-cpu transfer: Returns a Tensor with the specified device and (optional) dtype. If dtype is None it is inferred to be self.dtype. When non_blocking, tries to convert asynchronously with respect to the host if possible, e.g., converting a CPU Tensor with pinned memory to a CUDA Tensor.

this gpu-to-cpu transfer also happens implicitly, but in safe mode (non_blocking=False), whenever you run a python operation, i.e. on the cpu, over cuda tensors: a control statement, print, casting (int, …), calling cuda_tensor.item(), …

1. ‘if’ evaluates the condition:

example:

        x = x.detach()  # no gradient. x device: cuda.
        tx = time.perf_counter()
        min_x = x.min()
        max_x = x.max()
        print('time min max {}'.format(time.perf_counter() - tx))

        tx = time.perf_counter()
        if min_x == max_x:
            return min_x
        print('simple test Z {}'.format(time.perf_counter() - tx))

Output:

time min max 0.0002352111041545868
simple test Z 0.3392159678041935

2. we evaluate the condition and provide it to ‘if’:

in this way, we avoid the synchronization by performing the transfer from gpu to cpu with non_blocking=True. the doc says that when this argument is true, the transfer is done asynchronously if possible.
the worst scenario would be that you end up doing a synch if the resulting condition tensor is not ready.
but the issue is that the tensor on the cpu is likely to be wrong, because the copy has not finished yet when the cpu performs the evaluation.

example:

        tx = time.perf_counter()
        min_x = x.min()
        max_x = x.max()
        print('time min max {}'.format(time.perf_counter() - tx))

        tx = time.perf_counter()
        z = ((min_x - max_x) == 0).to(torch.device("cpu"), non_blocking=True)
        print('compute Z {}'.format(time.perf_counter() - tx))

        if z:
            return min_x
        print('simple test Z {}'.format(time.perf_counter() - tx))

Output:

time min max 0.00020601600408554077
compute Z 0.0007294714450836182
simple test Z 1.7356127500534058e-05

we get back to square one if we set non_blocking to false:

        tx = time.perf_counter()
        min_x = x.min()
        max_x = x.max()
        print('time min max {}'.format(time.perf_counter() - tx))

        if ((min_x - max_x) == 0).to(torch.device("cpu"), non_blocking=False):
            return min_x
        print('simple test Z {}'.format(time.perf_counter() - tx))

Output:

otsu time min max 0.00021892786026000977
otsu simple test Z 0.3317955397069454

3. you won’t see this behavior in this snippet:

import time

import torch

if __name__ == '__main__':
    seed = 0
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    a = torch.rand(200, 200).to(device)

    min_ = a.min()
    max_ = a.max()


    t = time.perf_counter()
    if min_ == max_:
        pass
    print('time {}'.format(time.perf_counter() - t))

thanks

gcramer23 · August 25, 2021, 11:48pm

Hi, why won’t the described behavior be seen in 3?

It seems as though this is a Python problem with the if statement, because of the CPU.

I am unsure if this will help since I can’t reproduce the problem, but will torch.equal work for 1? It outputs a bool (see pytorch/native_functions.yaml at 1be1c901aabd3ddcf55af3ee869e611b7f3f43b6 · pytorch/pytorch · GitHub). If it does, you can write GPU kernels for your if conditions.

sbelharbi · August 26, 2021, 12:04am

i think it has something to do with the kernels.
in my ‘real’ code, x is computed by some forward operations, so the kernels could take much longer to produce it, on top of the realtime load on the gpu at that moment. in the snippet code, i use torch.rand, which i assume is an easy task whose result is ready fast enough that subsequent operations are not blocked.

yes, the condition is evaluated on the gpu and results in a boolean tensor. but in order for the cpu to access it, it has to be transferred to the cpu.

i tried almost every imaginable way to get the value of the cuda boolean tensor without causing the blocking but it didn’t work. this includes torch.equal, indexing, evaluating the condition beforehand, numel(), … all of these result in a blocking (synchronization) in order to make sure the cpu evaluates the right value of the tensor. if that is not the case, and the access to the tensor is not synchronized, the cpu will produce wrong results.

it doesn’t seem there is a way to escape this. lazy transfer from gpu to cpu is unsafe and leads to wrong results, as mentioned in the header paragraph.

ptrblck · August 26, 2021, 8:47am

Just skimmed through the topic and I would claim the synchronization for Python data-dependent control flow is expected, since the CPU has to read the values in order to decide which code path to take.
Since the operation might not have finished yet, the host needs to synchronize there.
The same would apply e.g. for print(data), as the host cannot (or rather should not) print the values of a CUDATensor until the result was created.
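
When both code paths can be written as tensor operations, one possible way around the host read is to keep the decision on the device with `torch.where`. The sketch below is only an illustration: the function name `rescale_or_passthrough` and the min-max rescaling that stands in for "the rest of the function" are assumptions, not the original poster's code.

```python
import torch

def rescale_or_passthrough(x):
    """Rescale x to [0, 1]; return x unchanged if all its values are equal.

    No Python `if` on a tensor value, so no device-to-host read is forced.
    """
    min_x, max_x = x.min(), x.max()
    is_flat = min_x == max_x  # 0-dim bool tensor, stays on x.device
    # Replace a zero span by 1 so the division below never computes 0/0.
    span = torch.where(is_flat, torch.ones_like(max_x), max_x - min_x)
    rescaled = (x - min_x) / span
    # Select per element on the device; both branches are always computed.
    return torch.where(is_flat, x, rescaled)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
ramp = torch.arange(5.0, device=device)
flat = torch.full((4,), 2.0, device=device)
print(rescale_or_passthrough(ramp), rescale_or_passthrough(flat))
```

The trade-off is that both branches are evaluated every time, so this only pays off when the skipped branch is cheap compared with the stall it avoids. The GPU work itself still takes as long as it takes.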

udi · August 31, 2021, 7:03am

@sbelharbi
Skimmed through your question. Seems to me that everything is working properly, and the problem is that you don’t (yet) have a correct mental model of what’s going on under the hood.

Let’s start with the bad news - I am quite certain that your forward pass takes much more time than you think. Probably as much time as you think the ‘if’ statement consumes. This time is just being (correctly) hidden from you by the CUDA system. What you have measured as your forward pass timing is just the time required to queue the work onto the CUDA driver and return to your CPU-based program execution.

The good news is that nothing weird is happening and everything you are trying to do is perfectly achievable. With extra work, you can also implement fine grained synchronization avoiding brute force torch.cuda.synchronize(), which is the concurrency equivalent of using a nuke in the battlefield.

Of course you will have to deal with the real run-time of the computation.

Unfortunately the Torch documentation is extremely sparse in regard to synchronization issues and mostly assumes that users are either:

1. naive, sequential CPU-dwellers who transfer all data to the GPU and then do all computation on the GPU from there on, or
2. experienced, scarred CUDA programmers who know everything there is to know about asynchronous kernel execution.

Your case seems to fall in between these categories so you have been seeing what seems like concurrency voodoo.

The issue of synchronization and async execution is much too big for a post here so I will just give a few basic principles, hopefully these and what I’ve written above will give enough information to seek a solution:

1. Every time you execute a torch operation on the GPU, it is “non-blocking”, in the sense that the GPU is told to do some work in the future, and execution resumes on the Python CPU thread almost immediately. This frees up your Python program to keep generating future GPU work, or to prepare new data while the GPU is doing its thing.
2. The reason this asynchronous operation doesn’t cause immediate mayhem is that work is queued on something called a CUDA Stream; on the stream everything proceeds in the order you have queued it. The reason most torch users are not aware of this is that there is a “default stream”, and unless the user requests something else, all GPU operations are queued on it, making everything look sequential and calm, as long as your tensors remain in CUDA land.
3. All of this Stream business is a CUDA-only thing. Therefore it applies only to stuff happening on the ‘cuda’ device. On the CPU there is no such thing as streams and everything is simply synchronous with your Python program, just as most users expect.
4. So we have the GPU kingdom where everything is sequential because it’s on the default stream, and the CPU realm where everything is sequential because it is synchronous with the Python program. Each of these two domains maintains its own order, and they are largely independent of each other. This of course leaves the border between them as the problem area.
5. By default when you send a tensor across the boundary (non_blocking=False), the operation is synchronous on both sides, i.e. the copy is queued on the default CUDA stream AND your Python program is blocked until the tensor is completely moved to the other domain. This is nice and safe. And wastes precious time. This is why torch came up with non_blocking=True.
6. When you use non-blocking transfers, the copy work is queued on the default CUDA stream but the CPU is free to proceed. If you are doing CPU->GPU transfers, then most likely what happens next is that your Python program will queue further GPU operations for the tensor you have just uploaded. This is ok because the compute work is queued on the same stream as the copy work, and placed after it, so everything will end up blissfully fine.
7. However, when you move data from the GPU to the CPU with non_blocking=True, there is nothing there to protect you implicitly. It will queue the copy on the default CUDA stream, but your Python program is free to proceed. Quite likely your GPU is still busy doing computations that you have previously queued, and hasn’t even “heard” of the CPU-side tensor you are expecting to do further compute on. Therefore that tensor is going to contain zeros or garbage when your CPU starts crunching it.

To properly use a GPU result on the CPU side with a non-blocking transfer, you will need to make sure that the copy to the CPU has completed before you start consuming the data.
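
One way this could look, as a minimal sketch rather than a definitive recipe: copy the flag into a pinned CPU buffer with `non_blocking=True`, record a `torch.cuda.Event` right after the copy, and wait on that event before reading the buffer. The helper name `is_constant` is hypothetical, and on a machine without CUDA the function simply takes the synchronous CPU path.

```python
import torch

def is_constant(x):
    """Return True if all elements of x are equal."""
    flag = (x.min() == x.max()).reshape(1)
    if not x.is_cuda:
        return bool(flag)  # CPU tensors are already synchronous

    # Page-locked destination, so the copy can run asynchronously to the host.
    host = torch.empty(1, dtype=torch.bool, pin_memory=True)
    host.copy_(flag, non_blocking=True)  # queued after min/max on the current stream
    done = torch.cuda.Event()
    done.record()  # marks the point in the stream right after the copy

    # ... independent CPU work could run here while the GPU catches up ...

    done.synchronize()  # block only until the copy has actually finished
    return bool(host)   # safe to read now

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(is_constant(torch.ones(8, device=device)))
print(is_constant(torch.arange(8.0, device=device)))
```

Reading `host` before `done.synchronize()` is exactly the unsafe case described above. The event does not make the GPU faster; it only lets the CPU do something else until the value is really needed.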

Before discussing how this can be accomplished, you have to ask yourself whether this is at all useful to you. In many (simple) cases there is nothing useful for the CPU to do until the compute result from the GPU is available. If this is the case, the easiest way to ensure the copy is complete is to do a blocking transfer (set non_blocking=False). Since the copy work is queued on the default CUDA stream, the copy will begin after all computation on the data has finished. Since Python is blocked for it, your CPU computation will only proceed after the copy has finished. This is also when you need to stop the timer if you want to measure how much time the GPU computation really took.
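
To make the timing point concrete, here is a hedged sketch that measures the same call twice: once stopping the clock as soon as Python gets control back, and once after `torch.cuda.synchronize()`. The helper `timed_forward` and the toy workload are assumptions for illustration; on a CPU-only machine both numbers are simply the real run time.

```python
import time

import torch

def timed_forward(fn, *args):
    """Return (result, enqueue_seconds, total_seconds) for fn(*args)."""
    use_cuda = torch.cuda.is_available()
    if use_cuda:
        torch.cuda.synchronize()  # drain earlier work so it is not billed to fn
    t0 = time.perf_counter()
    out = fn(*args)
    enqueue = time.perf_counter() - t0  # on CUDA: only the cost of queueing kernels
    if use_cuda:
        torch.cuda.synchronize()  # wait until the queued kernels have finished
    total = time.perf_counter() - t0  # what the computation really took
    return out, enqueue, total

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
x = torch.rand(512, 512, device=device)
out, enqueue, total = timed_forward(lambda t: (t @ t).relu().min(), x)
print("enqueue {:.6f}s  total {:.6f}s".format(enqueue, total))
```

On a GPU, `enqueue` corresponds to the very small "forward" time reported at the top of the thread, while `total` includes the time that otherwise shows up at the first synchronizing statement.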

If you decide that you need more advanced asynchronicity in your program, I advise you to study some more about CUDA kernel launching. Keep in mind that every torch operation on Tensors on the GPU is equivalent to a sequence of one or more CUDA kernel launches in succession. You will need to be familiar with the following CUDA concepts and their torch interfaces: Stream, default stream, Event, overlapped computation and data transfer.

I hope this helps.

udi · August 31, 2021, 7:22am

Just adding something important I realized I haven’t said explicitly:

The reason why you “tried almost every imaginable way to get the value of the cuda boolean tensor without causing the blocking but it didnt work” is that the work for computing the min and max and possibly the comparison between them hasn’t completed yet. You NEED to wait for the GPU to finish computing before you can have a useful and correct result.

Yes, with non_blocking=True you can free up your CPU to proceed before the GPU is done computing, but you can’t expect the CPU to have a correct result to use before the GPU produces it.

In other words, it’s not that the synchronization “wastes” time, it’s the GPU computation that takes time. The synchronization simply “cures your blindness” to this time.

wangchengtao · September 13, 2023, 8:26am

As you said,

And I also found this in the PyTorch docs: "As an exception, several functions such as to() and copy_() admit an explicit non_blocking argument, which lets the caller bypass synchronization when it is unnecessary. Another exception is CUDA streams, explained below."
So if we set non_blocking=True, also as you said,

Since CUDA streams are another exception, I wonder: if we use a non-default stream for the data transfer, is the Python program also free to proceed even if we set non_blocking=False?

udi · September 14, 2023, 1:35pm

Off the top of my head, based on intuition rather than knowledge, I would say no, with non_blocking=False Python should not be allowed to proceed.

When an operation is said to be “blocking” (= not non-blocking), any sane programmer’s expectation would be that it will block progress for the program until completed, regardless of the details of the operation. In this case, regardless of which CUDA stream it was launched on.
So I would expect the PyTorch implementation to synchronize with whatever the current CUDA stream is, whether it is the default one or not.
If it is being indiscriminate, PyTorch may also wait on the default stream, or on all streams in the process, but I think that’s unlikely and wasteful.

More generally, if you are using non-default streams, it means that you are (or want to be) a bit of a concurrency ninja.
In that case I would recommend not relying on any kind of implicit synchronization like non_blocking=False.
Instead, you should use explicit synchronization, by recording CUDA events on your CUDA streams and explicitly synchronizing to those events in your CPU (Python) program.
See the related documentation here.
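
A rough sketch of that explicit pattern, under stated assumptions: the helper name `copy_to_host_on_side_stream` is hypothetical, the copy runs on a separate `torch.cuda.Stream`, and the Python side waits on a recorded `torch.cuda.Event` instead of relying on a blocking transfer. Without CUDA the example falls back to a plain CPU tensor.

```python
import torch

def copy_to_host_on_side_stream(src):
    """Queue a GPU->CPU copy of `src` on a side stream; return (host, event)."""
    side = torch.cuda.Stream()
    host = torch.empty(src.shape, dtype=src.dtype, pin_memory=True)
    # The side stream must not start before the work producing `src` is done.
    side.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(side):
        host.copy_(src, non_blocking=True)
        done = torch.cuda.Event()
        done.record(side)  # completes once the copy on `side` has completed
    # Tell the allocator that `side` uses src's memory, so it is not reused early.
    src.record_stream(side)
    return host, done

if torch.cuda.is_available():
    x = torch.arange(1024, dtype=torch.float32, device="cuda") * 2
    result, done = copy_to_host_on_side_stream(x)
    # ... other CPU work, or more GPU work on the current stream ...
    done.synchronize()  # explicit wait, only for this copy
else:
    result = torch.arange(1024, dtype=torch.float32) * 2
print(result[:4])
```

Waiting on the event blocks the CPU only until this particular copy is done, whereas `torch.cuda.synchronize()` waits for every stream on the device.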

wangchengtao · September 17, 2023, 12:57pm

I understand.
Thanks very much!