**-- how to escape if synchronization? it is a huge problem because the evaluation of a simple if takes 0.33 s while the entire forward of a large network takes only 0.0001 s. below is an attempt that leads to WRONG results because of a gpu-to-cpu transfer with non_blocking=True !!! WARNING: NEVER use a gpu-to-cpu transfer with non_blocking=True. it is NOT safe. see the previous post. **
hi,
it seems that for two cuda tensors a and b, even with 1 element each, an if statement creates a synchronization point, as in this example:
if a == b:
    return 0
it is not the condition a == b that creates the synchronization, it is the if itself. the if seems to trigger a blocking transfer of the result of the comparison to the cpu, which makes it similar to calling torch.cuda.synchronize() right before the if.
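as far as i understand (this is my own minimal sketch, not pytorch internals), the if boils down to reading the scalar value of the comparison tensor on the host, and that read is what blocks:

import torch

a = torch.rand(1, device="cuda")
b = torch.rand(1, device="cuda")

# roughly what `if a == b:` does:
cond = (a == b)      # 1-element bool tensor, still on the gpu, computed asynchronously
flag = bool(cond)    # reading the value on the host blocks until the gpu result is ready
if flag:
    print('equal')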
we will look at 2 examples here: a first one where the if evaluates the expression itself, and a second one where we prepare the boolean value ourselves.
the example below is taken from my 'real' code; a small standalone snippet won't show this issue.
i tried a solution using non_blocking=True but it GIVES WRONG RESULTS. this happens because of the non_blocking=True transfer: the tensor is created on the cpu filled with 0, but the transfer has not finished yet when the cpu evaluates the condition. the doc does not seem to mention this possibility, nor the gpu-to-cpu direction: "Returns a Tensor with the specified device and (optional) dtype. If dtype is None it is inferred to be self.dtype. When non_blocking, tries to convert asynchronously with respect to the host if possible, e.g., converting a CPU Tensor with pinned memory to a CUDA Tensor."
this transfer of data from gpu to cpu happens implicitly, but in the safe mode (non_blocking=False), whenever you run a python (i.e. cpu) operation over cuda tensors, such as a control statement, print, casting (int, ...), calling cuda_tensor.item(), ...
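for illustration, a small sketch of my own of such implicit synchronization points (each line below needs the actual value on the host, so it blocks until the gpu result is ready):

import torch

x = torch.rand(1000, 1000, device="cuda")
s = x.sum()      # queued on the gpu, returns immediately

print(s)         # printing needs the value -> implicit blocking transfer
v = int(s)       # casting to a python number -> same
v = s.item()     # explicit scalar read -> same
if s > 0:        # control statement -> same
    pass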
1. 'if' evaluates the condition:
example:
x = x.detach() # no gradient. x device: cuda.
tx = time.perf_counter()
min_x = x.min()
max_x = x.max()
print('time min max {}'.format(time.perf_counter() - tx))
tx = time.perf_counter()
if min_x == max_x:
    return min_x
print('simple test Z {}'.format(time.perf_counter() - tx))
Output:
time min max 0.0002352111041545868
simple test Z 0.3392159678041935
2. we evaluate the condition and provide it to 'if':
in this way, we try to avoid the synchronization by performing the transfer from gpu to cpu with non_blocking=True. the doc says that when this flag is true, the transfer is done asynchronously if possible. the worst case scenario is that you end up synchronizing anyway if the resulting condition tensor is not ready. but the real issue is that the tensor on the cpu is likely to hold a wrong value, because the copy is not finished yet while the cpu has already evaluated the condition...
example:
tx = time.perf_counter()
min_x = x.min()
max_x = x.max()
print('time min max {}'.format(time.perf_counter() - tx))
tx = time.perf_counter()
z = ((min_x - max_x) == 0).to(torch.device("cpu"), non_blocking=True)
print('compute Z {}'.format(time.perf_counter() - tx))
if z:
    return min_x
print('simple test Z {}'.format(time.perf_counter() - tx))
Output:
time min max 0.00020601600408554077
compute Z 0.0007294714450836182
simple test Z 1.7356127500534058e-05
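as far as i can tell (a sketch under my assumption that the async copy is queued on the current cuda stream), the only way to read z safely would be to wait for the copy explicitly, which brings back exactly the synchronization we were trying to avoid:

z = ((min_x - max_x) == 0).to(torch.device("cpu"), non_blocking=True)
torch.cuda.synchronize()   # wait until the copy has actually landed in z
if z:                      # only now does z hold the real comparison result
    return min_x

so the fast timing above is fast only because z is read before the copy is done.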
we get back to square one if we set non_blocking to False:
tx = time.perf_counter()
min_x = x.min()
max_x = x.max()
print('time min max {}'.format(time.perf_counter() - tx))
if ((min_x - max_x) == 0).to(torch.device("cpu"), non_blocking=False):
    return min_x
print('simple test Z {}'.format(time.perf_counter() - tx))
Output:
otsu time min max 0.00021892786026000977
otsu simple test Z 0.3317955397069454
3. you won't see this behavior in this snippet:
import time
import torch
if __name__ == '__main__':
    seed = 0
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    a = torch.rand(200, 200).to(device)
    min_ = a.min()
    max_ = a.max()
    t = time.perf_counter()
    if min_ == max_:
        pass
    print('time {}'.format(time.perf_counter() - t))
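my guess is that the snippet is too small: the gpu has already finished min/max by the time the if reads the value, so there is nothing to wait for. a sketch of my own (timings will vary) that should expose the cost by queuing a lot of gpu work first:

import time
import torch

device = torch.device("cuda:0")
a = torch.rand(4096, 4096, device=device)

# queue a long chain of asynchronous gpu work
b = a
for _ in range(50):
    b = (b @ a) * 0.0001   # scaling only to keep the values bounded
min_ = b.min()
max_ = b.max()

t = time.perf_counter()
if min_ == max_:   # blocks until all the queued work above is done
    pass
print('time {}'.format(time.perf_counter() - t))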
thanks