It looks rather like a synchronization issue.
In your second example, pred.byte().any() > 0
is computed on the GPU, so the operation is only enqueued and your second timer b
is called immediately, without waiting for the kernel to finish.
The if condition, however,
has to be evaluated on the CPU by the Python runtime, so it creates a synchronization point: Python must wait for the GPU result before it can branch.
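
To get comparable timings, one option is to synchronize explicitly before reading the clock. The following is only a minimal sketch: timed and demo are hypothetical helper names, and the random tensor is a stand-in for your pred, since I don't know its real shape. torch.cuda.synchronize() is the actual PyTorch call that blocks until all queued GPU work is done.

```python
import time


def timed(fn, sync=None):
    """Run fn() and return (result, elapsed_seconds).

    sync is an optional zero-argument callable that blocks until pending
    asynchronous work has finished, e.g. torch.cuda.synchronize.
    """
    if sync is not None:
        sync()  # drain work that was queued before the timer starts
    start = time.perf_counter()
    result = fn()
    if sync is not None:
        sync()  # wait for the work launched by fn() to finish
    return result, time.perf_counter() - start


def demo():
    import torch  # imported here so timed() itself has no dependencies

    device = "cuda" if torch.cuda.is_available() else "cpu"
    sync = torch.cuda.synchronize if device == "cuda" else None
    # Assumption: a boolean mask standing in for the original pred.
    pred = torch.rand(4096, 4096, device=device) > 0.5

    # No sync: on the GPU this mostly measures the kernel launch.
    _, t_async = timed(lambda: pred.byte().any() > 0)
    # With sync: includes the time the GPU spends computing.
    _, t_sync = timed(lambda: pred.byte().any() > 0, sync=sync)
    # bool(...) copies the result to the host, which also forces a wait.
    flag, t_bool = timed(lambda: bool(pred.byte().any() > 0))
    print(device, t_async, t_sync, t_bool, flag)
```

Calling demo() on a machine with a GPU should show the unsynchronized measurement coming out noticeably shorter than the other two, which would match what you are seeing. On the CPU all three should be roughly the same, because nothing is enqueued asynchronously there.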