Do in-place operations have implicit synchronization inside?

My simple test code:

import time
import random

import torch

a = torch.ones((10000, 10000)).cuda()

stream_1 = torch.cuda.Stream()
stream_2 = torch.cuda.Stream()

# queue 100 in-place additions on stream_1
with torch.cuda.stream(stream_1):
    for i in range(100):
        a.add_(1)

# give stream_1 a small, random head start on the host side
time.sleep(random.random() / 100)

# touch the same tensor on stream_2, with no explicit dependency on stream_1
with torch.cuda.stream(stream_2):
    a.zero_()
    # a = a + 1

torch.cuda.synchronize()
print(a.mean().item())

Using a.zero_(), the order of operations between the streams is ensured and the output is deterministically 0.
Using a = a + 1, the order of operations between the streams is not deterministic (which is expected; I’m using this code to understand CUDA behavior).

This output seems to indicate that in-place operations have implicit synchronization inside.

If you remove the sleep call, you would get 99 (or a random number depending on the kernel execution order), so I don’t fully understand your claim.
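For reference, the 0 result is only guaranteed if you create an explicit dependency between the streams, e.g. via wait_stream. Here is a minimal sketch of what I mean, reusing the tensor and streams from your snippet:

import torch

a = torch.ones((10000, 10000)).cuda()
stream_1 = torch.cuda.Stream()
stream_2 = torch.cuda.Stream()

with torch.cuda.stream(stream_1):
    for i in range(100):
        a.add_(1)

# make stream_2 wait for all work already queued on stream_1
stream_2.wait_stream(stream_1)

with torch.cuda.stream(stream_2):
    a.zero_()

torch.cuda.synchronize()
print(a.mean().item())  # deterministically 0.0 because of the explicit dependency

Without that wait, the in-place op itself does not add any ordering between the streams.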

What is your runtime environment? I always get 0 printed, on an RTX 4090 GPU.

I’m using a 3090, but I doubt the actual GPU matters much; I would rather make sure your CPU is fast enough to schedule all kernels onto the different streams.
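As a rough sanity check (my own sketch, not something you need to run), you could time how long the host takes just to enqueue the 100 add_ kernels versus how long the GPU needs to execute them; if the enqueue time plus your up-to-10 ms sleep is shorter than the GPU execution time, the zero_ on stream_2 can still be issued while additions are pending, and the race window opens:

import time
import torch

a = torch.ones((10000, 10000)).cuda()
s = torch.cuda.Stream()

torch.cuda.synchronize()
t0 = time.perf_counter()
with torch.cuda.stream(s):
    for i in range(100):
        a.add_(1)
t1 = time.perf_counter()  # kernels are only queued at this point
torch.cuda.synchronize()
t2 = time.perf_counter()  # now they have actually finished

print(f"host enqueue time: {(t1 - t0) * 1e3:.2f} ms")
print(f"GPU execution time: {(t2 - t0) * 1e3:.2f} ms")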

I used two servers to test the code, and the output is always 0: one with an RTX 4090 and another with a V100. Do you really get random output?