My simple test code:
import time
import random
import torch

a = torch.ones((10000, 10000)).cuda()
stream_1 = torch.cuda.Stream()
stream_2 = torch.cuda.Stream()

with torch.cuda.stream(stream_1):
    for i in range(100):
        a.add_(1)
        time.sleep(random.random() / 100)

with torch.cuda.stream(stream_2):
    a.zero_()
    # a = a + 1

torch.cuda.synchronize()
print(a.mean().item())
Using a.zero_(), the order of operations between streams is ensured, and the output is deterministically 0;
Using a = a + 1, the order of operations between streams is not deterministic (which is expected, because I'm using this code to understand CUDA behavior).
The output seems to indicate that in-place operations perform some implicit synchronization internally.
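For comparison, here is a minimal sketch of the same experiment with the ordering made explicit, so the result does not depend on host-side timing or on any implicit behavior. It uses `Stream.wait_stream` to make stream_2 wait for everything already queued on stream_1. The function name `ordered_zero` and its parameters `n` and `steps` are hypothetical, introduced only for this illustration, and the tensor is smaller than in my test to keep it quick. This does not explain the observation above; it only shows what a guaranteed ordering looks like.

```python
import torch


def ordered_zero(n=1024, steps=100):
    """Return the mean of `a` after stream_2 zeroes it, or None if CUDA is unavailable."""
    if not torch.cuda.is_available():
        return None

    a = torch.ones((n, n), device="cuda")
    stream_1 = torch.cuda.Stream()
    stream_2 = torch.cuda.Stream()

    # Side streams are not ordered against the current stream automatically,
    # so wait for the tensor initialization before touching `a` on stream_1.
    stream_1.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(stream_1):
        for _ in range(steps):
            a.add_(1)

    # Explicit dependency: stream_2 waits for all work queued on stream_1 so far.
    stream_2.wait_stream(stream_1)
    with torch.cuda.stream(stream_2):
        a.zero_()

    # Block the host until every stream has finished before reading the result.
    torch.cuda.synchronize()
    return a.mean().item()


if __name__ == "__main__":
    print(ordered_zero())
```

With the explicit `wait_stream` call, `a.zero_()` is guaranteed to run after all the `add_` kernels, so the printed mean is 0.0 on a CUDA machine regardless of how long the host sleeps between launches.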