Are CUDA tensors always updated synchronously? Is there some inherent mechanism that makes them thread safe?
I am updating a tensor from multiple processes, but I don't see the corruption I would expect from the lack of locking.
import torch, torch.multiprocessing as mp

n_processes, n_iterations = 4, 1_000_000

def fn(t):
    for _ in range(n_iterations):
        t[0] += 1

if __name__ == "__main__":
    context = mp.get_context('spawn')
    processes = []
    shared_array = torch.zeros(1, device='cpu')
    shared_array.share_memory_()
    for _ in range(n_processes):
        p = context.Process(target=fn, args=(shared_array,))
        processes.append(p)
        p.start()
    for process in processes:
        process.join()
    print(shared_array)
Output: tensor([3857068.])
which indicates a lack of thread safety: some increments are lost to unsynchronized read-modify-write races. (A thread-safe run would print 4_000_000, i.e. n_processes * n_iterations.)
However, replacing device='cpu' with device='cuda:0' in the torch.zeros call yields the output
tensor([4000000.], device='cuda:0')
which suggests the CUDA updates are thread safe.
This behavior is consistent across multiple runs on a Tesla V100 with CUDA 11.6 and PyTorch 1.10.2.
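For comparison, the CPU version does produce the exact count once the read-modify-write is guarded by a lock. A minimal sketch of this (my own addition, not from the run above; it uses far fewer iterations for speed, and the 'fork' start method so the code can live at module level on Linux — with 'spawn', as in the original, the same code belongs under an `if __name__ == "__main__":` guard):

```python
import torch
import torch.multiprocessing as mp

n_processes, n_iterations = 4, 10_000  # fewer iterations than the original, for speed

def fn(t, lock):
    for _ in range(n_iterations):
        with lock:        # serialize the read-modify-write on the shared tensor
            t[0] += 1

ctx = mp.get_context('fork')  # 'fork' (Linux) lets this run at module level
lock = ctx.Lock()
shared = torch.zeros(1)       # CPU tensor in shared memory
shared.share_memory_()
procs = [ctx.Process(target=fn, args=(shared, lock)) for _ in range(n_processes)]
for p in procs:
    p.start()
for p in procs:
    p.join()
print(shared)  # with the lock, no increments are lost: 4 * 10_000 = 40_000
```

Without the `with lock:` line this sketch reproduces the lost-update behavior on CPU, so the question is why the CUDA variant never seems to need the lock.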