Thread safety in Multiprocessing - CUDA tensors don't update asynchronously

Are CUDA tensors always updated synchronously? Is there some inherent mechanism for thread safety?
I am updating a shared tensor from multiple processes without any locking, but I don't see the errors I would expect from the missing synchronization.

import torch
import torch.multiprocessing as mp

n_processes, n_iterations = 4, 1_000_000

def fn(t):
    # Increment the shared tensor with no locking at all.
    for _ in range(n_iterations):
        t[0] += 1

if __name__ == "__main__":
    context = mp.get_context('spawn')
    processes = []
    # Put the tensor in shared memory so every process updates the same storage.
    shared_array = torch.zeros(1, device='cpu')
    shared_array.share_memory_()
    for _ in range(n_processes):
        p = context.Process(target=fn, args=(shared_array,))
        processes.append(p)
        p.start()
    for process in processes:
        process.join()
    print(shared_array)

Output: tensor([3857068.]), which indicates a lack of thread safety: with properly serialized updates the result would be 4,000,000 (4 processes × 1,000,000 increments each).
However, allocating the tensor with shared_array = torch.zeros(1, device='cuda:0') instead produces tensor([4000000.], device='cuda:0'), which suggests the updates are thread-safe on the GPU.
I observe this behavior consistently over multiple runs on a Tesla V100 with CUDA 11.6 and PyTorch 1.10.2.
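
For comparison, guarding the increment with an explicit lock does make the CPU version reach 4,000,000. Below is a minimal sketch of that variant; the lock argument and the context.Lock() object are my additions, not part of the script above.

import torch
import torch.multiprocessing as mp

n_processes, n_iterations = 4, 1_000_000

def fn(t, lock):
    for _ in range(n_iterations):
        with lock:  # serialize the read-modify-write across processes
            t[0] += 1

if __name__ == "__main__":
    context = mp.get_context('spawn')
    lock = context.Lock()
    shared_array = torch.zeros(1, device='cpu')
    shared_array.share_memory_()
    processes = []
    for _ in range(n_processes):
        p = context.Process(target=fn, args=(shared_array, lock))
        processes.append(p)
        p.start()
    for process in processes:
        process.join()
    print(shared_array)  # expected: tensor([4000000.]), though much slower

With the lock in place the CPU run matches the CUDA run (at the cost of much slower execution), so the question is really what makes the unlocked CUDA version behave as if each t[0] += 1 were serialized.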