Deterministic PRNG across CPU + CUDA?

What is a fast, deterministic PRNG for torch that gives near-identical results on CUDA and CPU?

Unfortunately, torchcsprng is a bottleneck in our code, so we are looking for another deterministic pseudo-random number generator that does not need to be cryptographically secure. However, even the built-in generators give different results on CUDA and CPU for the same seed:

On CUDA:

import torch
torch.use_deterministic_algorithms(True)
torch.cuda.manual_seed(0)  # seed the CUDA generator
torch.rand(3, 3, device="cuda")

tensor([[0.3990, 0.5167, 0.0249],
        [0.9401, 0.9459, 0.7967],
        [0.4150, 0.8203, 0.2290]], device='cuda:0')

On CPU:

import torch
torch.use_deterministic_algorithms(True)
torch.manual_seed(0)  # seed the default CPU generator
torch.rand(3, 3)

tensor([[0.4963, 0.7682, 0.0885],
        [0.1320, 0.3074, 0.6341],
        [0.4901, 0.8964, 0.4556]])
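
For comparison, the only approach I am sure gives matching values is to sample on the CPU with an explicit torch.Generator and then copy the tensor to the target device. A minimal sketch follows; the helper name rand_cpu_then_move is hypothetical (not part of torch), and the extra host-to-device copy means it may not be fast enough for a hot path:

```python
import torch


def rand_cpu_then_move(shape, seed, device="cpu"):
    """Hypothetical helper: sample uniform values on the CPU, then copy them to `device`."""
    gen = torch.Generator(device="cpu")  # dedicated generator, global RNG state is untouched
    gen.manual_seed(seed)
    values = torch.rand(shape, generator=gen)  # always drawn from the CPU generator
    return values.to(device)  # the copy does not change the values


# Fall back to the CPU when no GPU is present so the snippet still runs.
device = "cuda" if torch.cuda.is_available() else "cpu"

a = rand_cpu_then_move((3, 3), seed=0)
b = rand_cpu_then_move((3, 3), seed=0, device=device)
print(torch.equal(a, b.cpu()))
```

This avoids the mismatch only because the CUDA generator is never used; it does not make the two native generators agree.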

Related to: Reproducibility over Different Machines