Today I learned that `torch.cuda.synchronize()` creates a CUDA context on device 0 if the current device has not been set:
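import subprocess
import torch

# Select device 1 first. If this line is commented out, the current device stays at
# the default (device 0), and torch.cuda.synchronize() creates a new context there.
torch.cuda.set_device(1)
data = torch.zeros(1024, 1024, dtype=torch.float32, device="cuda:1")
torch.cuda.synchronize()

# Print the nvidia-smi process table to see which GPUs this process has a context on.
result = subprocess.run(["nvidia-smi"], check=True, capture_output=True)
print(result.stdout.decode("utf-8"))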
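
Reading the full `nvidia-smi` table by eye gets tedious on a busy machine, so here is a minimal sketch that narrows the output to the current process. The helper names `parse_compute_apps` and `gpus_used_by_this_process` are my own, not part of PyTorch or any NVIDIA tool. It assumes `nvidia-smi` accepts `--query-compute-apps=pid,gpu_uuid` with `--format=csv,noheader`, and that the PIDs it reports match `os.getpid()`, which may not hold inside a container with its own PID namespace.

```python
import os
import subprocess


def parse_compute_apps(csv_text, pid):
    """Return the sorted GPU UUIDs on which process `pid` appears.

    `csv_text` is expected to be header-less CSV with `pid, gpu_uuid` per line.
    """
    uuids = set()
    for line in csv_text.splitlines():
        fields = [field.strip() for field in line.split(",")]
        if len(fields) != 2 or not fields[0].isdigit():
            continue  # skip blank or malformed lines
        if int(fields[0]) == pid:
            uuids.add(fields[1])
    return sorted(uuids)


def gpus_used_by_this_process():
    """Query nvidia-smi and list the GPUs where this process shows up."""
    result = subprocess.run(
        ["nvidia-smi", "--query-compute-apps=pid,gpu_uuid", "--format=csv,noheader"],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_compute_apps(result.stdout, os.getpid())
```

Calling `print(gpus_used_by_this_process())` right after `torch.cuda.synchronize()` should, going by the observation above, list one GPU when `torch.cuda.set_device(1)` is present and two when it is commented out. The same check could be used to test whether passing the device explicitly, as in `torch.cuda.synchronize(device=1)`, avoids the extra context on your PyTorch version. I have not verified that here.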