Is it safe to use `Tensor.cuda(non_blocking=True)` in a thread?

For example, if I create a thread pool whose workers perform async host-to-device transfers (or any async CUDA operations in general), will that be problematic? A more specific example: can a thread-local object (say, one created by `torch.rand(1_000_000_000).pin_memory()`) be passed to an async CUDA copy and then destroyed immediately when the thread is torn down, before the async copy completes?
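To make the concern concrete, here is a minimal sketch of the pattern I'm worried about (the function name `async_h2d` and the conservative `synchronize()` fix are my own illustration, not a claim about what PyTorch requires; it falls back to a CPU clone when CUDA isn't available so the snippet runs anywhere):

```python
import torch

def async_h2d(src: torch.Tensor) -> torch.Tensor:
    # The worry: `src.cuda(non_blocking=True)` returns before the DMA
    # actually finishes, so if the caller drops its last reference to
    # the pinned `src` right away, could the copy read freed memory?
    if not torch.cuda.is_available():
        return src.clone()  # CPU fallback so the sketch runs anywhere
    dst = src.cuda(non_blocking=True)  # returns immediately
    # One conservative way to be safe: block until the copy is done
    # before letting the pinned buffer go out of scope.
    torch.cuda.current_stream().synchronize()
    return dst

t = torch.rand(1024)
if torch.cuda.is_available():
    t = t.pin_memory()  # pinned memory is what enables a truly async copy
out = async_h2d(t)
assert out.shape == t.shape
```

The explicit `synchronize()` is obviously the blunt-instrument fix; my question is whether it (or an equivalent event-based wait) is actually necessary here.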

A related question: if I stick to multi-threading (not multi-processing), is it true that all threads share the same CUDA stream and event handles (which I assume are thread-safe)? In other words, if I just use the high-level multi-threading APIs (e.g. `ThreadPool`/`ThreadPoolExecutor`) without touching the CUDA stream APIs, is it very hard to create race conditions or other unsafe situations?
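For reference, this is roughly the usage pattern I mean (a sketch under my stated assumption: no explicit `torch.cuda.stream(...)` contexts, so every worker thread issues its copy on whatever the current/default stream is; the function name `transfer` is just for illustration):

```python
import torch
from concurrent.futures import ThreadPoolExecutor

def transfer(i: int) -> torch.Tensor:
    # Each worker creates its own host tensor and issues an async copy.
    # No stream APIs are touched anywhere, which is the setup my
    # question is about.
    src = torch.rand(256)
    if not torch.cuda.is_available():
        return src.clone()  # CPU fallback so the sketch runs anywhere
    return src.pin_memory().cuda(non_blocking=True)

with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(transfer, range(8)))
assert len(results) == 8
```

My (possibly wrong) assumption is that since all threads enqueue onto the same default stream, the copies are ordered with respect to each other rather than racing; I'd like confirmation of whether that intuition holds.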

Context: I’m benchmarking I/O efficiency under various settings (thread pool, process pool, DataLoader workers, etc.) and want to make sure my code has no obvious bugs (which can be a headache when concurrency, async transfers, and mixed CPU/GPU operations are involved).