Is it thread-safe to call `Tensor.to(device)` from multiple threads targeting the same GPU device?

I’m reading many small files (tensors) that need to be loaded into GPU memory. Can I open a thread pool and use `.to()` to move all of them to the GPU? I have 8 GPUs in total, each will receive ~2000 such tensors, and this is done in Python with a `multiprocessing.pool.ThreadPool(32)`.

A more general question: is it safe to access the same GPU device from different threads as long as each thread works with thread-local objects (still in Python, e.g. using a thread pool)? E.g. if I do `x = torch.rand(2, 2)` in two different threads and manipulate each thread’s `x` separately, will I possibly run into any errors / undefined behavior?

It should be safe, as long as you use CUDA streams/events to make sure that subsequent ops do not access the data before it is ready. See the docs below.

https://pytorch.org/docs/stable/generated/torch.cuda.Event.html
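
For example, here is a minimal sketch of that pattern (the stream, event, and tensor names are illustrative, not from the docs): the copy is issued on a side stream, an event is recorded after it, and the consuming stream waits on the event before touching the data.

```python
import torch

device = torch.device("cuda:0")

copy_stream = torch.cuda.Stream(device=device)
ready = torch.cuda.Event()

# Pinned source memory so the async copy can actually overlap with host code.
cpu_tensor = torch.rand(1024, 1024).pin_memory()

with torch.cuda.stream(copy_stream):
    # Enqueue the H2D copy on the side stream.
    gpu_tensor = cpu_tensor.to(device, non_blocking=True)
    # Mark the point at which the copy is complete.
    ready.record(copy_stream)

# Make the default stream wait until the copy has finished
# before any op on it reads gpu_tensor.
torch.cuda.current_stream(device).wait_event(ready)
result = gpu_tensor.sum()
```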

BTW, if you just want to avoid blocking on Host-to-Device (H2D) copies, you can use pinned CPU memory and set `non_blocking=True` in `Tensor.to()`, which makes the copy asynchronous with respect to the host.
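
For instance (a small sketch; the shapes are made up):

```python
import torch

# pin_memory() returns a page-locked copy, which is what allows the
# H2D transfer to run asynchronously.
cpu_tensor = torch.rand(1024, 1024).pin_memory()
gpu_tensor = cpu_tensor.to("cuda", non_blocking=True)  # returns immediately

# ... other host-side work can run here while the copy is in flight ...

# Synchronize (or wait on an event, as above) before reading gpu_tensor.
torch.cuda.synchronize()
```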

> E.g. if I do `x = torch.rand(2, 2)` in two different threads and manipulate each thread’s `x` separately, will I possibly run into any errors / undefined behavior?

This should be fine, because `x` in these two threads will use different CUDA memory blocks. However, one thing to note is that the Python GIL might hurt performance for multi-threading. You should still be able to get some speedup, though, as PyTorch releases the GIL when entering the C++ side.
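
A rough sketch of that setup, assuming a made-up `worker` function and pool size, where each thread only ever touches its own tensor:

```python
import torch
from multiprocessing.pool import ThreadPool

def worker(i):
    # Thread-local tensor: each call allocates its own CUDA memory block.
    x = torch.rand(2, 2, device="cuda")
    # The matmul/sum run in C++ with the GIL released; .item() synchronizes.
    return (x @ x).sum().item()

with ThreadPool(8) as pool:
    results = pool.map(worker, range(32))
```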


Thank you for the explanation! This completely answers my questions.