I’m reading many small files (tensors) that need to be loaded into GPU memory. In this case, can I open a thread pool and use .to() to move all of them to the GPU? I have 8 GPUs in total, each will take ~2000 such tensors, and this is done in Python with a multiprocessing.pool.ThreadPool(32).
A more general question: is it safe to access the same GPU device from different threads when each thread only touches its own thread-local objects (still in Python, e.g. using a thread pool)? E.g. if I do x = torch.rand(2, 2) in two different threads and manipulate each thread’s x separately, will I possibly run into any errors / undefined behaviors?
BTW, if you only wanna avoid blocking Host-to-Device (H2D) copies, you can use pinned CPU memory and then set non_blocking=True in Tensor.to(), which will make the copy async.
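A minimal sketch of that pinned-memory approach, in case it helps. The helper name load_to_device and the tensor shapes are made up for illustration, and the code falls back to a plain copy when no CUDA device is present:

```python
import torch

def load_to_device(cpu_tensors, device):
    """Copy a list of CPU tensors to `device` (hypothetical helper, not a PyTorch API)."""
    use_cuda = device.type == "cuda"
    out = []
    for t in cpu_tensors:
        if use_cuda:
            # Page-locked (pinned) host memory is what lets the H2D copy run asynchronously.
            t = t.pin_memory()
        # non_blocking only has an effect for pinned CPU -> CUDA copies.
        out.append(t.to(device, non_blocking=use_cuda))
    if use_cuda:
        # Block the host until all queued copies have finished.
        torch.cuda.synchronize(device)
    return out

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
tensors = [torch.rand(2, 2) for _ in range(8)]
on_device = load_to_device(tensors, device)
```

Note that pin_memory() itself makes a copy into page-locked memory, so for many tiny tensors it may be worth measuring whether the async copy actually pays off.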
E.g. if I do x = torch.rand(2, 2) in two different threads and manipulate each thread’s x separately, will I possibly run into any errors / undefined behaviors?
This should be fine, because x in these two threads will use different CUDA memory blocks. One thing to note, however, is that the Python GIL might kill perf for multi-threading. You should still be able to get some speedup though, as PyTorch drops the GIL when entering the C++ side.
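Here is a rough sketch of the per-thread pattern being asked about, assuming each worker creates and modifies only its own tensor. The function name work, the pool size, and the arithmetic are arbitrary choices for illustration; it runs on the CPU if no GPU is available:

```python
from multiprocessing.pool import ThreadPool

import torch

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

def work(seed):
    # Each call builds its own tensor; nothing is shared between threads.
    gen = torch.Generator().manual_seed(seed)
    x = torch.rand(2, 2, generator=gen).to(device)
    x = x * 2 + 1  # manipulate this thread's x only
    return x.cpu()

# All worker threads target the same device, each with its own tensors.
with ThreadPool(4) as pool:
    results = pool.map(work, range(8))
```

Because every thread owns its own x, the threaded results should match what you get by calling work sequentially with the same seeds.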