Tensor.to, 'non_blocking' & CUDA streams optimisation

Hi All,

Thank you!

Thanks for all the hard work: making some incredibly powerful software, and helping newer members of the community (like me) to get to grips with it!

2 low-level questions

I’m hoping someone can help me get to the bottom of how

Tensor.to(device=gpu,non_blocking=True)

actually works. (I’ve tried searching the GitHub repo, but it looks like a complex hierarchy of classes and implementations that has left me a bit bewildered - probably because it is written for efficiency & implemented in C/C++?)

  1. Where is the implementation for ‘Tensor.to’?
  2. and/or how does ‘non_blocking’ relate to cuda streams?

My motivation:

I’ve been reading some of the posts about CUDA streams on the NVIDIA blog and want to implement my own strategy to optimise the pipeline of transferring data from system RAM to GPU RAM, running the computation, and then returning the results to system RAM. I’ve got some code running and it is more efficient than the version without streams, but there seems to be a lot more data transfer than I’m expecting, and I’m wondering whether this is somehow linked to ‘non_blocking’.

Wider context again:

I want the transfers to be non-blocking to allow the CPU to queue up multiple transfers while the GPU is performing calculations, to maximise throughput.

To that end, I envisage two strategies right now (roughly sketched in code after this list):

  • A) use one CUDA stream per batch for the data transfer, the subsequent computation and the return to the host, and repeat this many times in parallel (possibly using an event to wait for one computation to finish before starting the next, to ensure that there are never so many kernels running in parallel that they cause an OOM error)
  • B) use one stream for data transfers and use events to synchronise prior to computation in another CUDA stream (likely slower, as computation would be forced to be more sequential, but as long as the queue is big enough this would also ensure models do not start running in parallel and causing OOM errors), then use another event to synchronise the return to system RAM.
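Roughly, I imagine the two strategies looking something like the sketch below (my own rough sketch, not anything from the library; run_model and the batch tensors here are just placeholders for my real workload):

import torch

device = torch.device("cuda")

def run_model(x):
    return x * 2.0  # placeholder for the real computation

# a few pinned host batches, plus pinned host buffers for the results
batches = [torch.randn(1024, 1024).pin_memory() for _ in range(4)]
outputs = [torch.empty_like(b).pin_memory() for b in batches]

# Strategy A: one stream per in-flight batch; each stream carries its own
# upload -> compute -> download sequence, and the streams overlap with each other.
streams = [torch.cuda.Stream() for _ in batches]
for host_in, host_out, stream in zip(batches, outputs, streams):
    with torch.cuda.stream(stream):
        dev_in = host_in.to(device, non_blocking=True)   # H2D copy on this stream
        dev_out = run_model(dev_in)                      # compute on this stream
        host_out.copy_(dev_out, non_blocking=True)       # D2H copy on this stream
torch.cuda.synchronize()  # wait for every stream before reading the outputs

# Strategy B: a dedicated copy stream plus a compute stream, ordered by events.
copy_stream = torch.cuda.Stream()
compute_stream = torch.cuda.Stream()
for host_in, host_out in zip(batches, outputs):
    upload_done = torch.cuda.Event()
    compute_done = torch.cuda.Event()

    with torch.cuda.stream(copy_stream):
        dev_in = host_in.to(device, non_blocking=True)
        upload_done.record(copy_stream)

    with torch.cuda.stream(compute_stream):
        compute_stream.wait_event(upload_done)    # do not compute before the upload lands
        dev_in.record_stream(compute_stream)      # tell the allocator dev_in is used here too
        dev_out = run_model(dev_in)
        compute_done.record(compute_stream)

    with torch.cuda.stream(copy_stream):
        copy_stream.wait_event(compute_done)      # do not copy back before the compute finishes
        dev_out.record_stream(copy_stream)
        host_out.copy_(dev_out, non_blocking=True)
torch.cuda.synchronize()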

I would not be surprised to discover this has been implemented - I cannot seem to find ‘torch::deploy’ in the docs, so please forgive me if I’m missing something obvious!

2 high-level questions

  1. Would both strategies work with streams and ‘non_blocking’ transfers, or do ‘non_blocking’ transfers use a dedicated stream, meaning synchronisation needs more careful consideration?

  2. Any thoughts on other strategies I should consider?

Thanks!

I’ll quote the CUDA C Programming Guide:

3.2.6.4. Concurrent Data Transfers

Some devices of compute capability 2.x and higher can overlap copies to and from the device. Applications may query this capability by checking the asyncEngineCount device property (see Device Enumeration), which is equal to 2 for devices that support it. In order to be overlapped, any host memory involved in the transfers must be page-locked.

(3.2.5.3) Mapped memory

There is no need to use streams (see Concurrent Data Transfers) to overlap data transfers with kernel execution; the kernel-originated data transfers automatically overlap with kernel execution.

see also: 3.2.6.5.5. Overlapping Behavior

so, basically, x.pin_memory().to(…, non_blocking=True) corresponds to cudaMemcpyAsync; however, pin_memory() has to copy the data first, unless the memory is page-locked in advance…
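for illustration, something like this (just a sketch; the point is that a page-locked buffer can be allocated once and reused, while pin_memory() creates a fresh page-locked copy on every call):

import torch

device = torch.device("cuda")
pageable = torch.randn(1 << 20)                 # ordinary pageable host tensor

# pageable source: the data has to be staged through page-locked memory first,
# so non_blocking=True cannot make this copy fully asynchronous
gpu_a = pageable.to(device, non_blocking=True)

# page-locked buffer allocated up front: filling it is the same staging copy that
# pin_memory() would do, but the buffer is reusable, and the H2D transfer itself
# maps onto cudaMemcpyAsync and returns control to the CPU immediately
pinned = torch.empty(1 << 20, pin_memory=True)
pinned.copy_(pageable)                          # CPU -> pinned CPU staging copy
gpu_b = pinned.to(device, non_blocking=True)    # asynchronous H2D copy

torch.cuda.synchronize()                        # wait before relying on gpu_b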

Hi Alex,

Thanks for your reply!

I hadn’t found those guides before, so I’ll spend a while reading through them over the next few days. I think they will provide a wealth of clues, but they may also leave me struggling even more to relate the CUDA concepts to the PyTorch code.

If anyone can provide insight into the relationship between CUDA and PyTorch (I guess there could be internal development docs?), that may short-cut a number of follow-up questions.

relation? well, python frontend connects to c++ backend, the backend dispatches to device-specific modules, including cuda (nvcc compiled .cu). naturally, there is a lot of adaptation code in all layers.

Hi Alex,

Thanks again for that.

I guess there may be some kind of demarcation of “python frontend”, “c++ backend” and “device-specific modules” inside the source repo?

I have only just found the documentation on the C++ API, so I can begin reading that as well; by the look of it, that will also give me a better understanding of the underlying structure.

well, it is a big c++ codebase, and they’re seldom easily approachable… apart from the website docs, there are also some scattered readme.md files and some non-published memos in docs/source/, but it is easier to search for something and then navigate from there.

e.g. Tensor::to is in aten/src/ATen/native/TensorConversions.cpp, and I think execution will lead to aten/src/ATen/native/cuda/Copy.cu. in general, most of the interesting stuff is under aten/src/ATen.
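if you want to see that path from the python side, the profiler is a quick way to connect the dots (rough sketch; the exact event names differ a bit between versions):

import torch
from torch.profiler import profile, ProfilerActivity

x = torch.randn(1024, 1024).pin_memory()

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    y = x.to("cuda", non_blocking=True)
    torch.cuda.synchronize()

# expect something like aten::to / aten::copy_ on the CPU side and an
# HtoD memcpy on the CUDA side
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))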

I realize this is not a direct answer to your question, but this tutorial by Ed Yang really helped me understand a bit more of what’s going on: Let’s talk about the PyTorch dispatcher : ezyang’s blog