Moving tensors from GPU to CPU to GPU: time bottleneck

Hello,

I am currently working on a generative algorithm for discrete data (MCTS-like), based on Transformers, where I sample / extend tokens (nodes) to build a tree.
The Transformer is causal, so for speed I tried to keep the hidden states of each previous node in memory so I don't have to recompute them.

Doing so, I need to move them to the CPU to avoid running out of GPU memory. This is the bottleneck: it's very slow. I ran some benchmarks; here are the average times per iteration (by an iteration I mean creating a new node and running a simulation):

  • reusing hidden states and storing them on the CPU: 9.4 sec/it
  • reusing hidden states, keeping them on the GPU (until running OOM): 1.06 sec/it
  • recomputing all the hidden states during each forward pass: 6.76 sec/it

I ran this on a 2080S (8 GB); the CPU is an i7 10-something with 32 GB of RAM, on sequences of 512 tokens with a 12-layer Transformer (about 38M params).
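
For clarity, the caching/offloading I'm describing is roughly this (a simplified sketch, not my actual code; shapes and names are illustrative):

```python
import torch

# Illustrative sketch: cache per-node hidden states on the CPU and bring
# them back to the GPU when a node is expanded again.
hidden_cache = {}  # node_id -> list of per-layer hidden states on the CPU

def store_node(node_id, hidden_states):
    # hidden_states: one [seq_len, hidden_dim] tensor per layer, on the GPU
    hidden_cache[node_id] = [h.detach().cpu() for h in hidden_states]

def load_node(node_id, device="cuda"):
    # Copy the cached states back to the GPU before the next forward pass
    return [h.to(device) for h in hidden_cache[node_id]]
```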
I was very surprised (and disappointed) by these results: a factor of ~9 slowdown from the CPU offloading, and it's actually faster to just recompute everything at each forward pass.

Do you have any clues / possible solutions that could help?
If nothing else works, I am considering limiting the tree size / memory usage. I run OOM with ~140 nodes stored, but I would like to reach at least 200 (the more the better).
I have not tried FP16 yet, as I would need to make some changes to the code to make it compatible.

Thank you in advance!

Using mixed-precision training sounds like a valid approach, which might save some memory depending on the operations used.
Yes, recomputing can often be cheaper than moving things around, as compute performance usually grows much faster than memory bandwidth (it's generally not easy to keep your GPU busy on modern architectures), which is also what e.g. torch.utils.checkpoint does.
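
As a rough illustration (the model / input names are placeholders, not your code), autocast and checkpointing could look like this:

```python
import torch
from torch.utils.checkpoint import checkpoint

# Mixed precision: run the forward pass in FP16 where it is numerically safe
with torch.cuda.amp.autocast():
    logits = model(input_ids)  # `model` and `input_ids` are placeholders

# Activation checkpointing: don't keep intermediate activations alive,
# recompute them during the backward pass instead
out = checkpoint(transformer_block, hidden_states)  # placeholders as well
```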


Hello @ptrblck,

With FP16 and minimal modifications, I reach OOM at ~360 nodes stored on the GPU, with an average speed of 0.66 seconds / iteration. A great improvement! :tada:

I guess the best solution is to use FP16 coupled with careful dynamic tree-size management to stay within the limits of the GPU.
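
For the size management I'm thinking of something along these lines (a rough sketch; the budget and eviction policy are just illustrative):

```python
import torch

GPU_BUDGET = 7 * 1024**3  # illustrative: leave some slack on an 8 GB card

def can_keep_on_gpu(new_node_bytes, device=0):
    # Keep a new node's hidden states on the GPU only if they fit within
    # the budget; otherwise offload them (or an older node) to the CPU.
    return torch.cuda.memory_allocated(device) + new_node_bytes < GPU_BUDGET
```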

I am still concerned about the bandwidth vs. recomputation timings. I plan to read the CUDA Best Practices Guide to better understand what is behind this; I'll gladly take any reading suggestions if you have some.

Thank you, your help is, as always, very much appreciated!

That sounds great!
I enjoyed this GTC 2022 talk, How CUDA Programming Works, by Stephen Jones, in which he goes into more detail about the physical limitations of the GPU (memory bandwidth, compute, occupancy, etc.), so it might also be interesting for you.
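
If you want to see the memory-bandwidth limits directly, a quick (illustrative) timing like the following shows the difference between pageable and pinned host memory for device-to-host copies:

```python
import time
import torch

x_gpu = torch.randn(512, 12, 768, device="cuda")  # placeholder tensor size

def time_d2h_copy(pinned, iters=100):
    # Time a device-to-host copy into pageable vs. pinned host memory
    dst = torch.empty(x_gpu.shape, dtype=x_gpu.dtype, pin_memory=pinned)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        dst.copy_(x_gpu, non_blocking=pinned)
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters

print("pageable:", time_d2h_copy(False), "pinned:", time_d2h_copy(True))
```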
