Performance optimization re: CPU-GPU synchronization

Per the tuning guide section on CPU-GPU synchronization, we should try to let the CPU run ahead of the GPU, and avoiding unnecessary tensor.to(device) calls is a good idea. I assume one good practice is to pass non_blocking=True to the to(device) calls.
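
For reference, this is the kind of call I mean (just a minimal sketch; the device and tensor shape are placeholders):

```python
import torch

device = torch.device('cuda')

# stand-in for a batch produced by a CPU-side DataLoader
images = torch.randn(32, 3, 224, 224)

# the pattern I have in mind: intended as an async host-to-device copy
images = images.to(device, non_blocking=True)
```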

  1. Is there ever a circumstance in which one should avoid non_blocking=True, or is it False by default just for legacy reasons?
  2. Besides non_blocking=True, for an image dataset (to be clear, one too large to fit entirely onto the GPU) which uses data augmentation and a DataLoader with random sampling, where is the best place to run the to(device) call? Is it just when iterating through the data batches in the training / inference loops, or is there some opportunity for optimization by running it elsewhere?
  1. You would use page-locked (pinned) memory on the host, which is a limited resource. Once memory is page-locked, the OS can no longer page it out or use it for anything else, and depending on your system you might run into a situation where the OS starts swapping, which would kill the performance of your system.

  2. In the common use case you would push the data to the GPU inside the DataLoader loop (see the sketch below). Depending on your use case and system you might also want to check out e.g. DALI, which could yield a speedup.
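
Roughly something like this minimal sketch (the dataset here is just a random placeholder standing in for your augmented image dataset):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

device = torch.device('cuda')

# placeholder dataset; in your case this would be the augmented image dataset
dataset = TensorDataset(torch.randn(256, 3, 64, 64),
                        torch.randint(0, 10, (256,)))
loader = DataLoader(dataset, batch_size=32, shuffle=True)

for images, targets in loader:
    # push each batch to the GPU here, inside the loop
    images = images.to(device)
    targets = targets.to(device)
    # forward / backward pass, optimizer step, etc.
```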

Thanks for the DALI tip, I’ll have a look.

Regarding the page-locked memory on the host:

  1. Is the concern that the memory used accumulates over time as more and more objects are placed into page-locked memory, or is the concern that a single object can be large enough to consume all the memory? I’m assuming the former, but just want to verify.
  2. If it’s the former, is there any option to tell the system to unlock the previously page-locked memory at some convenient moment?
  3. Does the memory hogging persist across different processes? Meaning if I run a model training script using page-locked memory, and the script stops running, should I expect the memory to then be released, or is it possible that it outlives the process that originated it?
  4. Is there any relationship between to(device) and specifying pin_memory=True in a DataLoader?
  1. If you are using pin_memory=True in the DataLoader, the amount of page-locked memory should not grow over time, as it is only used for the batches currently in flight. The concern would be if you are generally allocating too much (e.g. a large number of workers, a huge batch size, etc.) and/or if you are explicitly pinning large tensors yourself. In the end you should definitely try it, but note that it could decrease performance depending on your system, which is why it’s not enabled by default.

  2. The DataLoader handles this for you by default. If you’ve pinned a tensor manually via tensor = tensor.pin_memory(), you would have to del it eventually or overwrite it so the page-locked memory can be released.

  3. The Python process would hold on to the page-locked memory, and it should be freed once the corresponding tensors are deleted or the Python process stops, so it should not outlive the process that allocated it.

  4. Yes, using pin_memory=True will move the data tensors on the host into page-locked memory, and the to('cuda') operation can then use non_blocking=True to allow an async host-to-device transfer, as in the sketch below.
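
Putting 1., 2. and 4. together, a minimal sketch (shapes, batch size, etc. are just placeholders):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

device = torch.device('cuda')

# placeholder dataset standing in for the real (augmented) image dataset
dataset = TensorDataset(torch.randn(512, 3, 64, 64),
                        torch.randint(0, 10, (512,)))

# pin_memory=True lets the DataLoader place each batch into page-locked memory,
# which is only held for the batches currently in flight
loader = DataLoader(dataset, batch_size=64, shuffle=True, pin_memory=True)

for images, targets in loader:
    # the pinned batches can be transferred asynchronously
    images = images.to(device, non_blocking=True)
    targets = targets.to(device, non_blocking=True)
    # training / inference step ...

# explicitly pinning a standalone tensor keeps the host memory page-locked
# until the tensor is deleted or overwritten
x = torch.randn(1024, 1024).pin_memory()
x_gpu = x.to(device, non_blocking=True)
torch.cuda.synchronize()  # make sure the async copy has finished
del x  # releases the page-locked host memory
```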