Per the tuning guide section on CPU-GPU synchronization, we should try to allow the CPU to run ahead of the GPU, and avoiding tensor.to(device) calls is a good idea. I assume one good practice is to use non_blocking=True in the to(device) calls.
- Is there ever a circumstance in which one should avoid non_blocking=True, or is it False by default just for legacy reasons?
- Besides non_blocking=True: for an image dataset (to be clear, one too large to fit entirely in a GPU tensor) that uses data augmentation and a dataloader with random sampling, where is the best place to run the to(device) call? Is it just when iterating through the data batches in the training / inference loops, as in the sketch below, or is there some opportunity for optimization by running it elsewhere?
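To make that second question concrete, here is a minimal sketch of the pattern I'm currently using, with the to(device) call inside the training loop right after the batch comes out of the DataLoader. The FakeData dataset, the tiny linear model, and the augmentation pipeline are just placeholders standing in for my real setup:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Stand-in dataset: plays the role of "images on disk, augmented on the CPU".
augment = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])
dataset = datasets.FakeData(size=512, image_size=(3, 64, 64), num_classes=10, transform=augment)

loader = DataLoader(
    dataset,
    batch_size=64,
    shuffle=True,        # random sampling
    num_workers=2,       # augmentation runs in worker processes on the CPU
    pin_memory=True,     # workers collate batches into page-locked host memory
)

# Stand-in model, just so the loop is complete and runnable.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 10)).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

for images, labels in loader:
    # The host-to-device copy lives here, inside the loop. With a pinned source
    # batch, non_blocking=True should let the copy overlap with CPU work on the
    # next batch instead of blocking until the transfer finishes.
    images = images.to(device, non_blocking=True)
    labels = labels.to(device, non_blocking=True)

    optimizer.zero_grad()
    loss = loss_fn(model(images), labels)
    loss.backward()
    optimizer.step()
```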
Thanks for the DALI tip, I’ll have a look.
Regarding page-locked memory on the host:
- Is the concern that memory usage accumulates over time as more and more objects are placed into page-locked memory, or that a single object can be large enough to consume all of the memory? I'm assuming the former, but just verifying.
- If it’s the former, is there any option to tell the system to unlock the previously page-locked memory at some convenient moment?
- Does the memory hogging persist across different processes? That is, if I run a model training script that uses page-locked memory and the script stops running, should I expect the memory to then be released, or can it outlive the process that originated it?
- Is there any relationship between to(device) and specifying pin_memory=True in a DataLoader?
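For reference, this is the combination the last bullet is asking about. The explicit pin_memory() call here is just standing in for what DataLoader(pin_memory=True) does per batch, and my (possibly wrong) assumption is that the pinned source buffer is what lets the non_blocking copy actually overlap with CPU work:

```python
import torch

# Assumes a CUDA device is available.
device = torch.device("cuda")

x = torch.randn(1024, 1024)   # ordinary pageable host memory
x_pinned = x.pin_memory()     # page-locked copy of the same data; this is what
                              # DataLoader(pin_memory=True) produces for each batch

# Copy from pageable memory: my understanding is this still blocks the host,
# even with non_blocking=True.
a = x.to(device, non_blocking=True)

# Copy from pinned memory: the transfer can run asynchronously and the CPU
# is free to move on (e.g. to prepare the next batch).
b = x_pinned.to(device, non_blocking=True)
```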