Optimizing training (recording batch statistics)

When training deep learning models, we want to maximise GPU usage; ideally the GPU should sit at 100% utilization at all times.

This is a follow-up to my previous post, Recording loss history without I/O, and also takes reference from https://www.sagivtech.com/2017/09/19/optimizing-pytorch-training-code/ (you don't need to read either to follow this post).

It is often useful to record training progress in deep learning pipelines, and typically we display the training statistics at the end of every batch. Recording the training history on the CPU is suboptimal, though, because of the wait for the GPU-to-CPU transfer after every training step. Below is an example of bad training practice.

# We assume that loss_history is a Python list
# and loss is a CUDA tensor of size [1]
loss_history.append(loss.item())

Why? Because loss.item() copies the value from the GPU to the CPU, which forces a synchronization: the host blocks until the GPU has finished computing the loss. That time could otherwise have been used to queue the next forward and backward passes.
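One alternative I have been toying with is to keep the per-batch losses on the GPU as detached tensors and transfer them to the CPU only once per epoch, so there is a single sync point instead of one per batch. Below is a minimal sketch with a toy model and random data (the model, optimizer, and loop sizes are just placeholders for illustration); I am not sure this is the recommended approach:

import torch
import torch.nn as nn

# Toy setup just for illustration; replace with your own model/data.
model = nn.Linear(10, 1).cuda()
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
data = [(torch.randn(32, 10), torch.randn(32, 1)) for _ in range(100)]

loss_history = []  # one averaged CPU value per epoch

for epoch in range(5):
    batch_losses = []  # CUDA tensors, no .item() inside the inner loop
    for inputs, targets in data:
        inputs, targets = inputs.cuda(), targets.cuda()
        optimizer.zero_grad()
        loss = criterion(model(inputs), targets)
        loss.backward()
        optimizer.step()
        # detach() drops the autograd graph but keeps the value on the GPU
        batch_losses.append(loss.detach())
    # one GPU -> CPU transfer (and sync) per epoch instead of per batch
    loss_history.append(torch.stack(batch_losses).mean().item())

I am also unsure whether holding many small CUDA tensors in a Python list has downsides of its own, which is part of why I am asking.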

What is the proper way to collate training statistics so that GPU usage stays high?
Thanks in advance!