How can I obtain GPU utilization/memory metrics during training in a fast way?

Hello everyone, I’ve been trying to figure out how to obtain GPU performance metrics in a fast way, but I couldn’t find a good approach (I’m training ResNet-50 on CIFAR-10, nothing fancy).

The function I use to get the GPU utilization and memory utilization values is the following:

def collect_gpu_statistics():
    # Run nvidia-smi via Jupyter's "!" shell magic; the CSV query returns a header line
    # followed by a data line such as "38 %, 5 %"
    gpu_statistics = !nvidia-smi --query-gpu=utilization.gpu,utilization.memory --format=csv
    # Split the data line on spaces and keep every other token to drop the "%" parts
    gpu_util, memory_util = [int(x) for x in gpu_statistics[1].split(' ')[::2]]
    return gpu_util, memory_util

and I call it like this in the training loop:

for train_step, (image, label) in enumerate(train_dataloader):
    optimizer.zero_grad()
    image = image.cuda()
    label = label.cuda()

    prediction = model(image)
    loss = criterion(prediction, label.squeeze())
    loss.backward()
    optimizer.step()
    scheduler.step()
    gpu_util, memory_util = collect_gpu_statistics()
    ### do something with these metrics

    ## since the reduction is happening in CrossEntropyLoss itself
    average_meter.update(loss.item(), 1)

However, it slows my training down from 16 s per epoch to more than 40 s per epoch, which is unacceptable.
Are there any best practices on this topic?
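
The only mitigation I can think of so far is to sample less often, e.g. only every 50 steps (the step count is just a guess on my part), but I’d still prefer per-step numbers if the overhead can be avoided:

# Query the GPU only every N steps to amortize the cost of the call
if train_step % 50 == 0:
    gpu_util, memory_util = collect_gpu_statistics()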

P.S.
I’m using a Jupyter notebook for training, which is why you see the “!nvidia-smi …” shell magic in the collect_gpu_statistics() function.
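
Since spawning an nvidia-smi subprocess every step is probably what’s slow, one idea I’ve been looking at (not sure if it’s best practice) is querying NVML directly from Python through the pynvml bindings. A minimal sketch, assuming pynvml is installed and I’m training on GPU 0:

import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # GPU 0

def collect_gpu_statistics_nvml():
    # Same two percentages as the nvidia-smi query, but queried in-process
    rates = pynvml.nvmlDeviceGetUtilizationRates(handle)
    return rates.gpu, rates.memory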

Memory stats are available from this PR: Expose `cudaMemGetInfo` by coreylammie · Pull Request #58635 · pytorch/pytorch · GitHub
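
If I read that PR correctly, the memory side can be queried in-process with torch.cuda.mem_get_info(). Note that this gives memory usage, which is not quite the same thing as nvidia-smi’s utilization.memory (the percentage of time memory was being read or written). A rough sketch:

import torch

# Free and total device memory in bytes, straight from CUDA (no subprocess)
free_bytes, total_bytes = torch.cuda.mem_get_info()
memory_used_percent = 100 * (1 - free_bytes / total_bytes)

# PyTorch's own allocator statistics are also cheap to read
allocated = torch.cuda.memory_allocated()  # bytes currently occupied by tensors
reserved = torch.cuda.memory_reserved()    # bytes reserved by the caching allocator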