I am running the following compression/optimization algorithms on my model:
1- pruning
2- fusion
3- quantization
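For context, the three steps above roughly correspond to this kind of pipeline (a minimal sketch on a toy model, not my actual network; the layer sizes and pruning amount are placeholders):

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# Toy stand-in for the real model (assumption, for illustration only).
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
model.eval()

# 1- pruning: zero out 30% of weights via a mask; the mask is extra
#    memory until prune.remove() bakes it into the weight tensor.
prune.l1_unstructured(model[0], name="weight", amount=0.3)
prune.remove(model[0], "weight")

# 2- fusion: merge the Linear + ReLU pair ("0" and "1" in the Sequential).
fused = torch.ao.quantization.fuse_modules(model, [["0", "1"]])

# 3- quantization: dynamic int8 quantization of the Linear layers.
quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8)

out = quantized(torch.randn(2, 16))
print(out.shape)  # torch.Size([2, 4])
```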
I am using the PyTorch profiler to monitor the memory usage of the models as follows:
import torch

# accumulators for predictions/labels (omitted above)
all_preds = torch.empty(0)
all_labels = torch.empty(0)

prof = torch.profiler.profile(
    activities=[torch.profiler.ProfilerActivity.CPU],
    on_trace_ready=torch.profiler.tensorboard_trace_handler(logPath),
    profile_memory=True,
    record_shapes=True,
    with_stack=True,
)
prof.start()
for image, label in test_loader:
    with torch.no_grad():
        output, min_distances = model(image)
    all_preds = torch.cat((all_preds, output), dim=0)
    all_labels = torch.cat((all_labels, label), dim=0)
    prof.step()
prof.stop()
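For reference, the peak-memory numbers I am comparing come from the profiler's memory columns. A minimal self-contained sketch of reading them (toy workload, not my actual loop):

```python
import torch

# Profile a small matrix multiply and print the per-operator
# CPU memory columns (my real run profiles the inference loop instead).
with torch.profiler.profile(
    activities=[torch.profiler.ProfilerActivity.CPU],
    profile_memory=True,
) as prof:
    x = torch.randn(256, 256)
    y = torch.mm(x, x)

# Operators sorted by the memory attributed directly to them.
print(prof.key_averages().table(
    sort_by="self_cpu_memory_usage", row_limit=5))
```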
Strangely, when I monitor the peak memory (RAM) usage of the aforementioned models (the originally trained model, the pruned model, and the fused and quantized model), it gets higher with every one.
While the quantized model is clearly smaller on disk and faster at inference, it still shows a higher peak memory usage.
Question 1: Is this normal behavior? I understand that pruning in PyTorch might not reduce memory on its own (it only masks weights), but quantization should.
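To illustrate what I mean about pruning: as far as I can tell, a pruned layer actually stores more, not less, until the mask is baked in (small sketch, toy layer):

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

lin = nn.Linear(8, 8)
prune.l1_unstructured(lin, name="weight", amount=0.5)

# The layer now holds both the original weights ('weight_orig') and a
# mask buffer ('weight_mask'), so it uses MORE memory than before
# until prune.remove(lin, 'weight') makes the pruning permanent.
print(sorted(n for n, _ in lin.named_buffers()))     # ['weight_mask']
print(sorted(n for n, _ in lin.named_parameters()))  # ['bias', 'weight_orig']
```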
Question 2: Am I measuring memory (RAM) usage correctly, or is there a better way to do it?
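One process-level alternative I have been considering (just a sketch, Unix-only, standard library only) is reading the peak resident set size directly:

```python
import resource

# Peak resident set size of the current process so far.
# Caveat: ru_maxrss is in kilobytes on Linux but bytes on macOS.
peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print(f"peak RSS so far: {peak}")
```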
Thanks