I am running the following compression/optimization algorithms on my model:
1- pruning
2- fusion
3- quantization
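For context, the three steps above roughly correspond to this kind of pipeline (a minimal sketch on a toy model, not my actual network; the layer sizes and pruning amount are placeholders):

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# Toy stand-in for the real model (assumption, for illustration only).
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
model.eval()

# 1- pruning: zero out 30% of weights via a mask; the mask is extra
#    memory until prune.remove() bakes it into the weight tensor.
prune.l1_unstructured(model[0], name="weight", amount=0.3)
prune.remove(model[0], "weight")

# 2- fusion: merge the Linear + ReLU pair ("0" and "1" in the Sequential).
fused = torch.ao.quantization.fuse_modules(model, [["0", "1"]])

# 3- quantization: dynamic int8 quantization of the Linear layers.
quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8)

out = quantized(torch.randn(2, 16))
print(out.shape)  # torch.Size([2, 4])
```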
I am using the PyTorch profiler to monitor the memory usage of the models as follows:
import torch

# accumulators for predictions/labels (omitted above)
all_preds = torch.empty(0)
all_labels = torch.empty(0)

prof = torch.profiler.profile(
    activities=[torch.profiler.ProfilerActivity.CPU],
    on_trace_ready=torch.profiler.tensorboard_trace_handler(logPath),
    profile_memory=True,
    record_shapes=True,
    with_stack=True,
)
prof.start()
for image, label in test_loader:
    with torch.no_grad():
        output, min_distances = model(image)
    all_preds = torch.cat((all_preds, output), dim=0)
    all_labels = torch.cat((all_labels, label), dim=0)
    prof.step()
prof.stop()
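For reference, the peak-memory numbers I am comparing come from the profiler's memory columns. A minimal self-contained sketch of reading them (toy workload, not my actual loop):

```python
import torch

# Profile a small matrix multiply and print the per-operator
# CPU memory columns (my real run profiles the inference loop instead).
with torch.profiler.profile(
    activities=[torch.profiler.ProfilerActivity.CPU],
    profile_memory=True,
) as prof:
    x = torch.randn(256, 256)
    y = torch.mm(x, x)

# Operators sorted by the memory attributed directly to them.
print(prof.key_averages().table(
    sort_by="self_cpu_memory_usage", row_limit=5))
```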
Strangely, when I monitor the peak memory (RAM) usage of the aforementioned models (the originally trained model, the pruned model, and the fused and quantized model), it gets higher with every one.
While the quantized model is clearly smaller on disk and faster at inference, it still shows a higher peak memory usage.
Question 1: Is this normal behavior? I understand that pruning in PyTorch might not reduce memory on its own (it only masks weights), but quantization should.
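To illustrate what I mean about pruning: as far as I can tell, a pruned layer actually stores more, not less, until the mask is baked in (small sketch, toy layer):

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

lin = nn.Linear(8, 8)
prune.l1_unstructured(lin, name="weight", amount=0.5)

# The layer now holds both the original weights ('weight_orig') and a
# mask buffer ('weight_mask'), so it uses MORE memory than before
# until prune.remove(lin, 'weight') makes the pruning permanent.
print(sorted(n for n, _ in lin.named_buffers()))     # ['weight_mask']
print(sorted(n for n, _ in lin.named_parameters()))  # ['bias', 'weight_orig']
```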
Question 2: Am I measuring memory (RAM) usage correctly, or is there a better way to do it?
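One process-level alternative I have been considering (just a sketch, Unix-only, standard library only) is reading the peak resident set size directly:

```python
import resource

# Peak resident set size of the current process so far.
# Caveat: ru_maxrss is in kilobytes on Linux but bytes on macOS.
peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print(f"peak RSS so far: {peak}")
```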
Thanks