Memory leak when saving a model with batchnorm

We observe a slight memory leak when a model with BatchNorm layers is saved.
The leak appears in the epoch immediately after the one in which the model is saved.

Example: MobileNetV2 on ImageNette, with the model saved only after reaching 40% accuracy.
The leak is observed at epoch 5:


    [INFO] Epoch: 1
    Memory allocated at_start:  11.4487304688 MB
             Training Step 1184/1184 :: Loss: 2.181 | Acc: 17.066% (1616/9469)
             Validation Step 491/491 :: Loss: 1.983 | Acc: 28.204% (1107/3925)
    Memory allocated at_end :  457.4116210938 MB
    diff :  445.962890625

    [INFO] Epoch: 2
    Memory allocated at_start:  457.4116210938 MB
             Training Step 1184/1184 :: Loss: 1.987 | Acc: 26.370% (2497/9469)
             Validation Step 491/491 :: Loss: 1.790 | Acc: 33.885% (1330/3925)
    Memory allocated at_end :  457.4116210938 MB
    diff :  0.0                                          <--- No difference hereafter

    [INFO] Epoch: 3
    Memory allocated at_start:  457.4116210938 MB
             Training Step 1184/1184 :: Loss: 1.886 | Acc: 31.144% (2949/9469)
             Validation Step 491/491 :: Loss: 1.681 | Acc: 37.376% (1467/3925)
    Memory allocated at_end :  457.4116210938 MB
    diff :  0.0

    [INFO] Epoch: 4
    Memory allocated at_start:  457.4116210938 MB
             Training Step 1184/1184 :: Loss: 1.819 | Acc: 35.020% (3316/9469)
             Validation Step 491/491 :: Loss: 1.585 | Acc: 45.783% (1797/3925)
    Memory allocated at_end :  457.4116210938 MB
    diff :  0.0                                          <--- Model saved here

    [INFO] Epoch: 5
    Memory allocated at_start:  457.4116210938 MB
             Training Step 1184/1184 :: Loss: 1.741 | Acc: 40.194% (3806/9469)
             Validation Step 491/491 :: Loss: 1.579 | Acc: 45.860% (1800/3925)
    Memory allocated at_end :  457.4370117188 MB
    diff :  0.025390625                                  <--- Leak observed

    [INFO] Epoch: 6
    Memory allocated at_start:  457.4370117188 MB
             Training Step 1184/1184 :: Loss: 1.640 | Acc: 44.218% (4187/9469)
             Validation Step 491/491 :: Loss: 1.314 | Acc: 56.739% (2227/3925)
    Memory allocated at_end :  457.4370117188 MB
    diff :  0.0

The values above are taken from torch.cuda.max_memory_allocated().
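
For context, a minimal sketch of how these per-epoch numbers can be collected; the loop variables and the train_one_epoch / validate helpers are placeholders, not the actual code from the report:

    import torch

    MB = 1024 ** 2

    for epoch in range(1, num_epochs + 1):
        start = torch.cuda.max_memory_allocated() / MB
        print(f"[INFO] Epoch: {epoch}")
        print(f"Memory allocated at_start: {start} MB")

        train_one_epoch(model, optimizer, train_loader)   # placeholder
        validate(model, val_loader)                       # placeholder

        end = torch.cuda.max_memory_allocated() / MB
        print(f"Memory allocated at_end : {end} MB")
        print(f"diff : {end - start}")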

The model is saved as follows:

        # save a checkpoint only once the best validation accuracy exceeds the configured threshold
        if self.best_acc > self.min_acc:
            state = {
                'net': self.model.state_dict(), 'acc': accuracy,
                'epoch': epoch, 'opt': self.optimizer.state_dict(),
            }
            if self.scheduler is not None:
                state['scheduler'] = self.scheduler.state_dict()

            torch.save(state, self.state_filename)

We are reporting this because we have a custom BatchNorm implemented on top of torch's BatchNorm, and it shows even more leakage.

torch.cuda.max_memory_allocated() returns the peak memory usage, so it would be interesting to hear more about your debugging and how you've tried to infer a memory leak from it.
E.g. what did memory_summary() return, and did you see “lost” allocations?
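
For reference, the summary can be printed directly; a minimal snippet (the device argument is optional):

    import torch

    # print a human-readable breakdown of the caching allocator's state,
    # including active, inactive, and non-releasable blocks
    print(torch.cuda.memory_summary(device=None, abbreviated=False))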

We found the reason for this slight increase: the state variable used for saving the checkpoint (shown above) was never released during the subsequent training epochs.
Adding state = None after saving the checkpoint helped.
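
For clarity, a sketch of the amended save path (the same snippet as above, with the reference dropped after the save):

        if self.best_acc > self.min_acc:
            state = {
                'net': self.model.state_dict(), 'acc': accuracy,
                'epoch': epoch, 'opt': self.optimizer.state_dict(),
            }
            if self.scheduler is not None:
                state['scheduler'] = self.scheduler.state_dict()

            torch.save(state, self.state_filename)
            state = None  # drop the reference so the checkpoint dict can be freed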

The issue is observed only with models containing BatchNorm layers and likely has to do with the running-statistics buffers, since the same memory increase is not observed with track_running_stats=False.
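
As an illustration of that difference (a minimal sketch with standalone layers, not the model used above): with track_running_stats=False, a BatchNorm layer keeps no running-statistics buffers, so there is nothing extra to serialize in the checkpoint.

    import torch.nn as nn

    bn_tracked = nn.BatchNorm2d(64)                                # keeps running_mean / running_var buffers
    bn_untracked = nn.BatchNorm2d(64, track_running_stats=False)   # buffers are None

    print(list(bn_tracked.state_dict().keys()))
    # ['weight', 'bias', 'running_mean', 'running_var', 'num_batches_tracked']
    print(list(bn_untracked.state_dict().keys()))
    # ['weight', 'bias']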