Memory leak when saving a model with batchnorm

We observe a slight memory leak when a model with BatchNorm layers is saved.
The leak appears in the epoch immediately after the one in which the model is saved.

Example: MobileNetV2 on ImageNette, with the model saved only after reaching 40% accuracy.
The leak is observed at epoch 5:


    [INFO] Epoch: 1
    Memory allocated at_start:  11.4487304688 MB
             Training Step 1184/1184 :: Loss: 2.181 | Acc: 17.066% (1616/9469)
             Validation Step 491/491 :: Loss: 1.983 | Acc: 28.204% (1107/3925)
    Memory allocated at_end :  457.4116210938 MB
    diff :  445.962890625

    [INFO] Epoch: 2
    Memory allocated at_start:  457.4116210938 MB
             Training Step 1184/1184 :: Loss: 1.987 | Acc: 26.370% (2497/9469)
             Validation Step 491/491 :: Loss: 1.790 | Acc: 33.885% (1330/3925)
    Memory allocated at_end :  457.4116210938 MB
    diff :  0.0                                          <--- No difference hereafter

    [INFO] Epoch: 3
    Memory allocated at_start:  457.4116210938 MB
             Training Step 1184/1184 :: Loss: 1.886 | Acc: 31.144% (2949/9469)
             Validation Step 491/491 :: Loss: 1.681 | Acc: 37.376% (1467/3925)
    Memory allocated at_end :  457.4116210938 MB
    diff :  0.0

    [INFO] Epoch: 4
    Memory allocated at_start:  457.4116210938 MB
             Training Step 1184/1184 :: Loss: 1.819 | Acc: 35.020% (3316/9469)
             Validation Step 491/491 :: Loss: 1.585 | Acc: 45.783% (1797/3925)
    Memory allocated at_end :  457.4116210938 MB
    diff :  0.0                                          <--- Model saved here

    [INFO] Epoch: 5
    Memory allocated at_start:  457.4116210938 MB
             Training Step 1184/1184 :: Loss: 1.741 | Acc: 40.194% (3806/9469)
             Validation Step 491/491 :: Loss: 1.579 | Acc: 45.860% (1800/3925)
    Memory allocated at_end :  457.4370117188 MB
    diff :  0.025390625                                  <--- Leak observed

    [INFO] Epoch: 6
    Memory allocated at_start:  457.4370117188 MB
             Training Step 1184/1184 :: Loss: 1.640 | Acc: 44.218% (4187/9469)
             Validation Step 491/491 :: Loss: 1.314 | Acc: 56.739% (2227/3925)
    Memory allocated at_end :  457.4370117188 MB
    diff :  0.0

The values above are taken from torch.cuda.max_memory_allocated().
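
For context, a minimal sketch of how these per-epoch numbers can be collected; the loop variables and the train_one_epoch / validate helpers are placeholders, not the actual code from the report:

    import torch

    MB = 1024 ** 2

    for epoch in range(1, num_epochs + 1):
        start = torch.cuda.max_memory_allocated() / MB
        print(f"[INFO] Epoch: {epoch}")
        print(f"Memory allocated at_start: {start} MB")

        train_one_epoch(model, optimizer, train_loader)   # placeholder
        validate(model, val_loader)                       # placeholder

        end = torch.cuda.max_memory_allocated() / MB
        print(f"Memory allocated at_end : {end} MB")
        print(f"diff : {end - start}")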

The model is saved as follows:

        # save a checkpoint only once the best validation accuracy exceeds the configured threshold
        if self.best_acc > self.min_acc:
            state = {
                'net': self.model.state_dict(), 'acc': accuracy,
                'epoch': epoch, 'opt': self.optimizer.state_dict(),
            }
            if self.scheduler is not None:
                state['scheduler'] = self.scheduler.state_dict()

            torch.save(state, self.state_filename)

We are reporting this because we have a custom BatchNorm implemented on top of torch's BatchNorm, and it shows even more leakage.

torch.cuda.max_memory_allocated() returns the peak memory usage, so it would be interesting to hear more about your debugging and how you've tried to infer a memory leak from it.
E.g. what did memory_summary() return, and did you see “lost” allocations?
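
For reference, the summary can be printed directly; a minimal snippet (the device argument is optional):

    import torch

    # print a human-readable breakdown of the caching allocator's state,
    # including active, inactive, and non-releasable blocks
    print(torch.cuda.memory_summary(device=None, abbreviated=False))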

We found the reason for this slight increase: the state variable used for saving the checkpoint (shown above) was never released during the subsequent training epochs.
Adding state = None after saving the checkpoint helped.
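
For clarity, a sketch of the amended save path (the same snippet as above, with the reference dropped after the save):

        if self.best_acc > self.min_acc:
            state = {
                'net': self.model.state_dict(), 'acc': accuracy,
                'epoch': epoch, 'opt': self.optimizer.state_dict(),
            }
            if self.scheduler is not None:
                state['scheduler'] = self.scheduler.state_dict()

            torch.save(state, self.state_filename)
            state = None  # drop the reference so the checkpoint dict can be freed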

The issue is observed only with models containing BatchNorm layers and likely has to do with the running-statistics buffers, since the same memory increase is not observed with track_running_stats=False.
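
As an illustration of that difference (a minimal sketch with standalone layers, not the model used above): with track_running_stats=False, a BatchNorm layer keeps no running-statistics buffers, so there is nothing extra to serialize in the checkpoint.

    import torch.nn as nn

    bn_tracked = nn.BatchNorm2d(64)                                # keeps running_mean / running_var buffers
    bn_untracked = nn.BatchNorm2d(64, track_running_stats=False)   # buffers are None

    print(list(bn_tracked.state_dict().keys()))
    # ['weight', 'bias', 'running_mean', 'running_var', 'num_batches_tracked']
    print(list(bn_untracked.state_dict().keys()))
    # ['weight', 'bias']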