We observe a slight memory leak when a model containing BatchNorm layers is saved. The leak appears during the epoch immediately following the save.
Example: MobileNetV2 on Imagenette, with the model saved only after reaching 40% training accuracy.
The leak is observed at epoch 5:
[INFO] Epoch: 1
Memory allocated at_start: 11.4487304688 MB
Training Step 1184/1184 :: Loss: 2.181 | Acc: 17.066% (1616/9469)
Validation Step 491/491 :: Loss: 1.983 | Acc: 28.204% (1107/3925)
Memory allocated at_end : 457.4116210938 MB
diff : 445.962890625
[INFO] Epoch: 2
Memory allocated at_start: 457.4116210938 MB
Training Step 1184/1184 :: Loss: 1.987 | Acc: 26.370% (2497/9469)
Validation Step 491/491 :: Loss: 1.790 | Acc: 33.885% (1330/3925)
Memory allocated at_end : 457.4116210938 MB
diff : 0.0 <--- No difference hereafter
[INFO] Epoch: 3
Memory allocated at_start: 457.4116210938 MB
Training Step 1184/1184 :: Loss: 1.886 | Acc: 31.144% (2949/9469)
Validation Step 491/491 :: Loss: 1.681 | Acc: 37.376% (1467/3925)
Memory allocated at_end : 457.4116210938 MB
diff : 0.0
[INFO] Epoch: 4
Memory allocated at_start: 457.4116210938 MB
Training Step 1184/1184 :: Loss: 1.819 | Acc: 35.020% (3316/9469)
Validation Step 491/491 :: Loss: 1.585 | Acc: 45.783% (1797/3925)
Memory allocated at_end : 457.4116210938 MB
diff : 0.0 <--- Model saved here
[INFO] Epoch: 5
Memory allocated at_start: 457.4116210938 MB
Training Step 1184/1184 :: Loss: 1.741 | Acc: 40.194% (3806/9469)
Validation Step 491/491 :: Loss: 1.579 | Acc: 45.860% (1800/3925)
Memory allocated at_end : 457.4370117188 MB
diff : 0.025390625 <--- Leak observed
[INFO] Epoch: 6
Memory allocated at_start: 457.4370117188 MB
Training Step 1184/1184 :: Loss: 1.640 | Acc: 44.218% (4187/9469)
Validation Step 491/491 :: Loss: 1.314 | Acc: 56.739% (2227/3925)
Memory allocated at_end : 457.4370117188 MB
diff : 0.0
The figures above are values of torch.cuda.max_memory_allocated().
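For reference, a minimal sketch of how such per-epoch figures can be collected (the helper name `mem_mb` is ours, not from the actual training script):

```python
import torch

def mem_mb() -> float:
    # Peak CUDA memory allocated by tensors, in MB.
    # Returns 0.0 when no CUDA device is available.
    if not torch.cuda.is_available():
        return 0.0
    return torch.cuda.max_memory_allocated() / (1024 ** 2)

# Per-epoch bookkeeping as in the log above:
#   start = mem_mb()
#   ... train + validate ...
#   end = mem_mb()
#   print(f"diff : {end - start}")
```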
The model is saved as follows:
if self.best_acc > self.min_acc:
    state = {
        'net': self.model.state_dict(),
        'acc': accuracy,
        'epoch': epoch,
        'opt': self.optimizer.state_dict(),
    }
    if self.scheduler is not None:
        state['scheduler'] = self.scheduler.state_dict()
    torch.save(state, self.state_filename)
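As a possible workaround (not verified against this exact leak), copying every state-dict tensor to CPU before serialization keeps torch.save from touching the CUDA allocator at all; the `cpu_state_dict` helper below is our sketch, not part of the script above:

```python
import io
import torch
import torch.nn as nn

def cpu_state_dict(module: nn.Module) -> dict:
    # Detach and copy every tensor (including BatchNorm running stats)
    # to CPU, so serialization involves no CUDA allocations.
    return {k: v.detach().cpu() for k, v in module.state_dict().items()}

# Small model with a BatchNorm layer, mirroring the report:
model = nn.Sequential(nn.Conv2d(3, 8, 3), nn.BatchNorm2d(8))

buf = io.BytesIO()
torch.save({'net': cpu_state_dict(model)}, buf)  # GPU memory untouched
```

Saving through a CPU copy also makes checkpoints loadable on machines without a GPU, at the cost of a one-off host-memory copy per save.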
We are reporting this issue because we have a custom BatchNorm implemented on top of torch's BatchNorm, and it shows even more leakage.