I create many models for federated learning and keep them in CPU RAM, moving each one to CUDA only for training and evaluation.
I noticed that CPU memory is not freed when the model is moved to CUDA.
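For context, here is a minimal sketch of the pattern I am using (the class and names below are simplified stand-ins, not my actual code): every client model lives in host RAM and is only moved to the GPU for its local round.

import torch
import torch.nn as nn

class Client:
    def __init__(self):
        # the model is kept in host (CPU) RAM between rounds
        self.model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10))

    def train_round(self, device):
        self.model = self.model.to(device)   # move to GPU for local training
        # ... local training loop ...
        self.model = self.model.cpu()        # move back so the GPU is free for the next client

clients = [Client() for _ in range(100)]     # many such models held in RAM
for _ in range(50):                          # federated rounds
    for c in clients:
        c.train_round(torch.device("cuda"))

Here is the memory_profiler output for the real train():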
Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
    47   3719.9 MiB   3719.9 MiB           1   @profile
    48                                         def train(self):
    49   3719.9 MiB      0.0 MiB           1       self.model = self.model.to(Config.device)
    50   3719.9 MiB      0.0 MiB           1       optimizer = torch.optim.Adam(self.model.parameters(), lr=Config.learning_rate)
    51   3719.9 MiB      0.0 MiB           1       self.model.train()
The model is moved back to the CPU when training is complete:
    69   3719.9 MiB      0.0 MiB           6       for epoch in range(epochs):
    70   3719.9 MiB      0.0 MiB           5           batch_loss_list = []
    71   3719.9 MiB      0.0 MiB          60           for data in self.loader:
    72   3719.9 MiB      0.0 MiB          55               x = data[0].to(Config.device)
    73   3719.9 MiB      0.0 MiB          55               y = data[1].to(Config.device)
    74   3719.9 MiB      0.0 MiB          55               loss, y_ = self.train_batch(x, y)
    75   3719.9 MiB      0.0 MiB          55               batch_loss_list.append(loss)
    76   3719.9 MiB      0.0 MiB           5           mean_loss = np.mean(batch_loss_list)
    77   3719.9 MiB      0.0 MiB           5           if mean_loss < Config.local_loss_threshold:
    78                                                     break
    79   3719.9 MiB      0.0 MiB           5           self.schedule.step()
    80   3719.9 MiB      0.0 MiB           5           self.logger.log_client_loss(self.client_id, epoch, np.mean(batch_loss_list).item())
    81   3721.7 MiB      1.8 MiB           1       self.model = self.model.cpu()
    82   3721.7 MiB      0.0 MiB           1       return loss
Each time the model is moved from CUDA back to the CPU, another 1.6-1.9 MiB of RAM is allocated and never released.
Over many rounds this consumed all of my memory and crashed the program.
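A bare loop like the one below (psutil is only used to read the process RSS; the model here is a placeholder, not my real one) isolates the cuda-to-cpu round trip that the profiler points to. I have not yet checked whether it grows by the same 1.6-1.9 MiB per iteration as the full training code, but it is what I would use to narrow the problem down.

import psutil
import torch
import torch.nn as nn

proc = psutil.Process()
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10))

for i in range(20):
    model = model.to("cuda")   # move to GPU
    model = model.cpu()        # move back to host RAM
    # print the resident set size of the process after each round trip
    print(f"round {i}: RSS = {proc.memory_info().rss / 2**20:.1f} MiB")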