I run out of memory after a certain number of batches when training a resnet18

lalord · April 16, 2017, 7:31pm

Hi, I am running a slightly modified version of resnet18 (I just added one extra conv layer and batchnorm layer at the beginning of the network). When I start iterating over my dataset, training starts fine, but after some iterations I run out of memory. If I reduce the batch size, training runs for some more iterations, but it always ends up running out of memory.

Could you help me find my memory leak?

Oh, one more thing: if I select one batch and always iterate over that same batch, the network runs just fine, so it seems to be a problem with the data loader. It seems to keep references to memory which aren't getting cleaned up, or something like that.

  def train():
      second_convnet = lalo.resnet2.resnet18(pretrained=False)
      if os.path.isfile(CHECKPOINT_OUTPUT_FILE):
          checkpoint = torch.load(CHECKPOINT_OUTPUT_FILE)
          second_convnet.load_state_dict(checkpoint)
          print("Checkpoint found, continuing with training...")
      else:
          print("No checkpoint found, training from scratch...")
      second_convnet.cuda()
      second_convnet.train()
      criterion = torch.nn.CrossEntropyLoss().cuda()
      learning_rate = 0.1
      momentum = 0.9
      weight_decay = 1e-4
      optimizer = torch.optim.SGD(second_convnet.parameters(), learning_rate,
                                  momentum=momentum,
                                  weight_decay=weight_decay)

      for i, (input, target) in enumerate(data_loader(PROCESSED_FOLDERS['training'], BATCH_SIZE)):
          output = second_convnet(input)
          loss = criterion(output, target)
          optimizer.zero_grad()
          loss.backward()
          optimizer.step()
          print("Batch {} processed succesfully".format(i))

  def data_loader(folder, batch_size):
      """ Our dataset is very unbalanced, so I am forcing our data loader
          to load the same amount of positive and negative samples.
      """
      patients_list = []
      labels_list = []
      while True:
          for _ in range(batch_size):
              label = random.choice(['0', '1'])
              patient_ids = os.listdir(os.path.join(folder, label))
              patient_id = random.choice(patient_ids)
              patient_path = os.path.join(folder, label, patient_id)
              patients_list.append(torch.load(patient_path))
              label = torch.Tensor([int(label)])
              labels_list.append(label)

          batch_variable = Variable(torch.stack(patients_list),
                                    requires_grad=False)
          batch_labels = torch.squeeze(Variable(torch.stack(labels_list),
                                                requires_grad=False))

          yield batch_variable, batch_labels.long().cuda()

smth · April 16, 2017, 9:00pm

I don't think it's a memory leak. You might be creating reference cycles, and Python can't deallocate your Variables without a garbage collection pass.

Try calling the garbage collector inside your training for loop:

import gc
gc.collect()
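
To illustrate why an explicit collection pass can matter, here is a minimal pure-Python sketch. `Node` and `make_cycle` are hypothetical names made up for the demo, and no PyTorch objects are involved. It only shows that objects in a reference cycle are not freed by reference counting alone, and that `gc.collect()` reclaims them.

```python
import gc
import weakref


class Node:
    """Hypothetical stand-in for an object that ends up in a reference cycle."""

    def __init__(self):
        self.other = None


def make_cycle():
    a, b = Node(), Node()
    a.other, b.other = b, a      # a and b now reference each other
    return weakref.ref(a)        # weak reference: does not keep `a` alive


gc.disable()                     # keep the automatic collector out of the demo
probe = make_cycle()
alive_before = probe() is not None   # True: the cycle keeps both objects alive
gc.collect()                         # explicit pass finds and frees the cycle
alive_after = probe() is not None    # False: the objects are gone
gc.enable()

print(alive_before, alive_after)
```

If memory keeps growing even with `gc.collect()` in the loop, the objects are most likely still reachable from live references, which a collection pass cannot free.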

Let me know if that works.

lalord · April 16, 2017, 9:42pm

Hey, thanks for the answer. I tried adding that line in the loop, but I still run out of memory after 3 iterations.

RuntimeError: cuda runtime error (2) : out of memory at /b/wheel/pytorch-src/torch/lib/THC/generic/THCStorage.cu:66

(I added the line after optimizer.step())

smth · April 16, 2017, 9:44pm

Then maybe you are holding onto some Variable for some reason?
Try adding the line:

del input, target

right after

 print("Batch {} processed succesfully".format(i))

lalord · April 16, 2017, 9:58pm

Still crashes after 3 iterations.

I used htop and nvidia-smi to see where the memory problem was, and it's definitely on the GPU. It's weird, because if I don't use my data_loader function and instead train on the same batch over and over manually, the model trains just fine.

smth · April 16, 2017, 10:00pm

I see. I wonder what is up. Can you still use your data loader, but make it return CPU Tensors, and then cast them to CUDA and wrap them in Variables inside the training loop?
I wonder if enumerate (or the iterator) queues up future iterations ahead of time.
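
A rough sketch of that suggestion, written against current PyTorch, where Tensor and Variable have been merged (since version 0.4) and no wrapping is needed. `cpu_batches` is a hypothetical stand-in for the real loader and yields random data. The device falls back to the CPU when no GPU is available.

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


def cpu_batches(num_batches, batch_size):
    """Hypothetical loader: yields plain CPU tensors only."""
    for _ in range(num_batches):
        inputs = torch.randn(batch_size, 3, 8, 8)
        targets = torch.randint(0, 2, (batch_size,))
        yield inputs, targets


seen_devices = []
for inputs, targets in cpu_batches(num_batches=3, batch_size=4):
    # The transfer happens here, so only the current batch lives on the device.
    inputs = inputs.to(device)
    targets = targets.to(device)
    seen_devices.append(inputs.device.type)
    # forward pass, loss.backward() and optimizer.step() would go here

print(seen_devices)
```

Keeping the transfer in the training loop makes it easier to see which code owns GPU memory, whatever the loader does internally.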

lalord · April 16, 2017, 10:25pm

Hey I found the bug. I wasn’t clearing the lists in which I stored the batches properly!

In my data_loader:

  def data_loader(folder, batch_size):
      """ Our dataset is very unbalanced, so I am forcing our data loader
          to load the same amount of positive and negative samples.
      """
      patients_list = []  # Mistake!
      labels_list = []  # Mistake!
      while True:
          for _ in range(batch_size):
              label = random.choice(['0', '1'])
              patient_ids = os.listdir(os.path.join(folder, label))
              patient_id = random.choice(patient_ids)
              patient_path = os.path.join(folder, label, patient_id)
              patients_list.append(torch.load(patient_path))
              label = torch.Tensor([int(label)])
              labels_list.append(label)

          batch_variable = Variable(torch.stack(patients_list),
                                    requires_grad=False)
          batch_labels = torch.squeeze(Variable(torch.stack(labels_list),
                                                requires_grad=False))

          yield batch_variable, batch_labels.long().cuda()

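patients_list = [] and labels_list = [] should be inside the while loop! Otherwise I keep appending to the same lists forever, so every batch is bigger than the previous one. That also explains why training on a single fixed batch worked fine.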
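
The effect can be reproduced without PyTorch or any files. In the simplified sketch below, `leaky_loader` and `fixed_loader` are hypothetical stand-ins for the buggy and corrected data_loader, and random floats replace the loaded samples.

```python
import itertools
import random


def leaky_loader(batch_size):
    samples = []                      # created once, never cleared
    while True:
        for _ in range(batch_size):
            samples.append(random.random())
        yield list(samples)           # grows by batch_size on every step


def fixed_loader(batch_size):
    while True:
        samples = []                  # fresh list for every batch
        for _ in range(batch_size):
            samples.append(random.random())
        yield samples


leaky_sizes = [len(b) for b in itertools.islice(leaky_loader(4), 3)]
fixed_sizes = [len(b) for b in itertools.islice(fixed_loader(4), 3)]
print(leaky_sizes)  # [4, 8, 12]
print(fixed_sizes)  # [4, 4, 4]
```

With the lists created outside the loop, the n-th batch holds n * batch_size samples. That matches a crash after only a few iterations, and a smaller batch size only delaying it.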

Thanks for taking the time to help me debug this!

Jiqing_Zhang · December 18, 2018, 3:32am

Hi, I had the same problem with this code:

  s_list = []
  for i in range(x1.size()[1]):
      x3 = x1[:,i,:,:].unsqueeze(1)
      x = self.up(x3)
      s_list.append(x)

self.up is a pretrained network.
Could you give me some suggestions?

ptrblck · December 18, 2018, 11:30am

If you would like to store x in s_list and don’t need to call backward on these tensors in the future, you should store it using s_list.append(x.detach()), since otherwise the computation graph will be stored with each x, which eventually might use all your memory.
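
A small sketch of the difference, assuming a single `Conv2d` layer as a hypothetical stand-in for self.up and a random tensor for x1:

```python
import torch

# Hypothetical stand-in for the pretrained network self.up.
up = torch.nn.Conv2d(1, 2, kernel_size=3, padding=1)
x1 = torch.randn(2, 3, 8, 8)

with_graph, detached = [], []
for i in range(x1.size(1)):
    x3 = x1[:, i, :, :].unsqueeze(1)   # shape (2, 1, 8, 8)
    x = up(x3)
    with_graph.append(x)               # keeps the graph behind x alive
    detached.append(x.detach())        # same values, no graph attached

print(with_graph[0].grad_fn is not None)  # True
print(detached[0].grad_fn is None)        # True
```

The detached tensors share storage with the originals but carry no autograd history, so the intermediate buffers saved for backward can be freed once `x` goes out of scope.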

Jiqing_Zhang · December 19, 2018, 12:54pm

Thank you very much!! It works.