Hello,
I’m trying to fit a deeper model than my current one onto the GPU and run it. My current autoencoder model takes 4 GB of GPU memory, as reported by nvidia-smi and GPUtil. The newer model I’m trying to fit takes no more than 6 GB according to nvidia-smi.
Everything seems to run smoothly until my code reaches the loss.backward() statement. With the 4 GB model plus 3 GB of tensors (input, reconstructed, and corrupted images), GPU memory utilization is around 7 GB just before that call. When the code reaches loss.backward(), GPU memory usage balloons to 23.95 GB, and it stays there until the end of all epochs.
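As a rough illustration of why memory can jump at loss.backward() and then stay flat, here is a back-of-the-envelope sketch. It assumes fp32 parameters, that backward() allocates one gradient tensor per parameter (the same size as the parameter), and that torch.optim.Adam with default settings (amsgrad=False) lazily allocates two state tensors per parameter (exp_avg and exp_avg_sq) on its first step(). The parameter count used below is a hypothetical round number, not measured from this model, and activations and the CUDA context are ignored. Note also that nvidia-smi shows memory reserved by PyTorch's caching allocator, which is normally kept rather than returned to the driver once tensors are freed.

```python
# Back-of-the-envelope estimate of the parameter-dependent part of training
# memory (activations and CUDA context are ignored).
# Assumptions: fp32 (4 bytes per value), one .grad per parameter, and Adam's
# two per-parameter buffers (exp_avg, exp_avg_sq) with amsgrad=False.

BYTES_PER_FP32 = 4
GIB = 1024 ** 3


def memory_timeline(num_params: int) -> dict:
    """Return cumulative bytes held after each stage of the first iteration."""
    params = num_params * BYTES_PER_FP32
    grads = num_params * BYTES_PER_FP32           # allocated during backward()
    adam_state = 2 * num_params * BYTES_PER_FP32  # allocated on first step()
    return {
        "before backward (params)": params,
        "after backward (+grads)": params + grads,
        "after first step (+Adam state)": params + grads + adam_state,
    }


if __name__ == "__main__":
    # Hypothetical round number, for illustration only.
    for stage, nbytes in memory_timeline(1_000_000_000).items():
        print(f"{stage:32s} {nbytes / GIB:6.2f} GiB")
```

Under these assumptions, steady-state memory for the parameter-dependent part is about four times the size of the parameters themselves, independent of batch size.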
I have a total of 24 GB of GPU memory available. Hence, if I try to increase the number of filters in my CNN autoencoder or add more layers, I get the following error:
RuntimeError: CUDA out of memory. Tried to allocate 3.74 GiB (GPU 2; 23.65 GiB total capacity; 22.79 GiB already allocated; 31.50 MiB free; 22.83 GiB reserved in total by PyTorch)
I can’t figure out why loss.backward() would take such a large amount of memory. I’m using the Adam optimizer. Another interesting thing: my current batch size is 100, and if I increase it to 500 there doesn’t seem to be any large memory problem; only 1-2 GB more memory is used. Hence the main cause of the memory ballooning is loss.backward(). I make sure to clear past gradients with model.zero_grad(), I delete tensors from the GPU once they are no longer needed, and I call torch.cuda.empty_cache(). Nothing seems to make a difference.
One last thing: I parallelized my autoencoder across three similar GPUs by splitting it into three parts. When I call loss.backward(), each GPU uses about 17 GB of memory. However, as soon as I increase the number of filters in the model even modestly, I get the out-of-memory errors again. I’m using ImageNet images as data (224×224×3).
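To see how the spatial size evolves for a 224×224×3 input, and why batch size has a comparatively small effect, here is a pure-Python sketch of the output-size formulas for Conv2d and ConvTranspose2d (assuming dilation 1 and no output_padding), applied to the layer hyperparameters from the model printout in this post. It counts only the output of each conv/deconv layer per sample; BatchNorm outputs and other tensors saved for backward would add more, so treat the result as a lower bound on activation memory.

```python
# Output-size formulas with dilation=1 and output_padding=0.


def conv_out(size, k, s, p):
    """Spatial output size of Conv2d."""
    return (size + 2 * p - k) // s + 1


def deconv_out(size, k, s, p):
    """Spatial output size of ConvTranspose2d."""
    return (size - 1) * s - 2 * p + k


# (name, kind, out_channels, kernel, stride, padding) from the printed model
LAYERS = [
    ("conv1", "conv", 256, 6, 4, 1),
    ("conv2", "conv", 512, 6, 4, 1),
    ("conv3", "conv", 10000, 14, 1, 0),
    ("deconv1", "deconv", 512, 14, 1, 0),
    ("deconv2", "deconv", 256, 8, 4, 2),
    ("deconv3", "deconv", 3, 8, 4, 2),
]


def trace(size=224):
    """Return (name, channels, spatial size, values per sample) per layer."""
    rows = []
    for name, kind, ch, k, s, p in LAYERS:
        if kind == "conv":
            size = conv_out(size, k, s, p)
        else:
            size = deconv_out(size, k, s, p)
        rows.append((name, ch, size, ch * size * size))
    return rows


if __name__ == "__main__":
    rows = trace()
    per_sample = sum(r[3] for r in rows)
    for name, ch, size, n in rows:
        print(f"{name:8s} {ch:6d} x {size:3d} x {size:3d} = {n:,} values")
    # fp32 layer outputs for a batch of 100 (lower bound on activations)
    print(f"batch 100: {per_sample * 100 * 4 / 1024**3:.2f} GiB")
```

The layer outputs scale linearly with batch size, while parameters, gradients, and optimizer state do not.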
Here’s my autoencoder model:
```
AutoEncoder(
  (conv1): Conv2d(3, 256, kernel_size=(6, 6), stride=(4, 4), padding=(1, 1))
  (bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (conv2): Conv2d(256, 512, kernel_size=(6, 6), stride=(4, 4), padding=(1, 1))
  (bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (conv3): Conv2d(512, 10000, kernel_size=(14, 14), stride=(1, 1))
  (bn3): BatchNorm2d(10000, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (deconv1): ConvTranspose2d(10000, 512, kernel_size=(14, 14), stride=(1, 1), bias=False)
  (de_bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (deconv2): ConvTranspose2d(512, 256, kernel_size=(8, 8), stride=(4, 4), padding=(2, 2), bias=False)
  (de_bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (deconv3): ConvTranspose2d(256, 3, kernel_size=(8, 8), stride=(4, 4), padding=(2, 2), bias=False)
)
```
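One way to see where the parameter memory comes from is to count parameters per layer from the printout above. The sketch below assumes the usual formulas: a Conv2d or ConvTranspose2d weight holds in_ch × out_ch × kH × kW values, plus out_ch bias values when bias is enabled (the printout shows bias=False only for the deconv layers), and an affine BatchNorm2d has 2 × C learnable values (running_mean and running_var are buffers, not parameters). On a real model, `sum(p.numel() for p in model.parameters())` gives the total directly.

```python
# Parameter count per layer for the printed AutoEncoder (fp32 assumed).
# (name, in_ch, out_ch, kernel, has_bias) for conv / transposed-conv layers
CONVS = [
    ("conv1", 3, 256, 6, True),
    ("conv2", 256, 512, 6, True),
    ("conv3", 512, 10000, 14, True),
    ("deconv1", 10000, 512, 14, False),
    ("deconv2", 512, 256, 8, False),
    ("deconv3", 256, 3, 8, False),
]
# Channel counts of the affine BatchNorm2d layers (weight + bias each)
BATCHNORMS = {"bn1": 256, "bn2": 512, "bn3": 10000, "de_bn1": 512, "de_bn2": 256}


def param_counts():
    """Return a dict mapping layer name to its number of learnable values."""
    counts = {}
    for name, cin, cout, k, bias in CONVS:
        counts[name] = cin * cout * k * k + (cout if bias else 0)
    for name, c in BATCHNORMS.items():
        counts[name] = 2 * c
    return counts


if __name__ == "__main__":
    counts = param_counts()
    total = sum(counts.values())
    for name, n in sorted(counts.items(), key=lambda kv: -kv[1]):
        print(f"{name:8s} {n:>15,d} params  {n * 4 / 1024**3:7.3f} GiB (fp32)")
    print(f"total    {total:>15,d} params  {total * 4 / 1024**3:7.3f} GiB (fp32)")
```

By this count, conv3 and deconv1 (the two 14×14 layers around the 10000-channel bottleneck) account for over 99% of the parameters, so each filter added there is multiplied by the gradient and optimizer state that accompany every parameter.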
Here’s the training loop:
```python
import numpy as np
import torch
from tqdm import tqdm

torch.cuda.empty_cache()
total_loss = []
total_loss_true = []
for epoch in range(max_epoch):
    print("Epoch: ", epoch)
    autoencoder_model.train()
    epoch_loss = []
    epoch_loss_true = []
    progress = tqdm(total=num_batches, desc='epoch % 3d' % (epoch + 1))
    for i in range(num_batches):
        print("Batch: ", i)
        X_clean = torch.zeros((batch_size, ch, h, w))
        X_corrupted = torch.zeros((batch_size, ch, h, w))
        # targets = torch.zeros(batch_size)
        for j in range(batch_size):
            tens, blah = processed_loader.dataset[i * batch_size + j]
            X_clean[j] = preprocessing(tens)
            # Corrupt image
            opened_img, blah = processed_loader.dataset[i * batch_size + j]
            opened_img_np = np.array(opened_img)
            cor_img_float64 = impulse_noise(opened_img, severity)
            # Very Important -> Conversion from float64 to uint8
            cor_img_uint8 = cor_img_float64.astype('uint8')
            corrupted_tensor = convert_to_tensor(cor_img_uint8)
            norm_corrupted_tensor = normalization(corrupted_tensor)
            X_corrupted[j] = norm_corrupted_tensor
        ################## Get Training & Target Dataset ##################
        # torch.zeros already returns float32 tensors, and Variable is a no-op
        # since PyTorch 0.4, so moving to the device is all that is needed.
        X_clean = X_clean.to(device)
        X_corrupted = X_corrupted.to(device)
        ################## Train and Backpropagation ##################
        optimizer_model.zero_grad()
        codes, rec_X = autoencoder_model(X_corrupted)
        loss = calculate_loss(true_data=X_corrupted, pred_data=rec_X)
        loss_true = calculate_loss(true_data=X_clean, pred_data=rec_X)
        print("True Loss: ", loss_true)
        print(X_clean.is_cuda)
        del X_clean
        del X_corrupted
        del codes
        del rec_X
        torch.cuda.empty_cache()
        loss.backward()
        optimizer_model.step()
        epoch_loss.append(loss.item())
        epoch_loss_true.append(loss_true.item())
        progress.set_postfix({'loss': loss.item()})
        progress.update()
    progress.close()
    total_loss.append(np.mean(epoch_loss))
    total_loss_true.append(np.mean(epoch_loss_true))
```
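The del statements before loss.backward() only remove Python names. Tensors that autograd saved for the backward pass (for example, the input to the first convolution and the arguments of the loss) are still referenced from the graph reachable through loss.grad_fn, so they stay allocated until backward() frees the saved buffers or every object referencing the graph (here both loss and loss_true) is released; torch.cuda.empty_cache() can only return cached blocks that no live tensor occupies. The pure-Python sketch below illustrates this reference behaviour with hypothetical stand-in classes (FakeTensor, FakeGradFn, FakeLoss) and weakref to observe when an object is actually freed. It is an analogy, not PyTorch's autograd implementation.

```python
import gc
import weakref


class FakeTensor:
    """Hypothetical stand-in for a tensor; used only to observe its lifetime."""


class FakeGradFn:
    """Hypothetical stand-in for an autograd node that saves its inputs."""

    def __init__(self, *saved):
        self.saved = list(saved)  # like tensors saved for backward

    def release(self):
        self.saved.clear()  # analogous to saved buffers being freed by backward()


class FakeLoss:
    """Hypothetical stand-in for a loss tensor holding a grad_fn."""

    def __init__(self, grad_fn):
        self.grad_fn = grad_fn


x_corrupted = FakeTensor()
alive = weakref.ref(x_corrupted)
loss = FakeLoss(FakeGradFn(x_corrupted))

del x_corrupted  # removes the name only; the "graph" still holds the object
gc.collect()
still_alive_after_del = alive() is not None

loss.grad_fn.release()  # "backward" drops the saved reference
gc.collect()
freed_after_backward = alive() is None

print(still_alive_after_del, freed_after_backward)
```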