Not enough CPU memory while training on CUDA

Training was stopped with the following message: DefaultCPUAllocator: not enough memory: you tried to allocate 58720256 bytes. Buy new RAM!

However, I assigned 1) my network (binary image classification), 2) the input images (N*C*D*H*W = 32*1*7*256*256), and 3) the labels (32*1) to my GPU (2080 Ti). I don’t think training should demand much CPU memory, since the Dataset (including random image augmentation) and the DataLoader are the only parts that run on the CPU.

From Task Manager, I found that training uses about 24 GB of CPU memory.
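For reference, the same number can also be tracked from inside the script instead of Task Manager. This is just a sketch using psutil, not part of my actual code:

import os
import psutil

process = psutil.Process(os.getpid())

def log_cpu_memory(tag=""):
    # resident set size of this Python process, in GB
    rss_gb = process.memory_info().rss / 1024 ** 3
    print(f"{tag} CPU RSS: {rss_gb:.2f} GB")

Calling log_cpu_memory(f"epoch {epoch}") once per epoch would show whether the usage keeps growing (a leak) or just sits at a high constant.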

Why does training require so much CPU memory when I am training on CUDA? Thanks in advance!! :slight_smile:

24GB seems to be quite a lot.
Are you storing any activations or checkpoints, or preloading the dataset?
Could you also post your code so that we can have a look?
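For example, a common way to accumulate memory without noticing is keeping references to tensors that are still attached to the autograd graph. A hypothetical sketch, not taken from your code:

import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
losses = []

for step in range(100):
    inputs = torch.randn(32, 10)
    targets = torch.randn(32, 1)
    optimizer.zero_grad()
    loss = F.mse_loss(model(inputs), targets)
    loss.backward()
    optimizer.step()

    # appending the tensor keeps its autograd history alive for every iteration
    losses.append(loss)
    # appending a plain Python float releases it instead:
    # losses.append(loss.item())

If something like the first variant is in your training loop, memory will grow with every iteration.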

I instantiated the SummaryWriter (tensorboardX) before training begins and use it to store the graph every epoch via add_graph. Could that be the problem?

Below is my code snippet:

import torch
import numpy as np
from torch.utils.data import DataLoader
from tensorboardX import SummaryWriter
import torch.optim as optim
import torch.nn.functional as F
import pandas as pd
import data_provider
import network_fin
import performance_assess
torch.autograd.set_detect_anomaly(True)

my_cuda_device_number = 0
my_cuda_device = torch.device("cuda:"+str(my_cuda_device_number))
max_epoch=500
my_batch_size = 32

torch.backends.cudnn.benchmark = True
torch.backends.cudnn.fastest = True
net = network_fin.Net_R21D().to(my_cuda_device)

.....

optimizer = optim.SGD(net.parameters(), lr=0.01, momentum=0.9, weight_decay=1e-2, nesterov=False)

scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min', factor=0.3, patience=7, min_lr=1e-9)

ds_tr = data_provider.CustomDataset_oversampling(csv_tr, transform=True, mask_resize_ratio=8)
ds_val = data_provider.CustomDataset(csv_val, transform=False, mask_resize_ratio=8)

sampler = torch.utils.data.sampler.WeightedRandomSampler(sample_weights, len(sample_weights))

dataloader_tr = DataLoader(ds_tr, batch_size=my_batch_size, shuffle=False, sampler=sampler, drop_last=True)
dataloader_val = DataLoader(ds_val, batch_size=my_batch_size, shuffle=False)


writer = SummaryWriter(result_save_directory)

for epoch in range(max_epoch):

    torch.set_grad_enabled(True)
    net.train()

    for i, data_tr in enumerate(dataloader_tr, 0):
        inputs_tr_tensor, labels_temp_tr_tensor, patient_number_temp_tr_tensor = data_tr

        inputs_tr_tensor = inputs_tr_tensor.cuda(my_cuda_device)
        labels_temp_tr_tensor = labels_temp_tr_tensor.cuda(my_cuda_device)
        optimizer.zero_grad()
        logits_category_tr = net(inputs_tr_tensor)
        logits_category_tr = logits_category_tr.double().view(inputs_tr_tensor.size()[0])

        loss_BCE_category_tr = F.binary_cross_entropy_with_logits(logits_category_tr, labels_temp_tr_tensor, reduction='none')
        # focal-style reweighting of the per-sample BCE loss
        loss_focal_category_tr = ((1 - torch.exp(-loss_BCE_category_tr)) ** 2) * loss_BCE_category_tr
        probs_temp_tr_tensor = torch.sigmoid(logits_category_tr)
        preds_temp_tr_tensor = probs_temp_tr_tensor.round()
        torch.mean(loss_focal_category_tr).backward()
        optimizer.step()

        # custom performance recorder update
        performance_recorder_trainingset.update_one_epoch_list(accuracy=......)

        step_number += 1

    torch.set_grad_enabled(False)
    net.eval()
    for i_val, data_val in enumerate(dataloader_val, 0):
        # ...validation loop
        .....

    # log the model graph to TensorBoard once per epoch
    writer.add_graph(net, inputs_val_tensor)

    scheduler.step(list(performance_recorder_trainingset.get_loss_list())[-1])


torch.save(net.state_dict(), result_save_directory + '/model_save.pth')
print('Training Finished')

torch.save(net, result_save_directory + '/model_fin.pth')

Thank you!!

I’m not sure, but I assume this shouldn’t be the case.
Are you seeing the large memory usage without using tensorboard?

However, you are also using detect_anomaly, which could slow down your code significantly.
Could you disable it and check the memory usage again?
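For example, it could be guarded behind a flag so it is only enabled while you are actively debugging (just a sketch):

import torch

DEBUG_AUTOGRAD = False  # set to True only while hunting for NaNs/Infs
torch.autograd.set_detect_anomaly(DEBUG_AUTOGRAD)

# or limit it to a single suspicious forward/backward pass:
# with torch.autograd.detect_anomaly():
#     loss = net(inputs).mean()
#     loss.backward()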

Thank you for the response.

I disabled detect_anomaly and checked the CPU memory usage. There is indeed a memory leak while running this code, and it persists even after disabling detect_anomaly. Below is the result of perfmon (Windows):

In addition, I found that my GPU memory does not leak. I am running this code in a Python virtual environment on Windows.
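For reference, the GPU side can be checked with the allocator statistics as well. A sketch, not part of my original script:

import torch

def log_gpu_memory(device, tag=""):
    # current and peak memory allocated by tensors on this device, in GB
    allocated = torch.cuda.memory_allocated(device) / 1024 ** 3
    peak = torch.cuda.max_memory_allocated(device) / 1024 ** 3
    print(f"{tag} GPU allocated: {allocated:.2f} GB (peak {peak:.2f} GB)")

Calling log_gpu_memory(my_cuda_device, f"epoch {epoch}") at the end of each epoch matches what I am seeing: GPU memory stays flat while the per-process memory in perfmon keeps climbing.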

At that point I had not yet disabled tensorboardX.

I disabled tensorboardX and its add_graph call, which solved the problem.
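If the graph is still wanted in TensorBoard, a possible compromise (a sketch, reusing the writer, net, and inputs_val_tensor from the snippet above) is to log it only once instead of every epoch:

# log the model graph a single time instead of in every epoch
if epoch == 0:
    writer.add_graph(net, inputs_val_tensor)

# and close the writer once training is finished
writer.close()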