CPU RAM usage keeps increasing within each epoch and across all epochs (OSError: [Errno 12] Cannot allocate memory)

After monitoring CPU RAM usage, I find that RAM usage grows in every epoch. During an epoch, memory increases steadily, is not freed when the epoch ends, and keeps growing into the next epoch. So memory usage never plateaus after the first epoch, as it should. Eventually, after some number of epochs, this leads to an out-of-memory error on the CPU. Note that GPU memory stays constant after the first epoch.
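(For context, by CPU RAM I mean the process's resident set size; one simple way to track it, not necessarily the only one, is with psutil:)

import os
import psutil

def log_cpu_ram(tag=""):
    # resident set size of the current process, printed in MB
    rss = psutil.Process(os.getpid()).memory_info().rss
    print(f"{tag} CPU RAM (RSS): {rss / 1024**2:.1f} MB")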

From reading related posts I believe the problem is in my custom Dataset implementation, although I can't pinpoint where. I gather all image file names into a list, save it as a CSV, and then load that CSV into a dataframe inside the MyDataset class, since people on the forum have cautioned against using Python lists directly to hold the data.

Some things I have tried that didn't work:

  1. Setting num_workers = 0
  2. Using a NumPy array of file names instead of a Python list
  3. Decreasing the batch size to 1

With 100k images and batch size 50, the error appears around the 25th epoch; with 50k images it appeared around the 50th epoch.

This is a 2x super-resolution script with (16, 16, 4) input images and (32, 32, 4) output images.
Code:

  • train.py -> main training loop and file-name gathering (included below)
  • progressive_loader.py -> custom dataset implementation (included below)
  • prosrs.yaml -> hyperparameters & configuration file
  • generators.py -> model (a modified DenseNet-like architecture, ~50+ layers)

train.py:

import csv
import os
from time import time

import torch
from numpy.random import randint  # assumed source of randint (matches the 3-argument call below)
from torch.utils.data import DataLoader

# project-specific imports (module paths omitted here):
# get_filenames, IMG_EXTENSIONS, prosr, ProSR, MyDataset


def load_dataset(args):
    files = {'train':{},'test':{}}

    for phase in ['train','test']:
        for ft in ['source','target']:
            if args[phase].dataset.path[ft]:
                files[phase][ft] = get_filenames(
                    args[phase].dataset.path[ft], image_format=IMG_EXTENSIONS)
            else:
                files[phase][ft] = []

    return files['train'],files['test']


def main(args):

    ############### loading datasets #################
    train_files,test_files = load_dataset(args)
    print("Dataset images retrieved")

    num_images_to_train = 100000
    train_files['target'] = train_files['target'][:num_images_to_train]
    test_files['target'] = train_files['target'].copy()

    with open('imagepaths.csv', "w") as output:
        writer = csv.writer(output, lineterminator='\n')
        for val in train_files['target']:
            writer.writerow([val])


    # Dataset passing

    training_dataset = MyDataset(
        prosr.Phase.TRAIN,
        scale=args.data.scale,
        input_size=args.data.input_size,
        args=args,
        **args.train.dataset)

    training_data_loader = DataLoader(
        training_dataset, batch_size=args.train.batch_size, shuffle=False, num_workers=4)

    if len(test_files['target']):
        testing_dataset = MyDataset(
                prosr.Phase.VAL,
                scale=args.data.scale,
                input_size=None,
                args=args,
                **args.test.dataset)
        testing_data_loader = DataLoader(testing_dataset, batch_size=1, shuffle=False, num_workers=4)
    else:
        testing_dataset = None
        testing_data_loader = None


    start_epoch = 0
    lr = args.train.lr
    # save_dir = args.cmd.output
    steps_per_epoch = len(training_data_loader)
    total_steps = start_epoch * steps_per_epoch


    ############# start training ##############

    batchsize = args.train.batch_size
    print("Batch size = ", batchsize)
    print("Num batches size = ", len(training_data_loader))
    loss = []
    psnr_list = []
    # output_imgs = torch.zeros((len(trainer.training_dataset)*batchsize, 4, 32, 32))
    num_random = 100
    HR_imgs = torch.zeros((num_random, 4, 32, 32))
    output_imgs = torch.zeros((num_random, 4, 32, 32))
    random_indices = randint(0, (len(training_data_loader)*batchsize)-1, num_random)

    os.environ["CUDA_VISIBLE_DEVICES"] = "0"
    args.G.max_scale = max(args.data.scale)
    net_G = ProSR(**args.G).cuda()
    optimizer_G = torch.optim.Adam(
        [p for p in net_G.parameters() if p.requires_grad],
        lr=args.train.lr,
        betas=(0.9, 0.999),
        eps=1.0e-08)
    l1_criterion = torch.nn.L1Loss()


    #########################################################################
    for epoch in range(start_epoch + 1, args.train.epochs + 1):
        iter_start_time = time()
        epoch_start_time = time()
        net_G.train()
        epoch_loss = 0
        print("Epoch: ", epoch)
        for i, data in enumerate(training_data_loader):

            # Forward and backward pass
            lr = data['input'].cuda()
            hr = data['target'].cuda()
            interpolated = data['bicubic'].cuda()
            output_batch = net_G(lr, upscale_factor=2) + interpolated
            optimizer_G.zero_grad()
            l1_loss = l1_criterion(output_batch, hr)
            l1_loss.backward()
            optimizer_G.step()

            epoch_loss += l1_loss
            total_steps += 1     
            #################################################################

progressive_loader.py:

from math import floor

import pandas as pd
import torch
from PIL import Image
from torch.utils.data import Dataset
from torchvision import transforms


def pil_loader(path, args, mode='RGBA'):
    with open(path, 'rb') as f:
        with Image.open(f) as img:
            return img.convert(mode)

def downscale_by_ratio(img, ratio=2, method=Image.BICUBIC):
    if ratio == 1:
        return img
    w, h = img.size
    w, h = floor(w / ratio), floor(h / ratio)
    return img.resize((w, h), method)



class MyDataset(Dataset):

    def __init__(self, phase, scale, input_size, args, mean,
                 stddev, downscale, **kwargs):

        self.phase = phase
        self.scale = 2
        self.mean = mean
        self.stddev = stddev
        self.args = args
        self.image_loader = pil_loader
        self.downscale=downscale

        self.data_frame = pd.read_csv("imagepaths.csv")

        # Input normalization
        self.normalize_fn = transforms.Compose([
            transforms.ToTensor(),
            transforms.Normalize(self.mean, self.stddev)
        ])

    def __len__(self):
        return len(self.data_frame)

    def __getitem__(self, index):
        return self.get(index)

    def get(self, index, scale=2):

        if torch.is_tensor(index):
            index = index.tolist()

        scale = 2
        ret_data = {}
        ret_data['scale'] = scale

        # Load target image
        if len(self.data_frame):
            target_img = self.image_loader(self.data_frame.iloc[index, 0], self.args)

            ret_data['target'] = target_img
            ret_data['target_fn'] = self.data_frame.iloc[index, 0]
            ret_data['input'] = downscale_by_ratio(
                ret_data['target'], scale, method=Image.BICUBIC)
            ret_data['input_fn'] = self.data_frame.iloc[index, 0]



            # Change Image.BICUBIC to Image.BILINEAR
            ret_data['bicubic'] = downscale_by_ratio(
                ret_data['input'], 1 / scale, method=Image.BICUBIC)

            ret_data['input'] = self.normalize_fn(ret_data['input'])
            ret_data['bicubic'] = self.normalize_fn(ret_data['bicubic'])
            if len(self.data_frame):
                ret_data['target'] = self.normalize_fn(ret_data['target'])

        return ret_data

Can someone suggest where the problem is?


It seems you are storing the complete computation graph in this line of code:

epoch_loss += l1_loss

If you want to use epoch_loss for printing/debugging purposes (i.e. without wanting to call epoch_loss.backward() in the future), you should detach the l1_loss before accumulating it via:

epoch_loss += l1_loss.detach() # or .item()
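
For reference, a minimal sketch of what the corrected accumulation could look like in the posted training loop (net_G, l1_criterion, optimizer_G, and training_data_loader are the names from the snippet above). Using .item() stores only a Python float, so no reference to the computation graph survives the iteration:

epoch_loss = 0.0
for i, data in enumerate(training_data_loader):
    lr = data['input'].cuda()
    hr = data['target'].cuda()
    interpolated = data['bicubic'].cuda()

    output_batch = net_G(lr, upscale_factor=2) + interpolated
    optimizer_G.zero_grad()
    l1_loss = l1_criterion(output_batch, hr)
    l1_loss.backward()
    optimizer_G.step()

    # .item() returns a plain Python float; the graph built for this batch
    # can be freed as soon as the next iteration reassigns l1_loss
    epoch_loss += l1_loss.item()

print("Average L1 loss:", epoch_loss / len(training_data_loader))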

Yup, this was the issue. Appreciate your help.


Thanks, your solution resolved the issue in my case as well.

I understand the issue and the solution you provided, but I do not understand why it causes the steady increase in CPU memory. Could you please point me to some reference materials?

In the posted example l1_loss is attached to the computation graph, which stores all intermediates needed to compute the gradients in the backward call. Once backward is executed, PyTorch deletes the computation graph and frees the intermediate tensors, since they are not needed anymore. However, PyTorch can only delete tensors if no object stores a reference to them anymore.

epoch_loss += l1_loss

will accumulate l1_loss into epoch_loss and thus keep references to the entire computation graph alive, which increases the memory usage in each step.
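
A toy example (a sketch with a hypothetical model, not taken from the original code) showing the difference between accumulating the attached tensor and accumulating a detached value:

import torch

model = torch.nn.Linear(10, 1)
criterion = torch.nn.MSELoss()

running_loss_graph = 0      # accumulates tensors that are attached to the graph
running_loss_value = 0.0    # accumulates plain Python floats

for _ in range(100):
    x = torch.randn(32, 10)
    y = torch.randn(32, 1)
    model.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()

    running_loss_graph += loss           # keeps every iteration's graph nodes alive
    running_loss_value += loss.item()    # stores only the scalar value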


I see. Is there any way to forcefully free all existing computation graphs?


Yes, you have to delete all references to the computation graph and Python will free it via its GC mechanism.
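
For example, a minimal sketch of dropping the last reference:

import gc
import torch

x = torch.randn(4, 10, requires_grad=True)
loss = x.pow(2).mean()
loss.backward()

del loss      # drop the last reference to the graph's output
gc.collect()  # usually not needed: CPython frees it immediately via reference counting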


Thanks! This solution saved my life.

Sorry, I would like to ask: why, even if the tensor is moved to the GPU, does the increased usage still show up in CPU RAM? Is it because the computation graph is stored in CPU RAM?

If intermediate activations were computed and stored on the CPU, moving the final loss to the GPU will not move the entire computation graph as well (assuming this was your question).

If I put the tensor on the GPU before it enters the model, will the intermediate activations still be calculated and stored on the CPU? Will the computation graph be stored on the CPU?


    for inputs, targets in train_loader:
        
        inputs = inputs.to(device)
        targets = targets.to(device)

        preds = model(inputs)
        loss = loss_fn(preds, targets)

Another question I'm very curious about: even when the input data has already been moved to the GPU, the following line of code still causes CPU RAM usage to increase. I expected GPU memory usage to increase instead, so what exactly causes the increase in CPU RAM?

epoch_loss += l1_loss

No, since the tensor, and thus the computation and the creation of the output tensor, will stay on the GPU.
The tensor placement defines which device is used for the execution of the operation.
E.g. out = a * b will perform the multiplication on the same device where a and b are located and will create out there as well.
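
A small sketch of that rule (assuming a CUDA device is available):

import torch

a = torch.randn(3, device='cuda')
b = torch.randn(3, device='cuda')
out = a * b
print(out.device)  # cuda:0 -- the result is created on the same device as the inputs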

Could you post a minimal code snippet showing this behavior?

First of all, thank you very much for your response.
Sure, the following code causes a sharp increase in RAM usage (even though the tensors are moved to the GPU).

import torch
import sys
from tqdm import tqdm

b = torch.zeros(5).cuda()

d1 = torch.zeros(1,requires_grad=True).cuda()
d2 = torch.zeros(1,requires_grad=True).cuda()
d3 = torch.zeros(1,requires_grad=True).cuda()
d4 = torch.zeros(1,requires_grad=True).cuda()
d5 = torch.zeros(1,requires_grad=True).cuda()
d6 = torch.zeros(1,requires_grad=True).cuda()
d7 = torch.zeros(1,requires_grad=True).cuda()
d8 = torch.zeros(1,requires_grad=True).cuda()

for i in tqdm(range(1000000)):
    b +=(d1*d2*d3*d4*d5*d6*d7*d8)

The reason I wrote this sample program is that my custom loss function indeed contains a similar kind of computation.

Responded in your cross-post.