CPU RAM usage increasing for every epoch

Hello,
I’m running into trouble while training a CAE (Convolutional Auto-Encoder) model. I defined my own dataset class as follows:

import os
import sys

import librosa
import torch
import torch.utils.data as data
from torchvision.datasets.folder import has_file_allowed_extension


def make_dataset(dir, class_to_idx, extensions):
    # Collect the paths of all .wav files, split into inputs (mixtures)
    # and labels (the 'Source' class).
    label_list = []
    input_list = []
    dir = os.path.expanduser(dir)
    for target in sorted(class_to_idx.keys()):
        d = os.path.join(dir, target)
        if not os.path.isdir(d):
            continue

        for root, _, fnames in sorted(os.walk(d)):
            for fname in sorted(fnames):
                if has_file_allowed_extension(fname, extensions) and fname.endswith('.wav'):
                    path = os.path.join(root, fname)
                    if target == 'Source':
                        label_list.append(path)
                    else:
                        input_list.append(path)
    return input_list, label_list



class msd_dataset(data.Dataset):

    def __init__(self, root, extensions, train=True, input_transform=None, target_transform=None):

        self.extensions = extensions
        self.train = train

        #Join the Train/Test to the existing path if user specifies Train=True/False respectively.
        if self.train:
            self.root = os.path.join(root, 'Train')
        else:
            self.root = os.path.join(root, 'Test')
          
        classes, class_to_idx = self._find_classes(self.root)
       
        samples = make_dataset(self.root, class_to_idx, extensions)
        # Sanity check: both the input list and the label list must be non-empty
        if len(samples[0]) == 0 or len(samples[1]) == 0:
            raise RuntimeError("Found 0 files in subfolders of: " + self.root)

        self.classes = classes
        self.samples = samples

        self.input_transform = input_transform
        self.target_transform = target_transform

    def _find_classes(self, dir):
       
        if sys.version_info >= (3, 5):
            # Faster and available in Python 3.5 and above
            classes = [d.name for d in os.scandir(dir) if d.is_dir()]
        else:
            classes = [d for d in os.listdir(dir) if os.path.isdir(os.path.join(dir, d))]
        classes.sort()
        class_to_idx = {classes[i]: i for i in range(len(classes))}
        return classes, class_to_idx

    def __getitem__(self, index):
        path, target = self.samples
        #Loading Audio with librosa 
        input_mixture, sr1 = librosa.load(path[index], sr=44100, mono=False)
        label_source, sr2 = librosa.load(target[index], sr=44100, mono=False)
        if self.input_transform is not None:
            input_mixture = input_mixture.reshape(input_mixture.shape[0], 1, input_mixture.shape[1])
            input_mixture = torch.from_numpy(input_mixture)
            input_mixture = self.input_transform(input_mixture)
        if self.target_transform is not None:
            label_source = label_source.reshape(label_source.shape[0], 1, label_source.shape[1])
            label_source = torch.from_numpy(label_source)
            label_source = self.target_transform(label_source)
        return input_mixture, label_source

    def __len__(self):
        return len(self.samples[0])

How can I fix this?

I have seen some posts regarding this memory issue, but I didn’t find any solution. I am already calling gc.collect() after every epoch.

Do you get any error message or do you have another issue regarding this Dataset?

PS: I’ve formatted your code. You can add code by wrapping it in three backticks ```.

I don’t have any issues with this code, but CPU RAM increases continuously while training. I am loading the WAV files one by one with the DataLoader.

Hi,
I am also facing a similar issue. In my case, I have all the features on disk as .pt files. I load them into RAM as global variables and index into them in the DataLoader. The problem is that CPU RAM increases every epoch, and after some epochs the process gets killed by the OS. My question is: since I have already loaded the features into memory and the DataLoader only indexes into them, how can this consume extra memory?
Thanks

@kunasiramesh, @Gkv The memory issue might be related to the training procedure or another part of the code.
Could you post the code so that we can have a look?
Usually the computation graph is unintentionally stored somewhere, e.g. by using losses += loss instead of losses += loss.item().
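
For illustration, here is a minimal, self-contained sketch (hypothetical model and data, not the code from this thread) showing the difference: accumulating the loss tensor keeps each iteration’s computation graph alive, while .item() stores only a plain Python float.

import torch
import torch.nn as nn

model = nn.Linear(10, 1)
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

running_loss = 0.0
for step in range(100):
    x, y = torch.randn(32, 10), torch.randn(32, 1)
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()

    # running_loss += loss        # stores the tensor and its whole graph every step
    running_loss += loss.item()   # stores only a Python float; the graph can be freed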


@ptrblck: just to understand a bit more about OOM issues, if the computation graphs are stored unintentionally (assuming GPUs are used for training), it should lead to GPU out of memory. Am I right?

Exactly! However, if you are using the model on CPU, your CPU RAM will be filled.
Maybe that’s the case here.

Hi,
I am posting the main part of the code here.


import gc
import math

import torch
import torch.nn as nn
from torch.utils.data import Dataset

# Feature dictionaries are loaded once into RAM and used as globals below.
global full_train_set1
global full_train_set2
global val_set1
global val_set2
full_train_set1=torch.load('path to the features1 dict.pt')
full_train_set2=torch.load('path to the features2 dict.pt')

val_set1=torch.load('path to the val features1 dict.pt')
val_set2=torch.load('path to the val features2 dict.pt')

class MyCustomDataset(Dataset):
    def __init__(self, ids, train):
        self.names = ids
        self.train = train

    def __getitem__(self, index):
        my_id = self.names[int(index)]
        if self.train:
            feature1 = full_train_set1[my_id]
            feature2 = full_train_set2[my_id]
            target = full_train_set1[my_id]['target']
        else:
            feature1 = val_set1[my_id]
            feature2 = val_set2[my_id]
            target = val_set1[my_id]['target']
        data = (feature1, feature2)
        return (data, target)

    def __len__(self):
        return len(self.names)



def train(epoch,train_loader,optimizer,criterian):
    print('Training epoch..',epoch)
    model.train()
    torch.cuda.empty_cache()
    train_loss=0
    b_cnt=0
    for data,target in train_loader:
        b_cnt=b_cnt+1
        data=list(data)
        data[0]=data[0].cuda()
        data[1]=data[1].cuda()
        target=target.cuda().float()
        optimizer.zero_grad()
        pred=model(data).cuda()
        loss=criterian(pred,target).cuda()
        train_loss+=math.sqrt(loss.item()) 
        loss.backward()
        optimizer.step()
        torch.cuda.empty_cache()
        gc.collect()
    torch.cuda.empty_cache()
    return train_loss/float(b_cnt)



def run():
    torch.cuda.empty_cache()
    n_epochs=100
    print("Running...")
    global n_batches
    n_batches=100
    global model
    model=my_model().cuda()
    model=nn.DataParallel(model,device_ids=[0,1,2])
    criterian=nn.MSELoss().cuda()
    train_ids=find_dataset(full_train_set1) # train_ids is a list 
    l_rate=0.001
    optimizer=torch.optim.Adam(model.parameters(), lr=l_rate)
    train_loader = torch.utils.data.DataLoader(dataset=MyCustomDataset(train_ids,train=True),
                                                batch_size=n_batches,
                                                shuffle=True)

    for epoch in range(1,n_epochs+1):
        gc.collect()
        train_loss=train(epoch,train_loader,optimizer,criterian)
        print('Epoch :'+str(epoch)+': Train rmse:',train_loss)

run()

Since the gradient update is usually done on the CPU while the forward and backward passes are done on the GPU, you could remove the .cuda() calls on pred and loss in your train function,

and create the optimizer before moving the model to the GPU:

Your run-function would now look like this:

def run():
    torch.cuda.empty_cache()
    n_epochs=100
    print("Running...")
    global n_batches
    n_batches=100
    global model
    model=my_model()
    l_rate=0.001
    optimizer=torch.optim.Adam(model.parameters(), lr=l_rate)

    model = model.cuda()

    model=nn.DataParallel(model,device_ids=[0,1,2])
    criterian=nn.MSELoss().cuda()
    train_ids=find_dataset(full_train_set1) # train_ids is a list
    
    train_loader = torch.utils.data.DataLoader(dataset=MyCustomDataset(train_ids,train=True),
                                                batch_size=n_batches,
                                                shuffle=True)

    for epoch in range(1,n_epochs+1):
        gc.collect()
        train_loss=train(epoch,train_loader,optimizer,criterian)
        print('Epoch :'+str(epoch)+': Train rmse:',train_loss)

This would prevent the loss function and optimizer from living on the GPU (and thus decrease GPU memory usage). However, since you run OOM on the CPU, I would first try to load only the parts of the dataset you need just-in-time instead of loading the whole (probably huge) datasets up front and caching them. With that change, your CPU RAM should not increase but stay at approximately the same level of usage.
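
A rough sketch of such a just-in-time Dataset, assuming each sample had been exported to its own file (the sample_<id>.pt layout and the key names here are made up for illustration):

import os
import torch
from torch.utils.data import Dataset

class LazyFeatureDataset(Dataset):
    def __init__(self, feature_dir, ids):
        self.feature_dir = feature_dir   # one .pt file per sample (assumed layout)
        self.ids = ids

    def __getitem__(self, index):
        my_id = self.ids[index]
        # Load only the requested sample from disk instead of caching everything in RAM.
        sample = torch.load(os.path.join(self.feature_dir, 'sample_{}.pt'.format(my_id)))
        return (sample['feature1'], sample['feature2']), sample['target']

    def __len__(self):
        return len(self.ids)

The DataLoader would then only hold the handful of samples currently being collated instead of the full feature dictionaries.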

Hi,
I included those .cuda() calls only after this problem occurred; the problem was there without them as well. My feature dictionaries are around 6 GB in total and I am running this on a system with 251 GB of RAM. How can this happen? Is there any problem with the global variables?

Can you try to run it without global variables (i.e., passing the model to the function as a parameter and loading your data inside your Dataset’s __init__)? I usually try to avoid global variables.
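
For illustration, the refactoring could look roughly like this (a sketch only; the constructor arguments and the extra model parameter for train are assumptions, not code from this thread):

import torch
from torch.utils.data import Dataset

class MyCustomDataset(Dataset):
    def __init__(self, ids, set1_path, set2_path):
        self.names = ids
        # Loaded inside the Dataset instead of via module-level globals.
        self.set1 = torch.load(set1_path)
        self.set2 = torch.load(set2_path)

    def __getitem__(self, index):
        my_id = self.names[int(index)]
        return (self.set1[my_id], self.set2[my_id]), self.set1[my_id]['target']

    def __len__(self):
        return len(self.names)


def train(epoch, model, train_loader, optimizer, criterian):
    # model is now an explicit argument instead of a global
    ...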

Passing the model to the function as a parameter does not solve the problem. I will try the DataLoader change and update the results here.

I tried loading the data inside __init__, but there was no change.

Hi,
The problem is in the DataLoader. When I simply iterate over the DataLoader, the memory increases. I don’t know why. Any thoughts?
Thanks
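
For reference, one quick way to confirm that plain iteration over the DataLoader is what grows host memory is to watch the process RSS between epochs (a sketch; assumes psutil is installed and loader is the DataLoader in question):

import gc
import psutil

process = psutil.Process()

for epoch in range(5):
    for batch in loader:      # no model, no training step, just iteration
        pass
    gc.collect()
    print('epoch {}: RSS = {:.1f} MB'.format(epoch, process.memory_info().rss / 1024**2))

If the reported RSS keeps climbing across epochs, the leak is in the Dataset/DataLoader path rather than in the training loop.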


Solved my problem perfectly!

Perfect! This solved my issue!

Thank you very much!!