Out of CPU Memory

I’m currently running out of CPU memory, so I would like to get some help.

I’m using about 400,000 samples of 64×64 (about 48 GB), and I have 32 GB of GPU memory.

If I train using the code below, the memory usage goes over 90%.
The code of my custom dataset is below.

class trainDataset(torch.utils.data.Dataset):
    def __init__(self, i, data_path, augmentation=True):
        self.data_path = data_path
        self.data = np.load(data_path+'image{}.npy'.format(i)).astype(np.uint16)
        self.target = np.load(data_path+'label{}.npy'.format(i)).astype(np.uint8)
        self.augmentation = augmentation
    def __getitem__(self, index):
        x = self.data[index]
        y = self.target[index]
        x, y = self.transform(x, y)
        return x, y
    def transform(self, data, target):
        data, target = data_augmentation(data, target, self.augmentation)
        return data, target
    def __len__(self):
        return len(self.data)

Following other posts, I used ConcatDataset: I create 58 datasets and concatenate them into one dataset, as shown below.

train_datasets = []
for i in range(58):
    train_datasets.append(trainDataset(i, data_path))
traindataset = torch.utils.data.ConcatDataset(train_datasets)
print("Load TrainDataset Done")

And my training code is below.
I also used torch.set_num_threads() to use multiple threads.

def fit(epoch, model, data_loader, phase='train'):
    if phase == 'train':
        model.train()
    if phase == 'valid':
        model.eval()
    running_loss = 0.0
    for batch_idx, (data, target) in enumerate(data_loader):
        inputs, target = data.to(device), target.to(device)
        with torch.set_grad_enabled(phase == 'train'):
            output = model(inputs)
            loss = criterion(output, target.long())
            if phase == 'train':
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
        running_loss += loss.item()
    if phase == 'train':
        exp_lr_scheduler.step()
    loss = running_loss / len(data_loader.dataset)
    print('{} Loss: {:.4f}'.format(phase, loss))
    return loss
init_state = copy.deepcopy(model.state_dict())
init_state_opt = copy.deepcopy(optimizer.state_dict())
init_state_lr = copy.deepcopy(exp_lr_scheduler.state_dict())

since = time.time()
train_losses = []
val_losses = []

print('train : {}, valid : {}'.format(len(trainloader.dataset), len(validloader.dataset)))
early_stopping = EarlyStopping(patience=5, verbose=1)
for epoch in range(num_epochs):
    print('Epoch {}/{}'.format(epoch, num_epochs - 1))
    print('-' * 10)
    epoch_loss = fit(epoch, model, trainloader, phase='train')
    val_epoch_loss = fit(epoch, model, validloader, phase='valid')
    train_losses.append(epoch_loss)
    val_losses.append(val_epoch_loss)
    if early_stopping.validate(val_epoch_loss):
        break

I wonder if memory is insufficient because of the size of the files. Or are there other ways to reduce memory usage?

Any help would be appreciated

How large is each .npy file and how many are you loading at once?

Thank you for the reply!

Each input .npy is less than 1 GB, about 700 MB on average.
The label files are smaller.
The total size of the training data, inputs and labels combined, is about 48 GB.

There are 58 input .npy files and 58 label .npy files in total.
Each dataset loads one input file and one label file.
A total of 58 datasets are created, and ConcatDataset combines them into one final train dataset.

Thanks for the information!
Since you are loading each .npy in the Dataset.__init__ method, ConcatDataset will hold all of the already initialized datasets and will thus use a huge amount of memory.

Would it be possible to create a loop over each `.npy` file and create a single `trainDataset` inside this loop for the current `.npy`?
This would hold a single dataset in memory at any time and should reduce the memory footprint significantly.
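A minimal, self-contained sketch of this idea, keeping the `trainDataset` signature and `image{i}.npy` / `label{i}.npy` naming from the code above (tiny dummy files stand in for the real 58 ~700 MB pairs, the training step is elided, and the float32 cast replaces the augmentation pipeline):

```python
import os
import tempfile

import numpy as np
import torch

# Tiny dummy .npy files stand in for the real 58 ~700 MB image/label pairs.
data_path = tempfile.mkdtemp() + os.sep
num_files = 3  # would be 58 in the real setup
for i in range(num_files):
    np.save(data_path + 'image{}.npy'.format(i),
            np.random.randint(0, 1000, (10, 4, 4)).astype(np.uint16))
    np.save(data_path + 'label{}.npy'.format(i),
            np.random.randint(0, 2, (10,)).astype(np.uint8))

class trainDataset(torch.utils.data.Dataset):
    def __init__(self, i, data_path):
        # Only one file pair is loaded into RAM, not all 58 at once.
        self.data = np.load(data_path + 'image{}.npy'.format(i))
        self.target = np.load(data_path + 'label{}.npy'.format(i))

    def __getitem__(self, index):
        # Cast to float32 here, since PyTorch has no uint16 dtype.
        return self.data[index].astype(np.float32), self.target[index]

    def __len__(self):
        return len(self.data)

# One epoch: loop over the files, holding a single dataset in memory.
num_batches = 0
for i in range(num_files):
    dataset = trainDataset(i, data_path)
    loader = torch.utils.data.DataLoader(dataset, batch_size=4, shuffle=True)
    for inputs, targets in loader:
        num_batches += 1  # the forward/backward pass would go here
    del dataset, loader  # release this file's arrays before loading the next
```

With 10 samples per file and `batch_size=4`, each file yields 3 batches, so peak memory stays at roughly one file's worth of data instead of the full concatenated set.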


Thanks for the idea!

So the idea is not to use ConcatDataset, but to create a single traindataset for each .npy file,
and then train on the 58 traindatasets inside the train loop as one epoch.
Is my understanding correct?

Thanks again for your help :grinning:

Yes. You won’t be able to shuffle the samples between the different datasets, but it might be OK if you shuffle inside each dataset.
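As a toy illustration of what shuffling still achieves here (the `fileDataset` class and its contents are made up for the demo): shuffling the file order each epoch plus `DataLoader(shuffle=True)` inside each file still visits every sample exactly once per epoch, just not in a globally shuffled order.

```python
import numpy as np
import torch

# Toy stand-in for one per-file dataset: every sample carries its file id
# so we can check coverage at the end.
class fileDataset(torch.utils.data.Dataset):
    def __init__(self, file_id, n=6):
        self.data = np.full((n, 2), file_id, dtype=np.float32)
        self.target = np.full((n,), file_id, dtype=np.int64)

    def __getitem__(self, index):
        return self.data[index], self.target[index]

    def __len__(self):
        return len(self.data)

num_files = 4
seen = []
# Shuffle the file order each epoch, and shuffle the samples inside each
# file via DataLoader(shuffle=True).
for file_id in np.random.permutation(num_files):
    loader = torch.utils.data.DataLoader(fileDataset(int(file_id)),
                                         batch_size=3, shuffle=True)
    for inputs, targets in loader:
        seen.extend(targets.tolist())

# Every sample is still visited exactly once per epoch.
assert sorted(seen) == sorted([i for i in range(num_files) for _ in range(6)])
```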

Thanks to this, my research area has expanded!
Thank you very much :+1: :+1: :+1:
