Questions about GPU memory allocation and cache

I have a question about GPU memory allocation and caching.

I check the allocated and cached GPU memory with the code below (at this point nothing has been loaded onto the GPU):

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print('Using device:', device)

if device.type == 'cuda':
    print('Memory Usage:')
    print('Allocated:', round(torch.cuda.memory_allocated(0)/1024**3,1), 'GB')
    print('Cached:   ', round(torch.cuda.memory_cached(0)/1024**3,1), 'GB')

Output:

Using device: cuda
Memory Usage:
Allocated: 0.0 GB
Cached:    0.0 GB
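
For context, the two counters printed above measure different things: "allocated" is memory held by live tensors, while "cached" is memory the PyTorch caching allocator has reserved from the driver, including blocks it is keeping for reuse. The sketch below is only an illustration under assumptions (the `snapshot` helper and the 64 MiB tensor are hypothetical, not part of the original post). It uses `torch.cuda.memory_reserved`, the name newer PyTorch releases use in place of `torch.cuda.memory_cached`.

```python
import torch

MIB = 1024 ** 2
snapshots = []  # (label, allocated bytes, reserved bytes)

def snapshot(label: str) -> None:
    """Record memory held by live tensors vs. memory held by the caching allocator."""
    snapshots.append(
        (label, torch.cuda.memory_allocated(0), torch.cuda.memory_reserved(0))
    )

if torch.cuda.is_available():
    # Hypothetical test tensor: 16 Mi float32 values = 64 MiB
    x = torch.empty(16, 1024, 1024, device="cuda:0")
    snapshot("after allocating x")
    del x  # allocated drops, but the freed block stays in the cache
    snapshot("after del x")
    torch.cuda.empty_cache()  # hand unused cached blocks back to the driver
    snapshot("after empty_cache")
else:
    print("CUDA is not available, so there is nothing to measure.")

for label, allocated, reserved in snapshots:
    print(f"{label:<20} allocated={allocated / MIB:8.1f} MiB  "
          f"reserved={reserved / MIB:8.1f} MiB")
```

On a CUDA machine, the reserved figure should stay at or above the allocated figure at every step, which is why a large "cached" number next to a small "allocated" number is not a problem in itself.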

However, when I loaded the model and the inputs (targets) onto the GPU and ran the check above at each step, I found something strange.

model =

for batch_idx , (data,target) in enumerate(data_loader):
        inputs,target =,

[Image: memory use]

As shown in the picture, during training and validation only 0.1 GB of GPU memory was allocated, and most of the memory went into the cache.

Because of this, CPU usage seems to increase abnormally.

I loaded both the model and the inputs (targets) onto the GPU. Why is the GPU allocation only 0.1 GB?
How do I increase the GPU allocation?


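If your model is relatively small, it is not abnormal for it to use only 100 MB.
What takes a lot of memory are all the intermediate results needed for gradient computation. These are freed once the backward pass finishes, so they do not show up if you check the allocated memory between iterations. You can check the max allocated memory to see the peak usage.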
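
A minimal sketch of that check, assuming a single GPU at index 0. The `report_peak_memory` helper and the small `Linear` layer standing in for a training step are hypothetical and not taken from the thread. The calls `torch.cuda.max_memory_allocated` and `torch.cuda.reset_peak_memory_stats` are the standard PyTorch ones.

```python
import torch

def gib(num_bytes: int) -> float:
    """Convert a byte count to GiB."""
    return num_bytes / 1024 ** 3

def report_peak_memory(device_index: int = 0) -> dict:
    """Return current, peak, and reserved CUDA memory in GiB (zeros without CUDA)."""
    if not torch.cuda.is_available():
        return {"allocated": 0.0, "peak_allocated": 0.0, "reserved": 0.0}
    return {
        "allocated": gib(torch.cuda.memory_allocated(device_index)),
        "peak_allocated": gib(torch.cuda.max_memory_allocated(device_index)),
        "reserved": gib(torch.cuda.memory_reserved(device_index)),
    }

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
if device.type == "cuda":
    torch.cuda.reset_peak_memory_stats()  # start the peak counter from here

# Hypothetical stand-in for one training step: forward + backward
layer = torch.nn.Linear(1024, 1024).to(device)
x = torch.randn(256, 1024, device=device)
loss = layer(x).sum()
loss.backward()

stats = report_peak_memory()
print({name: round(value, 3) for name, value in stats.items()})
```

Calling the report right after `backward()` in a real training loop should show a peak well above the figure seen between iterations, because the intermediate activations have already been released by then.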

Thanks for the reply!
As you suggested, I checked the peak usage, and it was about 8 GB.

But another question arises: the GPU is clearly being used, but the CPU usage looks odd.

Before I run the model, CPU usage is 30-40%, but when I run the model with the following code, it rises to 90-100%.

model = UNet(n_class=2)

if torch.cuda.is_available():
    model = model.cuda()

class_weights = torch.tensor([1.0, 2.0]).cuda()
criterion = nn.CrossEntropyLoss(weight=class_weights)
optimizer = optim.Adam(model.parameters(),lr=0.0005)
exp_lr_scheduler = lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)
def fit(epoch,model,data_loader,phase='train',volatile=False):
    if phase == 'train':
    if phase == 'valid':
    running_loss = 0.0
    for batch_idx , (data,target) in enumerate(data_loader):
        if is_cuda:
            inputs,target = data.cuda(),target.cuda()
        if phase == 'train':
        output = model(inputs)
        loss = criterion(output,target.long())           
        running_loss +=
        if phase == 'train':
    loss = running_loss/len(data_loader.dataset)

    print('{} Loss: {:.4f}'.format(
                phase, loss))
    return loss
init_state = copy.deepcopy(model.state_dict())
init_state_opt = copy.deepcopy(optimizer.state_dict())
init_state_lr = copy.deepcopy(exp_lr_scheduler.state_dict())

since = time.time()
train_losses = []
val_losses = []

print('train : {}, valid : {}'.format(len(trainloader.dataset), len(validloader.dataset)))
early_stopping = EarlyStopping(patience=5, verbose=1)
for epoch in range(num_epochs):
    print('Epoch {}/{}'.format(epoch, num_epochs - 1))
    print('-' * 10)
    epoch_loss = fit(epoch,model,trainloader,phase='train')
    val_epoch_loss = fit(epoch,model,validloader,phase='valid')
    if early_stopping.validate(val_epoch_loss):
time_elapsed = time.time() - since
print('Training complete in {:.0f}m {:.0f}s'.format(
    time_elapsed // 60, time_elapsed % 60))

At first I suspected the data augmentation and the size of the data, so I disabled augmentation and reduced the data size, but CPU usage is still close to 90-100%.

Can you give me some advice on CPU usage?

CPU usage of 100% (meaning one core is fully used) is expected. The CPU has to run the Python code and queue up work for the GPU. If the model is small, the GPU finishes each operation quickly, so it is expected that one CPU thread is kept fully busy.
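
To illustrate what "100% means a single core" looks like, here is a small standard-library sketch. The helper names `cpu_utilization_of` and `busy_loop` are made up for this example. It compares CPU time with wall-clock time for a loop that never waits, which is roughly how a Python loop that keeps queuing GPU work behaves.

```python
import os
import time

def cpu_utilization_of(fn) -> float:
    """Return CPU time / wall time for fn(); about 1.0 means one core kept busy."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    fn()
    cpu_used = time.process_time() - cpu_start
    wall_used = time.perf_counter() - wall_start
    return cpu_used / wall_used

def busy_loop(duration_s: float = 0.2) -> None:
    """Spin without sleeping for roughly duration_s seconds."""
    end = time.perf_counter() + duration_s
    while time.perf_counter() < end:
        pass

ratio = cpu_utilization_of(busy_loop)
cores = os.cpu_count() or 1
print(f"per-process usage: {ratio * 100:.0f}% of one core")
print(f"share of the whole machine ({cores} cores): {ratio * 100 / cores:.0f}%")
```

Tools that report per-process usage treat one saturated core as about 100%, so the same load is a much smaller share of a multi-core machine. A reading well above one core's worth would instead point at extra worker threads or processes.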


Thank you for your help.
Thanks to you, I have been able to dig deeper.
Following one of your earlier posts, I used torch.set_num_threads(2), and CPU usage has dropped significantly.

Thanks again! :blush:
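
For readers who want to try the same change, a minimal sketch follows. `torch.set_num_threads` limits the number of threads PyTorch uses for intra-op parallelism on the CPU. The value 2 is simply the one mentioned above, not a general recommendation, and the effect will depend on the machine and workload.

```python
import torch

# Threads PyTorch may use for CPU intra-op parallelism before the change
threads_before = torch.get_num_threads()
print("intra-op threads before:", threads_before)

# Cap CPU-side parallelism; call this early, before the training loop starts
torch.set_num_threads(2)

threads_after = torch.get_num_threads()
print("intra-op threads after: ", threads_after)
```

This setting only affects operations that run on the CPU. GPU kernels are unaffected, and `DataLoader` worker processes are controlled separately through its `num_workers` argument.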