How to minimize the reserved GPU memory?

Hi, always thank you for your valuable comments.

I have a question about the GPU memory usage, especially reserved memory.

When I run an RGATConv-based model in torch_geometric, the reserved memory increases rapidly.

I defined a GPU memory monitoring function as below:

def print_gpu_usage(note=""):
    if torch.cuda.is_available():
        allocated = torch.cuda.memory_allocated() / (1024 ** 2)
        reserved = torch.cuda.memory_reserved() / (1024 ** 2)
        print(f"[GPU] {note} | Allocated: {allocated:.2f} MB | Reserved: {reserved:.2f} MB")
    else:
        print("CUDA is not available.")

And inserted the function like this:

model.train()
for data in tqdm(train_loader):
    tr_loss = []
    data = data.to(device)
    data = subgraph_sampling_per_cellline(data)
    print_gpu_usage("After loading datasets on the GPU")

    # (forward pass and loss computation omitted here)

    if tr_loss:
        scaler.scale(tr_loss_mean).backward()
        print_gpu_usage("After backward")
        scaler.step(optimizer)
        scaler.update()
        print_gpu_usage("After update")

Then, the result looks weird:

FOLD 0
Data processing...
Done
[GPU] After model init | Allocated: 4.60 MB | Reserved: 22.00 MB
Starting to train and test model
  0%|          | 0/6410 [00:00<?, ?it/s]
[GPU] After loading datasets on the GPU | Allocated: 77.64 MB | Reserved: 80.00 MB
[GPU] After forward | Allocated: 13020.38 MB | Reserved: 15252.00 MB
[GPU] After backward | Allocated: 155.22 MB | Reserved: 33714.00 MB
[GPU] After update | Allocated: 164.39 MB | Reserved: 33714.00 MB
  0%|          | 1/6410 [00:00<1:33:09,  1.15it/s]
[GPU] After loading datasets on the GPU | Allocated: 162.91 MB | Reserved: 33714.00 MB
[GPU] After forward | Allocated: 9008.47 MB | Reserved: 33714.00 MB
[GPU] After backward | Allocated: 161.62 MB | Reserved: 33714.00 MB
[GPU] After update | Allocated: 161.62 MB | Reserved: 33714.00 MB
  0%|          | 2/6410 [00:01<46:35,  2.29it/s]
[GPU] After loading datasets on the GPU | Allocated: 163.68 MB | Reserved: 33714.00 MB
[GPU] After forward | Allocated: 12093.81 MB | Reserved: 52474.00 MB
[GPU] After backward | Allocated: 158.96 MB | Reserved: 61854.00 MB
[GPU] After update | Allocated: 158.96 MB | Reserved: 61854.00 MB
  0%|          | 3/6410 [00:01<33:34,  3.18it/s]
[GPU] After loading datasets on the GPU | Allocated: 157.91 MB | Reserved: 61854.00 MB
[GPU] After forward | Allocated: 11848.42 MB | Reserved: 61858.00 MB
[GPU] After backward | Allocated: 164.30 MB | Reserved: 61858.00 MB
[GPU] After update | Allocated: 164.30 MB | Reserved: 61858.00 MB
  0%|          | 4/6410 [00:01<26:08,  4.08it/s]
[GPU] After loading datasets on the GPU | Allocated: 165.70 MB | Reserved: 61858.00 MB
[GPU] After forward | Allocated: 15850.53 MB | Reserved: 61858.00 MB
[GPU] After backward | Allocated: 163.28 MB | Reserved: 61858.00 MB
[GPU] After update | Allocated: 163.28 MB | Reserved: 61858.00 MB
  0%|          | 5/6410 [00:01<24:15,  4.40it/s]
[GPU] After loading datasets on the GPU | Allocated: 162.42 MB | Reserved: 61858.00 MB
[GPU] After forward | Allocated: 14547.31 MB | Reserved: 61860.00 MB
[GPU] After backward | Allocated: 161.62 MB | Reserved: 61862.00 MB
[GPU] After update | Allocated: 161.62 MB | Reserved: 61862.00 MB
  0%|          | 6/6410 [00:01<22:03,  4.84it/s]
[GPU] After loading datasets on the GPU | Allocated: 162.35 MB | Reserved: 61862.00 MB
[GPU] After forward | Allocated: 15718.04 MB | Reserved: 61862.00 MB
[GPU] After backward | Allocated: 166.97 MB | Reserved: 61862.00 MB
[GPU] After update | Allocated: 166.97 MB | Reserved: 61862.00 MB
  0%|          | 7/6410 [00:01<20:58,  5.09it/s]
[GPU] After loading datasets on the GPU | Allocated: 166.33 MB | Reserved: 61862.00 MB
[GPU] After forward | Allocated: 15294.10 MB | Reserved: 61862.00 MB
[GPU] After backward | Allocated: 165.38 MB | Reserved: 61864.00 MB
[GPU] After update | Allocated: 165.38 MB | Reserved: 61864.00 MB
  0%|          | 8/6410 [00:02<20:10,  5.29it/s]
[GPU] After loading datasets on the GPU | Allocated: 164.57 MB | Reserved: 61864.00 MB
[GPU] After forward | Allocated: 9452.27 MB | Reserved: 61864.00 MB
[GPU] After backward | Allocated: 164.03 MB | Reserved: 61864.00 MB
[GPU] After update | Allocated: 164.03 MB | Reserved: 61864.00 MB
  0%|          | 9/6410 [00:02<18:24,  5.79it/s]
[GPU] After loading datasets on the GPU | Allocated: 165.56 MB | Reserved: 61864.00 MB
[GPU] After forward | Allocated: 13367.78 MB | Reserved: 72252.00 MB
[GPU] After backward | Allocated: 159.69 MB | Reserved: 82640.00 MB
[GPU] After update | Allocated: 159.69 MB | Reserved: 82640.00 MB
  0%|          | 10/6410 [00:02<18:23,  5.80it/s]
[GPU] After loading datasets on the GPU | Allocated: 158.75 MB | Reserved: 82640.00 MB
[GPU] After forward | Allocated: 12570.48 MB | Reserved: 82640.00 MB
[GPU] After backward | Allocated: 168.32 MB | Reserved: 82640.00 MB
[GPU] After update | Allocated: 168.32 MB | Reserved: 82640.00 MB
  0%|          | 11/6410 [00:02<18:11,  5.86it/s]
[GPU] After loading datasets on the GPU | Allocated: 167.36 MB | Reserved: 82640.00 MB
[GPU] After forward | Allocated: 11526.41 MB | Reserved: 82640.00 MB
[GPU] After backward | Allocated: 166.10 MB | Reserved: 82640.00 MB
[GPU] After update | Allocated: 166.10 MB | Reserved: 82640.00 MB
  0%|          | 12/6410 [00:02<17:12,  6.20it/s]
[GPU] After loading datasets on the GPU | Allocated: 167.82 MB | Reserved: 82640.00 MB
[GPU] After forward | Allocated: 11734.72 MB | Reserved: 82640.00 MB
[GPU] After backward | Allocated: 168.06 MB | Reserved: 82640.00 MB
[GPU] After update | Allocated: 168.06 MB | Reserved: 82640.00 MB
  0%|          | 13/6410 [00:02<17:08,  6.22it/s]
[GPU] After loading datasets on the GPU | Allocated: 167.23 MB | Reserved: 82640.00 MB
[GPU] After forward | Allocated: 10795.68 MB | Reserved: 82640.00 MB
[GPU] After backward | Allocated: 164.24 MB | Reserved: 82640.00 MB
[GPU] After update | Allocated: 164.24 MB | Reserved: 82640.00 MB
  0%|          | 14/6410 [00:02<16:43,  6.37it/s]
[GPU] After loading datasets on the GPU | Allocated: 163.03 MB | Reserved: 82640.00 MB
[GPU] After forward | Allocated: 8074.03 MB | Reserved: 82640.00 MB
[GPU] After backward | Allocated: 162.18 MB | Reserved: 82640.00 MB
[GPU] After update | Allocated: 162.18 MB | Reserved: 82640.00 MB
  0%|          | 15/6410 [00:03<15:50,  6.73it/s]
[GPU] After loading datasets on the GPU | Allocated: 162.43 MB | Reserved: 82640.00 MB
[GPU] After forward | Allocated: 8811.66 MB | Reserved: 82640.00 MB
[GPU] After backward | Allocated: 160.45 MB | Reserved: 82640.00 MB
[GPU] After update | Allocated: 160.45 MB | Reserved: 82640.00 MB

As you can see, the reserved memory is far larger than the allocated memory.

Since I'm working with a large knowledge graph (i.e., a heterogeneous graph), managing GPU memory is a critical issue for me.

I suspect that the reserved memory is what triggers the GPU OOM problem.

How can I minimize the reserved memory? If possible, I'd like the reserved GPU memory to be managed dynamically.

Thank you for reading this question.


[Update]

When I tried PYTORCH_NO_CUDA_MEMORY_CACHING=1 as suggested here, it returned weird results, as shown below.

FOLD 0
Data processing...
Done
[GPU] After model init | Allocated: 0.00 MB | Reserved: 0.00 MB
Starting to train and test model
  0%|          | 0/6410 [00:00<?, ?it/s]
[GPU] After loading datasets on the GPU | Allocated: 0.00 MB | Reserved: 0.00 MB
[GPU] After forward | Allocated: 0.00 MB | Reserved: 0.00 MB
[GPU] After backward | Allocated: 0.00 MB | Reserved: 0.00 MB
[GPU] After update | Allocated: 0.00 MB | Reserved: 0.00 MB
  0%|          | 1/6410 [00:02<3:37:10,  2.03s/it]
[GPU] After loading datasets on the GPU | Allocated: 0.00 MB | Reserved: 0.00 MB
[GPU] After forward | Allocated: 0.00 MB | Reserved: 0.00 MB
[GPU] After backward | Allocated: 0.00 MB | Reserved: 0.00 MB
[GPU] After update | Allocated: 0.00 MB | Reserved: 0.00 MB
  0%|          | 2/6410 [00:02<2:30:08,  1.41s/it]
[GPU] After loading datasets on the GPU | Allocated: 0.00 MB | Reserved: 0.00 MB
[GPU] After forward | Allocated: 0.00 MB | Reserved: 0.00 MB
[GPU] After backward | Allocated: 0.00 MB | Reserved: 0.00 MB
[GPU] After update | Allocated: 0.00 MB | Reserved: 0.00 MB
  0%|          | 3/6410 [00:06<4:02:45,  2.27s/it]
[GPU] After loading datasets on the GPU | Allocated: 0.00 MB | Reserved: 0.00 MB
[GPU] After forward | Allocated: 0.00 MB | Reserved: 0.00 MB
[GPU] After backward | Allocated: 0.00 MB | Reserved: 0.00 MB
[GPU] After update | Allocated: 0.00 MB | Reserved: 0.00 MB
  0%|          | 4/6410 [00:07<3:11:50,  1.80s/it]
[GPU] After loading datasets on the GPU | Allocated: 0.00 MB | Reserved: 0.00 MB
[GPU] After forward | Allocated: 0.00 MB | Reserved: 0.00 MB
[GPU] After backward | Allocated: 0.00 MB | Reserved: 0.00 MB
[GPU] After update | Allocated: 0.00 MB | Reserved: 0.00 MB
  0%|          | 5/6410 [00:10<3:49:19,  2.15s/it]
[GPU] After loading datasets on the GPU | Allocated: 0.00 MB | Reserved: 0.00 MB
[GPU] After forward | Allocated: 0.00 MB | Reserved: 0.00 MB
  0%|          | 5/6410 [00:13<4:40:21,  2.63s/it]
Traceback (most recent call last):
  File "/scratch/r902a02/workspace/KG-SLomics_Revision/src_27_RGAT_neg/ForGitHub/NegativeRatio/train_weightedloss_fixmetrics_addMLP_subgraphs_weightededges_AMP.py", line 332, in <module>
    train(args)      
  File "/scratch/r902a02/workspace/KG-SLomics_Revision/src_27_RGAT_neg/ForGitHub/NegativeRatio/train_weightedloss_fixmetrics_addMLP_subgraphs_weightededges_AMP.py", line 102, in train
    scaler.scale(tr_loss_mean).backward()
  File "/scratch/r902a02/.conda/envs/notebook/lib/python3.10/site-packages/torch/_tensor.py", line 626, in backward
    torch.autograd.backward(
  File "/scratch/r902a02/.conda/envs/notebook/lib/python3.10/site-packages/torch/autograd/__init__.py", line 347, in backward
    _engine_run_backward(
  File "/scratch/r902a02/.conda/envs/notebook/lib/python3.10/site-packages/torch/autograd/graph.py", line 823, in _engine_run_backward
    return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Reserved memory is actually cached memory. Did you try torch.cuda.empty_cache()?

Thank you for your reply. Yes, I tried torch.cuda.empty_cache(), but someone raised a concern that it can slow down training (please see here). Where would you recommend inserting it?

During the forward pass, memory usage increases because all intermediate activations are stored. During the backward pass, gradients are computed, which again increases memory usage. Only after the optimization step (i.e., the update) is that memory released. However, you also need to call zero_grad() to clear the memory used for gradients; I don't see that in your code. I guess that could be one of the reasons.
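A minimal sketch of the loop shape I mean, with a toy linear model standing in for your actual model (plain CPU torch, no AMP, just to show where zero_grad() goes and what it frees):

```python
import torch

# Toy stand-in model; your code would use the RGAT-based model instead.
model = torch.nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for step in range(2):
    optimizer.zero_grad()          # clear/release old gradients first
    x = torch.randn(8, 4)
    loss = model(x).pow(2).mean()  # forward: activations are kept alive here
    loss.backward()                # backward: gradient tensors are materialized
    optimizer.step()               # update: activation memory is free after this

# With set_to_none=True (the default in recent PyTorch versions),
# zero_grad() releases the gradient tensors instead of filling them with zeros.
optimizer.zero_grad(set_to_none=True)
print(model.weight.grad)
```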

Oh, sorry. I missed some code to reduce the length of the post. Here’s the whole training code.

def train(args):
    early_stopping = EarlyStopping(patience=args.es, verbose=True)
    scaler = GradScaler('cuda')

for epoch in range(1, args.num_epochs+1):
    
    tr_losses = 0
    val_losses = 0
    tr_loss_sum = 0
    val_loss_sum = 0
    
    model.train()
    for data in tqdm(train_loader):
        tr_loss = []
        data = data.to(device)
        data = subgraph_sampling_per_cellline(data)
        print_gpu_usage("After loading datasets on the GPU")
        
        optimizer.zero_grad() #zero_grad is here
        
        # Mixed precision forward pass
        with autocast('cuda'):
            z, _, _ = model(data.x[0], data.x[1], data.n_id, data.edge_class, data.edge_mask, data.edge_type, track_attn=False)
            #data.edge_index|data.edge_mask
            print_gpu_usage("After forward")
            
            # Pre-compute weights once per batch instead of recalculating
            normalized_weights = compute_class_weights(data.edge_class)
            
            # Optimized loss calculation
            unique_classes = torch.unique(data.edge_class)
            for i in unique_classes:
                i = int(i.item())
                class_mask = (data.edge_class == i)
                
                if class_mask.sum() > 0:
                    tr_out = model.decode(z, data.edge_label_index[:, class_mask])
                    loss = criterion(tr_out, data.edge_label[class_mask].float())
                    weighted_loss = loss * normalized_weights[i]
                    tr_loss.append(weighted_loss)
            
            if tr_loss:
                tr_loss_mean = sum(tr_loss) / len(tr_loss)
        
        # Mixed precision backward pass
        if tr_loss:
            scaler.scale(tr_loss_mean).backward()
            print_gpu_usage("After backward")
            scaler.step(optimizer)
            scaler.update()
            print_gpu_usage("After update")
            torch.cuda.empty_cache()

            tr_losses += tr_loss_mean.item()
    avg_tr_loss = tr_losses/len(train_loader)
...
# validation process
        model.eval()
        with torch.no_grad():
            y_val_pred, y_val_pred_prob, y_val_true, edge_class_val = [], [], [], []
            val_loss = []
            for data in tqdm(val_loader):
                data = data.to(device)
...

If memory is the bottleneck, you can call torch.cuda.empty_cache() after each weight update (scaler.update()).


Okay. I’ll try it. Thank you so much for your rapid help!

I tried it, and the reserved GPU memory decreased! Thanks a lot!
By the way, when I enlarge the number of neighbors in LinkNeighborLoader from torch_geometric, a similar problem still appears.

(Please check the detailed information about LinkNeighborLoader here)

For example, num_neighbors is currently set to [256, 128], which means sampling 256 neighbors in the 1st hop and 128 neighbors in the 2nd hop from the source and target nodes of each edge, respectively.

To avoid the problem, I also set the batch size to 4 (i.e., the loader samples four seed edges).

But when I increased it to [1024, 512], assuming that a larger number of neighbors could improve model performance, the reserved memory again grew large enough to trigger an OOM problem.
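Some rough arithmetic of my own (an upper bound that ignores deduplication of shared neighbors) shows why this fanout change is so expensive:

```python
def max_sampled_nodes(batch_size, fanouts, seeds_per_edge=2):
    """Worst-case number of nodes sampled by a multi-hop neighbor loader:
    each seed node pulls fanouts[0] neighbors, each of those pulls
    fanouts[1] neighbors, and so on (shared neighbors counted repeatedly)."""
    frontier = batch_size * seeds_per_edge  # seed nodes from the sampled edges
    total = frontier
    for f in fanouts:
        frontier *= f   # each frontier node expands by the hop's fanout
        total += frontier
    return total

small = max_sampled_nodes(4, [256, 128])   # current setting: 264,200 nodes
large = max_sampled_nodes(4, [1024, 512])  # enlarged setting: 4,202,504 nodes
print(large / small)  # roughly a 16x larger worst-case subgraph
```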

FOLD 0
Data processing...
Done
[GPU] After model init | Allocated: 4.60 MB | Reserved: 22.00 MB
Starting to train and test model
  0%|                                                  | 0/6410 [00:00<?, ?it/s][GPU] After loading datasets on the GPU | Allocated: 98.83 MB | Reserved: 118.00 MB
[GPU] After forward | Allocated: 21705.27 MB | Reserved: 94168.00 MB
[GPU] After backward | Allocated: 173.14 MB | Reserved: 94168.00 MB
[GPU] After update | Allocated: 182.58 MB | Reserved: 94168.00 MB
  0%|                                        | 1/6410 [00:01<2:49:40,  1.59s/it][GPU] After loading datasets on the GPU | Allocated: 171.69 MB | Reserved: 540.00 MB
[GPU] After forward | Allocated: 14151.16 MB | Reserved: 53666.00 MB
[GPU] After backward | Allocated: 170.62 MB | Reserved: 53666.00 MB
[GPU] After update | Allocated: 170.62 MB | Reserved: 53666.00 MB
  0%|                                        | 2/6410 [00:01<1:33:05,  1.15it/s][GPU] After loading datasets on the GPU | Allocated: 175.92 MB | Reserved: 222.00 MB
[GPU] After forward | Allocated: 61866.08 MB | Reserved: 99010.00 MB
  0%|          | 2/6410 [00:04<3:45:31,  2.11s/it]
Traceback (most recent call last):
  File "/scratch/workspace/src_27_RGAT_neg/ForGitHub/NegativeRatio/train_weightedloss_fixmetrics_addMLP_subgraphs_weightededges_AMP.py", line 331, in <module>
    train(args)
  File "/scratch/workspace/src_27_RGAT_neg/ForGitHub/NegativeRatio/train_weightedloss_fixmetrics_addMLP_subgraphs_weightededges_AMP.py", line 100, in train
    scaler.scale(tr_loss_mean).backward()
  File "/scratch/.conda/envs/notebook/lib/python3.10/site-packages/torch/_tensor.py", line 626, in backward
    torch.autograd.backward(
  File "/scratch/.conda/envs/notebook/lib/python3.10/site-packages/torch/autograd/__init__.py", line 347, in backward
    _engine_run_backward(
  File "/scratch/.conda/envs/notebook/lib/python3.10/site-packages/torch/autograd/graph.py", line 823, in _engine_run_backward
    return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 47.45 GiB. GPU 0 has a total capacity of 139.72 GiB of which 43.45 GiB is free. Including non-PyTorch memory, this process has 96.26 GiB memory in use. Of the allocated memory 72.12 GiB is allocated by PyTorch, and 23.41 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
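The error message itself suggests one mitigation for the "reserved but unallocated" portion; for reference, the setting would be applied like this (I haven't verified yet whether it fixes my case):

```shell
# Suggested by the OOM message: let the caching allocator grow segments
# instead of reserving large fixed blocks, which reduces fragmentation.
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
python train_weightedloss_fixmetrics_addMLP_subgraphs_weightededges_AMP.py
```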

I suspect this is hard to resolve because of the size of the subgraph itself, but is there any other tip?

Sorry, I don't have experience with Graph Neural Networks. I see you are already using mixed-precision training, so you could check whether a few more memory-efficient techniques (trading off speed), such as activation checkpointing (also called activation recomputation) and gradient accumulation, can reduce the memory requirement further.
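Gradient accumulation, for instance, lets you keep the per-step batch (and thus the sampled subgraph) small while optimizing as if the batch were larger. A minimal sketch of the pattern with a toy linear model (not your actual RGAT code, and without AMP for clarity):

```python
import torch

model = torch.nn.Linear(4, 1)  # toy stand-in for the real model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
accum_steps = 4  # effective batch = accum_steps * micro-batch size

optimizer.zero_grad()
for step, x in enumerate(torch.randn(8, 2, 4)):  # 8 micro-batches of size 2
    loss = model(x).pow(2).mean() / accum_steps  # scale so gradients average
    loss.backward()                              # gradients accumulate in .grad
    if (step + 1) % accum_steps == 0:
        optimizer.step()       # update once per accum_steps micro-batches
        optimizer.zero_grad()  # release the accumulated gradients
```

With AMP you would keep the forward inside autocast, call scaler.scale(loss).backward() on each micro-batch, and move scaler.step(optimizer) and scaler.update() into the if block.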


I see. I’ll find the other strategies, as you suggested. Thanks again!

You can try distributed training if you have more than one GPU.

Sorry, but I have only one GPU.