Hi, thank you as always for your valuable comments.
I have a question about GPU memory usage, specifically reserved memory.
When I run an RGATConv-based model in torch_geometric, the reserved memory grows rapidly.
I defined a GPU memory reporting function as below:
def print_gpu_usage(note=""):
    if torch.cuda.is_available():
        allocated = torch.cuda.memory_allocated() / (1024 ** 2)
        reserved = torch.cuda.memory_reserved() / (1024 ** 2)
        print(f"[GPU] {note} | Allocated: {allocated:.2f} MB | Reserved: {reserved:.2f} MB")
    else:
        print("CUDA is not available.")
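A side note on the function above: reserved memory tends to track the peak allocation, so logging the peak counter alongside allocated/reserved can make the numbers easier to interpret. A minimal sketch (the import guard is only so the snippet also runs on a machine without torch or a GPU):

```python
try:
    import torch
    _HAS_CUDA = torch.cuda.is_available()
except ImportError:  # keeps the snippet runnable without torch installed
    _HAS_CUDA = False

def gpu_usage_line(note=""):
    """One-line memory report that also shows the peak allocation."""
    if not _HAS_CUDA:
        return f"[GPU] {note} | CUDA is not available."
    mb = 1024 ** 2
    return (f"[GPU] {note}"
            f" | Allocated: {torch.cuda.memory_allocated() / mb:.2f} MB"
            f" | Peak: {torch.cuda.max_memory_allocated() / mb:.2f} MB"
            f" | Reserved: {torch.cuda.memory_reserved() / mb:.2f} MB")

print(gpu_usage_line("After forward"))
# torch.cuda.reset_peak_memory_stats() restarts the peak counter, e.g. per batch
```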
And I inserted calls to it into the training loop like this:
model.train()
for data in tqdm(train_loader):
    tr_loss = ...  # forward pass / loss computation omitted here
    data = data.to(device)
    data = subgraph_sampling_per_cellline(data)
    print_gpu_usage("After loading datasets on the GPU")
    if tr_loss:
        scaler.scale(tr_loss_mean).backward()
        print_gpu_usage("After backward")
        scaler.step(optimizer)
        scaler.update()
        print_gpu_usage("After update")
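For reference, the only call I know of that hands cached-but-unused blocks back to the driver is torch.cuda.empty_cache(); here is a hedged sketch of a helper I could drop in after scaler.update() (guarded so it also runs off-GPU):

```python
try:
    import torch
    _HAS_CUDA = torch.cuda.is_available()
except ImportError:  # keeps the snippet runnable without torch installed
    _HAS_CUDA = False

def release_cached_blocks():
    """Return cached-but-unused allocator blocks to CUDA; no-op without a GPU.

    Note: this lowers the Reserved figure, but the allocator must re-request
    memory (cudaMalloc) next iteration, so it trades speed for headroom.
    """
    if _HAS_CUDA:
        torch.cuda.empty_cache()
        return True
    return False

# Usage inside the loop above, after scaler.update():
#     release_cached_blocks()
```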
However, the result looks weird:
FOLD 0
Data processing...
Done
[GPU] After model init | Allocated: 4.60 MB | Reserved: 22.00 MB
Starting to train and test model
0%| | 0/6410 [00:00<?, ?it/s]
[GPU] After loading datasets on the GPU | Allocated: 77.64 MB | Reserved: 80.00 MB
[GPU] After forward | Allocated: 13020.38 MB | Reserved: 15252.00 MB
[GPU] After backward | Allocated: 155.22 MB | Reserved: 33714.00 MB
[GPU] After update | Allocated: 164.39 MB | Reserved: 33714.00 MB
0%| | 1/6410 [00:00<1:33:09, 1.15it/s]
[GPU] After loading datasets on the GPU | Allocated: 162.91 MB | Reserved: 33714.00 MB
[GPU] After forward | Allocated: 9008.47 MB | Reserved: 33714.00 MB
[GPU] After backward | Allocated: 161.62 MB | Reserved: 33714.00 MB
[GPU] After update | Allocated: 161.62 MB | Reserved: 33714.00 MB
0%| | 2/6410 [00:01<46:35, 2.29it/s]
[GPU] After loading datasets on the GPU | Allocated: 163.68 MB | Reserved: 33714.00 MB
[GPU] After forward | Allocated: 12093.81 MB | Reserved: 52474.00 MB
[GPU] After backward | Allocated: 158.96 MB | Reserved: 61854.00 MB
[GPU] After update | Allocated: 158.96 MB | Reserved: 61854.00 MB
0%| | 3/6410 [00:01<33:34, 3.18it/s]
[GPU] After loading datasets on the GPU | Allocated: 157.91 MB | Reserved: 61854.00 MB
[GPU] After forward | Allocated: 11848.42 MB | Reserved: 61858.00 MB
[GPU] After backward | Allocated: 164.30 MB | Reserved: 61858.00 MB
[GPU] After update | Allocated: 164.30 MB | Reserved: 61858.00 MB
0%| | 4/6410 [00:01<26:08, 4.08it/s]
[GPU] After loading datasets on the GPU | Allocated: 165.70 MB | Reserved: 61858.00 MB
[GPU] After forward | Allocated: 15850.53 MB | Reserved: 61858.00 MB
[GPU] After backward | Allocated: 163.28 MB | Reserved: 61858.00 MB
[GPU] After update | Allocated: 163.28 MB | Reserved: 61858.00 MB
0%| | 5/6410 [00:01<24:15, 4.40it/s]
[GPU] After loading datasets on the GPU | Allocated: 162.42 MB | Reserved: 61858.00 MB
[GPU] After forward | Allocated: 14547.31 MB | Reserved: 61860.00 MB
[GPU] After backward | Allocated: 161.62 MB | Reserved: 61862.00 MB
[GPU] After update | Allocated: 161.62 MB | Reserved: 61862.00 MB
0%| | 6/6410 [00:01<22:03, 4.84it/s]
[GPU] After loading datasets on the GPU | Allocated: 162.35 MB | Reserved: 61862.00 MB
[GPU] After forward | Allocated: 15718.04 MB | Reserved: 61862.00 MB
[GPU] After backward | Allocated: 166.97 MB | Reserved: 61862.00 MB
[GPU] After update | Allocated: 166.97 MB | Reserved: 61862.00 MB
0%| | 7/6410 [00:01<20:58, 5.09it/s]
[GPU] After loading datasets on the GPU | Allocated: 166.33 MB | Reserved: 61862.00 MB
[GPU] After forward | Allocated: 15294.10 MB | Reserved: 61862.00 MB
[GPU] After backward | Allocated: 165.38 MB | Reserved: 61864.00 MB
[GPU] After update | Allocated: 165.38 MB | Reserved: 61864.00 MB
0%| | 8/6410 [00:02<20:10, 5.29it/s]
[GPU] After loading datasets on the GPU | Allocated: 164.57 MB | Reserved: 61864.00 MB
[GPU] After forward | Allocated: 9452.27 MB | Reserved: 61864.00 MB
[GPU] After backward | Allocated: 164.03 MB | Reserved: 61864.00 MB
[GPU] After update | Allocated: 164.03 MB | Reserved: 61864.00 MB
0%| | 9/6410 [00:02<18:24, 5.79it/s]
[GPU] After loading datasets on the GPU | Allocated: 165.56 MB | Reserved: 61864.00 MB
[GPU] After forward | Allocated: 13367.78 MB | Reserved: 72252.00 MB
[GPU] After backward | Allocated: 159.69 MB | Reserved: 82640.00 MB
[GPU] After update | Allocated: 159.69 MB | Reserved: 82640.00 MB
0%| | 10/6410 [00:02<18:23, 5.80it/s]
[GPU] After loading datasets on the GPU | Allocated: 158.75 MB | Reserved: 82640.00 MB
[GPU] After forward | Allocated: 12570.48 MB | Reserved: 82640.00 MB
[GPU] After backward | Allocated: 168.32 MB | Reserved: 82640.00 MB
[GPU] After update | Allocated: 168.32 MB | Reserved: 82640.00 MB
0%| | 11/6410 [00:02<18:11, 5.86it/s]
[GPU] After loading datasets on the GPU | Allocated: 167.36 MB | Reserved: 82640.00 MB
[GPU] After forward | Allocated: 11526.41 MB | Reserved: 82640.00 MB
[GPU] After backward | Allocated: 166.10 MB | Reserved: 82640.00 MB
[GPU] After update | Allocated: 166.10 MB | Reserved: 82640.00 MB
0%| | 12/6410 [00:02<17:12, 6.20it/s]
[GPU] After loading datasets on the GPU | Allocated: 167.82 MB | Reserved: 82640.00 MB
[GPU] After forward | Allocated: 11734.72 MB | Reserved: 82640.00 MB
[GPU] After backward | Allocated: 168.06 MB | Reserved: 82640.00 MB
[GPU] After update | Allocated: 168.06 MB | Reserved: 82640.00 MB
0%| | 13/6410 [00:02<17:08, 6.22it/s]
[GPU] After loading datasets on the GPU | Allocated: 167.23 MB | Reserved: 82640.00 MB
[GPU] After forward | Allocated: 10795.68 MB | Reserved: 82640.00 MB
[GPU] After backward | Allocated: 164.24 MB | Reserved: 82640.00 MB
[GPU] After update | Allocated: 164.24 MB | Reserved: 82640.00 MB
0%| | 14/6410 [00:02<16:43, 6.37it/s]
[GPU] After loading datasets on the GPU | Allocated: 163.03 MB | Reserved: 82640.00 MB
[GPU] After forward | Allocated: 8074.03 MB | Reserved: 82640.00 MB
[GPU] After backward | Allocated: 162.18 MB | Reserved: 82640.00 MB
[GPU] After update | Allocated: 162.18 MB | Reserved: 82640.00 MB
0%| | 15/6410 [00:03<15:50, 6.73it/s]
[GPU] After loading datasets on the GPU | Allocated: 162.43 MB | Reserved: 82640.00 MB
[GPU] After forward | Allocated: 8811.66 MB | Reserved: 82640.00 MB
[GPU] After backward | Allocated: 160.45 MB | Reserved: 82640.00 MB
[GPU] After update | Allocated: 160.45 MB | Reserved: 82640.00 MB
As you can see, the reserved memory is far larger than the allocated memory.
Since I am working with a large knowledge graph (i.e., a heterogeneous graph), keeping the GPU memory footprint down is a critical issue for me, and I suspect this reserved memory is what triggers the GPU OOM errors.
How can I minimize the reserved memory? Ideally, I would like the reserved memory to grow and shrink dynamically rather than only accumulate.
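For context, this is how I understand the caching allocator could be configured before torch initializes CUDA; expandable_segments and max_split_size_mb are documented PYTORCH_CUDA_ALLOC_CONF options, but I have not tuned them, so treat this as a sketch rather than a recommendation:

```python
import os

# Must be set before torch initializes CUDA (ideally before `import torch`).
# "expandable_segments:True" lets the allocator grow and shrink segments
# instead of holding large fixed blocks; "max_split_size_mb:<N>" is an
# alternative option that limits block splitting to reduce fragmentation.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

print(os.environ["PYTORCH_CUDA_ALLOC_CONF"])  # prints: expandable_segments:True
```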
Thank you for reading this question.
—
[Update]
When I tried PYTORCH_NO_CUDA_MEMORY_CACHING=1 (as suggested here), it returned weird results, as below.
FOLD 0
Data processing...
Done
[GPU] After model init | Allocated: 0.00 MB | Reserved: 0.00 MB
Starting to train and test model
0%| | 0/6410 [00:00<?, ?it/s][GPU] After loading datasets on the GPU | Allocated: 0.00 MB | Reserved: 0.00 MB
[GPU] After forward | Allocated: 0.00 MB | Reserved: 0.00 MB
[GPU] After backward | Allocated: 0.00 MB | Reserved: 0.00 MB
[GPU] After update | Allocated: 0.00 MB | Reserved: 0.00 MB
0%| | 1/6410 [00:02<3:37:10, 2.03s/it][GPU] After loading datasets on the GPU | Allocated: 0.00 MB | Reserved: 0.00 MB
[GPU] After forward | Allocated: 0.00 MB | Reserved: 0.00 MB
[GPU] After backward | Allocated: 0.00 MB | Reserved: 0.00 MB
[GPU] After update | Allocated: 0.00 MB | Reserved: 0.00 MB
0%| | 2/6410 [00:02<2:30:08, 1.41s/it][GPU] After loading datasets on the GPU | Allocated: 0.00 MB | Reserved: 0.00 MB
[GPU] After forward | Allocated: 0.00 MB | Reserved: 0.00 MB
[GPU] After backward | Allocated: 0.00 MB | Reserved: 0.00 MB
[GPU] After update | Allocated: 0.00 MB | Reserved: 0.00 MB
0%| | 3/6410 [00:06<4:02:45, 2.27s/it][GPU] After loading datasets on the GPU | Allocated: 0.00 MB | Reserved: 0.00 MB
[GPU] After forward | Allocated: 0.00 MB | Reserved: 0.00 MB
[GPU] After backward | Allocated: 0.00 MB | Reserved: 0.00 MB
[GPU] After update | Allocated: 0.00 MB | Reserved: 0.00 MB
0%| | 4/6410 [00:07<3:11:50, 1.80s/it][GPU] After loading datasets on the GPU | Allocated: 0.00 MB | Reserved: 0.00 MB
[GPU] After forward | Allocated: 0.00 MB | Reserved: 0.00 MB
[GPU] After backward | Allocated: 0.00 MB | Reserved: 0.00 MB
[GPU] After update | Allocated: 0.00 MB | Reserved: 0.00 MB
0%| | 5/6410 [00:10<3:49:19, 2.15s/it][GPU] After loading datasets on the GPU | Allocated: 0.00 MB | Reserved: 0.00 MB
[GPU] After forward | Allocated: 0.00 MB | Reserved: 0.00 MB
0%| | 5/6410 [00:13<4:40:21, 2.63s/it]
Traceback (most recent call last):
File "/scratch/r902a02/workspace/KG-SLomics_Revision/src_27_RGAT_neg/ForGitHub/NegativeRatio/train_weightedloss_fixmetrics_addMLP_subgraphs_weightededges_AMP.py", line 332, in <module>
train(args)
File "/scratch/r902a02/workspace/KG-SLomics_Revision/src_27_RGAT_neg/ForGitHub/NegativeRatio/train_weightedloss_fixmetrics_addMLP_subgraphs_weightededges_AMP.py", line 102, in train
scaler.scale(tr_loss_mean).backward()
File "/scratch/r902a02/.conda/envs/notebook/lib/python3.10/site-packages/torch/_tensor.py", line 626, in backward
torch.autograd.backward(
File "/scratch/r902a02/.conda/envs/notebook/lib/python3.10/site-packages/torch/autograd/__init__.py", line 347, in backward
_engine_run_backward(
File "/scratch/r902a02/.conda/envs/notebook/lib/python3.10/site-packages/torch/autograd/graph.py", line 823, in _engine_run_backward
return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
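Since the caching-allocator counters all read 0.00 MB once PYTORCH_NO_CUDA_MEMORY_CACHING=1 is set, I assume those statistics simply are not tracked in that mode, so to still see real device usage I would query the driver directly. A sketch using torch.cuda.mem_get_info (which wraps cudaMemGetInfo), guarded so it also runs without a GPU:

```python
try:
    import torch
    _HAS_CUDA = torch.cuda.is_available()
except ImportError:  # keeps the snippet runnable without torch installed
    _HAS_CUDA = False

def device_free_total_mb():
    """Free/total device memory straight from the driver, in MB.

    Unlike memory_allocated()/memory_reserved(), these numbers come from
    cudaMemGetInfo, so they stay meaningful even when the caching allocator
    is disabled. Returns None when no CUDA device is available.
    """
    if not _HAS_CUDA:
        return None
    free_b, total_b = torch.cuda.mem_get_info()
    mb = 1024 ** 2
    return free_b / mb, total_b / mb

print(device_free_total_mb())
```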