Potential memory leak when summing tensors on CUDA

Hello All,

I suspect a possible memory leak when transferring tensors to the GPU, perhaps because the host-side copies of the tensors are still present.

My code is as follows -

import torch

a = torch.randn(1, 8, 512, 512).cuda()
b = torch.randn(1, 8, 512, 512).cuda()

c = a + b

PyTorch version -


When this code runs, my RAM usage climbs from ~2.5 GB to ~7.5 GB, one CPU core shows 100% utilization, and htop shows 7-8 copies of the same process.

To confirm that the addition does indeed happen on the GPU, here is the output of nvprof (run as nvprof python3 test.py); in addition, nvidia-smi shows GPU memory climbing to ~1500 MB and 8% utilization.

==10381== NVPROF is profiling process 10381, command: python3 test.py
==10381== Profiling application: python3 test.py
==10381== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   97.76%  2.6353ms         2  1.3177ms  1.3113ms  1.3240ms  [CUDA memcpy HtoD]
                    2.24%  60.447us         1  60.447us  60.447us  60.447us  void at::native::vectorized_elementwise_kernel<int=4, at::native::BinaryFunctor<float, float, float, at::native::AddFunctor<float>>, at::detail::Array<char*, int=3>>(int, float, float)
      API calls:   99.88%  3.09314s         2  1.54657s     976ns  3.09314s  cudaStreamIsCapturing
                    0.07%  2.3108ms         2  1.1554ms  1.1335ms  1.1773ms  cudaMemcpyAsync
                    0.02%  511.45us         2  255.73us  237.41us  274.04us  cudaStreamSynchronize
                    0.01%  359.07us         2  179.54us  178.82us  180.25us  cudaGetDeviceProperties
                    0.01%  293.14us         2  146.57us  130.30us  162.84us  cudaMalloc
                    0.01%  194.50us       101  1.9250us      98ns  83.414us  cuDeviceGetAttribute
                    0.00%  57.419us         1  57.419us  57.419us  57.419us  cuDeviceGetName
                    0.00%  21.633us         1  21.633us  21.633us  21.633us  cudaLaunchKernel
                    0.00%  17.040us        32     532ns     198ns  4.8480us  cudaGetDevice
                    0.00%  7.8370us         1  7.8370us  7.8370us  7.8370us  cuDeviceGetPCIBusId
                    0.00%  2.4060us         2  1.2030us     504ns  1.9020us  cudaSetDevice
                    0.00%  1.0090us         3     336ns     205ns     435ns  cudaGetLastError
                    0.00%     706ns         3     235ns     127ns     450ns  cuDeviceGetCount
                    0.00%     698ns         3     232ns     122ns     449ns  cudaGetDeviceCount
                    0.00%     608ns         2     304ns     104ns     504ns  cuDeviceGet
                    0.00%     566ns         3     188ns     120ns     320ns  cuDevicePrimaryCtxGetState
                    0.00%     281ns         1     281ns     281ns     281ns  cuDeviceTotalMem
                    0.00%     193ns         1     193ns     193ns     193ns  cuDeviceGetUuid

I have the following questions -

  1. Is there any issue in my code that causes RAM to increase by almost 5 GB? This happens both with and without nvprof. I even tried performing the addition inside with torch.no_grad(): in case PyTorch was building a graph, but to no avail.

  2. Why the high CPU usage? Is it because of assigning random values to the two tensors? Ideally, once .cuda() is called, the CPU usage should be very low, since cudaMemcpy is a DMA transfer.


  1. I guess the memory increase is caused by the NVIDIA driver, which is loaded during the first CUDA operation.

  2. You are sampling the values on the CPU first, which is why you see CPU utilization. Pass device='cuda' to the tensor factory to sample the values directly on the GPU.
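
As a minimal sketch of that suggestion, the original snippet can be rewritten so the tensors are sampled on the GPU from the start, avoiding both the host-side allocation and the two HtoD copies that nvprof reported (the CPU fallback here is only so the snippet runs on machines without a GPU):

```python
import torch

# Fall back to CPU only so the snippet runs everywhere; on your machine
# this will pick "cuda" and sample directly on the GPU.
device = "cuda" if torch.cuda.is_available() else "cpu"

# device= in the factory function replaces torch.randn(...).cuda(),
# so no host-side tensor is created and no cudaMemcpy HtoD is issued.
a = torch.randn(1, 8, 512, 512, device=device)
b = torch.randn(1, 8, 512, 512, device=device)

c = a + b

print(c.device)  # the chosen device
print(c.shape)   # torch.Size([1, 8, 512, 512])
```

With this version, a profile of the same script should no longer show the two large [CUDA memcpy HtoD] entries, and host RAM usage from the tensor data itself should drop (the driver/context overhead from the first CUDA call will still remain).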