Potential memory leak when summing tensors on CUDA

Hello All,

I suspect a possible memory leak when transferring tensors to the GPU, perhaps because the host-side copies of the tensors are still present.

My code is as follows -

import torch

a = torch.randn(1, 8, 512, 512).cuda()
b = torch.randn(1, 8, 512, 512).cuda()

c = a + b

PyTorch version -


When this code runs, my RAM usage climbs from ~2.5 GB to ~7.5 GB, one CPU core shows 100% utilization, and htop shows 7-8 copies of the same process.

To confirm that the addition does indeed happen on the GPU, here is the output of nvprof (run as nvprof python3 test.py); in addition, nvidia-smi shows GPU memory climbing to ~1500 MB and 8% utilization.

==10381== NVPROF is profiling process 10381, command: python3 test.py
==10381== Profiling application: python3 test.py
==10381== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   97.76%  2.6353ms         2  1.3177ms  1.3113ms  1.3240ms  [CUDA memcpy HtoD]
                    2.24%  60.447us         1  60.447us  60.447us  60.447us  void at::native::vectorized_elementwise_kernel<int=4, at::native::BinaryFunctor<float, float, float, at::native::AddFunctor<float>>, at::detail::Array<char*, int=3>>(int, float, float)
      API calls:   99.88%  3.09314s         2  1.54657s     976ns  3.09314s  cudaStreamIsCapturing
                    0.07%  2.3108ms         2  1.1554ms  1.1335ms  1.1773ms  cudaMemcpyAsync
                    0.02%  511.45us         2  255.73us  237.41us  274.04us  cudaStreamSynchronize
                    0.01%  359.07us         2  179.54us  178.82us  180.25us  cudaGetDeviceProperties
                    0.01%  293.14us         2  146.57us  130.30us  162.84us  cudaMalloc
                    0.01%  194.50us       101  1.9250us      98ns  83.414us  cuDeviceGetAttribute
                    0.00%  57.419us         1  57.419us  57.419us  57.419us  cuDeviceGetName
                    0.00%  21.633us         1  21.633us  21.633us  21.633us  cudaLaunchKernel
                    0.00%  17.040us        32     532ns     198ns  4.8480us  cudaGetDevice
                    0.00%  7.8370us         1  7.8370us  7.8370us  7.8370us  cuDeviceGetPCIBusId
                    0.00%  2.4060us         2  1.2030us     504ns  1.9020us  cudaSetDevice
                    0.00%  1.0090us         3     336ns     205ns     435ns  cudaGetLastError
                    0.00%     706ns         3     235ns     127ns     450ns  cuDeviceGetCount
                    0.00%     698ns         3     232ns     122ns     449ns  cudaGetDeviceCount
                    0.00%     608ns         2     304ns     104ns     504ns  cuDeviceGet
                    0.00%     566ns         3     188ns     120ns     320ns  cuDevicePrimaryCtxGetState
                    0.00%     281ns         1     281ns     281ns     281ns  cuDeviceTotalMem
                    0.00%     193ns         1     193ns     193ns     193ns  cuDeviceGetUuid

I have the following questions -

  1. Is there any issue in my code that causes RAM to increase by almost 5 GB? This happens both with and without nvprof. I even tried performing the addition inside with torch.no_grad(): in case PyTorch was building a graph, but to no avail.

  2. Why the high CPU usage? Is it because of assigning random values to the two tensors? Ideally, once .cuda() is called, the CPU usage should be very low, since cudaMemcpy is a DMA transfer.


  1. I guess the memory increase is caused by the NVIDIA driver, which is loaded during the first CUDA operation.

  2. You are sampling the values on the CPU first, which is why you see CPU utilization. Pass device='cuda' to the tensor factory to sample the values directly on the GPU.
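
As a minimal sketch of that suggestion, the original snippet can be rewritten so the tensors are sampled on the GPU from the start, avoiding both the host-side allocation and the two HtoD copies that nvprof reported (the CPU fallback here is only so the snippet runs on machines without a GPU):

```python
import torch

# Fall back to CPU only so the snippet runs everywhere; on your machine
# this will pick "cuda" and sample directly on the GPU.
device = "cuda" if torch.cuda.is_available() else "cpu"

# device= in the factory function replaces torch.randn(...).cuda(),
# so no host-side tensor is created and no cudaMemcpy HtoD is issued.
a = torch.randn(1, 8, 512, 512, device=device)
b = torch.randn(1, 8, 512, 512, device=device)

c = a + b

print(c.device)  # the chosen device
print(c.shape)   # torch.Size([1, 8, 512, 512])
```

With this version, a profile of the same script should no longer show the two large [CUDA memcpy HtoD] entries, and host RAM usage from the tensor data itself should drop (the driver/context overhead from the first CUDA call will still remain).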