Big matrix multiplication on GPU eats up my CPU memory

I am doing a correspondence calculation between two matrices. One of these matrices has a fixed shape of [256, 5000], and the other has a variable shape of [256, n], where n is usually smaller than 100.

Both matrices are on the GPU and I do the calculation as follows:

# mat1, mat2 are on cuda
correspondence_mat = mat1.unsqueeze(-1) * mat2.unsqueeze(1)

This results in a matrix of shape [256, 5000, n].
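For reference, here is a minimal shape check of this broadcast (a sketch with random data standing in for my real matrices, using n = 80 as an example):

import torch

mat1 = torch.randn(256, 5000, device='cuda')     # fixed-shape matrix
mat2 = torch.randn(256, 80, device='cuda')       # variable matrix, n = 80 here
out = mat1.unsqueeze(-1) * mat2.unsqueeze(1)     # [256, 5000, 1] * [256, 1, 80]
print(out.shape)                                 # torch.Size([256, 5000, 80])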

The strange problem I am having is that every time n increases to a larger value than ever before, there is a big jump in CPU RAM usage. This memory also does not seem to be released anymore, so it looks like it is leaking. It even freezes my computer once memory usage reaches 100%.

My question now is: what is happening here?
1.) Since both matrices are on the GPU, why does this affect my CPU RAM at all?
2.) Why is the memory not released again after the multiplication is finished?
3.) Could this be a bug/memory leak in PyTorch?

My PyTorch version is 1.4 and I am using it with Python 2.7. Unfortunately there is no way I can upgrade this version.

I cannot reproduce this issue using:

import torch
import psutil

# fixed-shape tensor on the GPU
a = torch.randn(256, 5000, 1, device='cuda')

for epoch in range(10):
    for n in range(1, 100):
        # variable-shape tensor, n grows each iteration
        b = torch.randn(256, 1, n, device='cuda')
        # broadcasted multiplication -> [256, 5000, n]
        out = a * b
        print('Epoch {} Iter {}'.format(epoch, n))
        # system-wide used host memory in bytes
        print(psutil.virtual_memory().used)
        print('GPU allocated mem {:.3f}MB'.format(torch.cuda.memory_allocated()/1024**2))

and get:

Epoch 0 Iter 1
16522018816
GPU allocated mem 9.767MB
Epoch 0 Iter 2
16522534912
GPU allocated mem 15.119MB
[...]
Epoch 0 Iter 98
16523194368
GPU allocated mem 483.494MB
Epoch 0 Iter 99
16523194368
Epoch 1 Iter 1
16523194368
GPU allocated mem 9.767MB
Epoch 1 Iter 2
16523194368
GPU allocated mem 15.119MB
[...]
Epoch 9 Iter 98
16523194368
GPU allocated mem 483.494MB
Epoch 9 Iter 99
16523194368

This shows an overall host memory increase of (16523194368 - 16522018816) / 1024**2 ~= 1.12MB.
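If you want to narrow it down on your side, measuring the resident memory of the Python process itself is usually more telling than the system-wide number. A sketch, assuming psutil is available in your environment:

import psutil
import torch

proc = psutil.Process()                  # the current Python process
rss_before = proc.memory_info().rss      # resident host memory in bytes

a = torch.randn(256, 5000, 1, device='cuda')
b = torch.randn(256, 1, 99, device='cuda')
out = a * b                              # [256, 5000, 99] on the GPU
torch.cuda.synchronize()                 # wait for the kernel to finish

rss_after = proc.memory_info().rss
print('Host RSS delta: {:.3f}MB'.format((rss_after - rss_before) / 1024.**2))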

EDIT: I would generally recommend updating PyTorch to the latest release, especially since you are also using the deprecated Python 2.7.

Thank you for the quick reply.

I use a Jetson AGX Xavier with CUDA 10.0 in combination with ROS, which unfortunately limits me to this configuration, including Python 2.7. If I have time, I might build a newer PyTorch version from source and try to set up ROS with Python 3. However, I will first investigate whether this is due to the PyTorch version or to Python 2.7 itself and report back.

For now I have worked around this leak(?) by splitting up the multiplication in a for loop, using a maximum size of 30 for n in each step. With this, the RAM no longer builds up until it overflows. See the sketch below.
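Roughly what this workaround looks like (a sketch with random data standing in for my real matrices; torch.split takes care of the last, possibly smaller, chunk):

import torch

mat1 = torch.randn(256, 5000, device='cuda')   # fixed-shape matrix
mat2 = torch.randn(256, 80, device='cuda')     # variable matrix, n = 80 here

chunks = []
for mat2_chunk in torch.split(mat2, 30, dim=1):                  # slices with at most 30 columns
    chunks.append(mat1.unsqueeze(-1) * mat2_chunk.unsqueeze(1))  # [256, 5000, <=30]
correspondence_mat = torch.cat(chunks, dim=-1)                   # [256, 5000, 80]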