What happens to memory when moving tensor to GPU?

I’m trying to understand what happens to both RAM and GPU memory when a tensor is sent to the GPU.

In the following code sample, I create two tensors: a large tensor arr = torch.ones((10000, 10000)) and a small tensor c = torch.ones(1). Tensor c is sent to the GPU inside the target function step, which is called by multiprocessing.Pool. In doing so, each child process uses 487 MB on the GPU and RAM usage grows to 5 GB. Note that the large tensor arr is created just once before calling Pool and is not passed as an argument to the target function. RAM usage does not explode when everything stays on the CPU.

I have the following questions on this example:

  1. I’m sending torch.ones(1) to the GPU, and yet it consumes 487 MB of GPU memory. Does CUDA allocate a minimum amount of memory on the GPU even if the underlying tensor is very small? GPU memory is not a problem for me; this is just to understand how the allocation is done (I included the measurement sketch I used right after this list).

  2. The real problem is the RAM usage. Even though I am sending only a small tensor to the GPU, it looks as if everything in memory (including the large tensor arr) is copied for every child process (possibly to pinned memory). So when a tensor is sent to the GPU, which objects are copied to pinned memory? I must be missing something here, as it does not make sense to stage everything for transfer when I’m only sending one particular object (the step variant at the end of this post is what I used to log each child’s RSS).
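
For question 1, here is the single-process sketch I used to separate the tensor’s own allocation from everything else. As far as I understand, torch.cuda.memory_allocated only counts memory held by tensors and torch.cuda.memory_reserved only counts what PyTorch’s caching allocator has grabbed, so neither includes the per-process CUDA context:

import torch

device = torch.device('cuda:0')
c = torch.ones(1).to(device)

# memory held by tensors: a few hundred bytes for this one-element tensor
print(torch.cuda.memory_allocated(device))

# memory reserved by the caching allocator: on the order of a couple of MB
print(torch.cuda.memory_reserved(device))

# nvidia-smi shows hundreds of MB for the process,
# which I assume is the CUDA context itself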

Thanks!

from multiprocessing import get_context
import time
import torch

dim = 10000
sleep_time = 2
npe = 4  # number of parallel executions

# pick the CUDA device if one is available, otherwise fall back to CPU
if torch.cuda.is_available():
    dev = 'cuda:0'
else:
    dev = 'cpu'
device = torch.device(dev)


def step(i):
    # i is unused; it is only the work item handed over by pool.map
    c = torch.ones(1)
    # comment the line below to see no memory increase
    c = c.to(device)
    # keep the child alive long enough to inspect its memory
    time.sleep(sleep_time)


if __name__ == '__main__':
    arr = torch.ones((dim, dim))

    # create list of inputs to be executed in parallel
    inp = list(range(npe))

    # sleep added before and after launching multiprocessing to monitor the memory consumption
    print('before pool')  # to check memory with top or htop
    time.sleep(sleep_time)

    context = get_context('spawn')
    with context.Pool(npe) as pool:
        print('after pool')  # to check memory with top or htop
        time.sleep(sleep_time)

        pool.map(step, inp)

    time.sleep(sleep_time)  # final pause to check memory after the pool has exited
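
For question 2, this is the variant of step I used to log each child’s resident set size (a sketch, not part of the script above; it assumes the third-party psutil package is installed):

import os
import time

import psutil  # third-party: pip install psutil
import torch


def step(i):
    c = torch.ones(1)
    rss_before = psutil.Process(os.getpid()).memory_info().rss
    # the first CUDA call in a process initializes its CUDA context
    c = c.to('cuda:0')
    rss_after = psutil.Process(os.getpid()).memory_info().rss
    print(f'child {os.getpid()}: RSS {rss_before / 2**20:.0f} MB '
          f'-> {rss_after / 2**20:.0f} MB')
    time.sleep(2)

Consistent with the comment in the script above, the RSS jump only shows up when the c.to(...) line is present.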