CUDA memory explosion when using torch.cat or torch.stack

Whenever I use torch.stack or torch.cat on big tensors I get a CUDA RuntimeError. However, a custom JIT script that does a similar operation doesn't run into the memory issue.

Below is a simple example:

import torch
from torch.jit import script
x = torch.rand([1000, 1000, 1000]).cuda()
y = torch.rand([1000, 1000, 1000]).cuda()
z = torch.rand([1000, 1000, 1000]).cuda()

torch.stack((x, y, z))
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-8-515e22a6cd7a> in <module>
----> 1 torch.stack((x, y, z))

RuntimeError: CUDA out of memory. Tried to allocate 11.18 GiB (GPU 0; 11.93 GiB total capacity; 11.18 GiB already allocated; 354.81 MiB free; 11.18 GiB reserved in total by PyTorch)
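
For context, here is my rough back-of-the-envelope arithmetic for the sizes involved (my own numbers, they just happen to line up with the error message above):

# Rough size check -- nothing allocated here, just arithmetic:
numel = 1000 * 1000 * 1000               # elements in each input tensor
bytes_per_tensor = numel * 4             # float32 = 4 bytes per element
print(bytes_per_tensor / 1024**3)        # ~3.73 GiB per input
print(3 * bytes_per_tensor / 1024**3)    # ~11.18 GiB for x, y, z combined, already on the GPU
# torch.stack((x, y, z)) would materialise a new [3, 1000, 1000, 1000] tensor,
# i.e. another ~11.18 GiB on top of the inputs.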

Meanwhile, this reshaping, which I need for my project, runs without error when written as a JIT script:

@script
def get_grids(z_os, y_os, x_os, fs):
    empty = []
    # Walk along the diagonal in blocks of fs and stack the matching
    # sub-blocks of the three inputs along a new last dimension.
    for i in range(z_os.shape[0] // fs):
        ten_ = torch.stack((z_os[i*fs:(i+1)*fs, i*fs:(i+1)*fs, i*fs:(i+1)*fs],
                            y_os[i*fs:(i+1)*fs, i*fs:(i+1)*fs, i*fs:(i+1)*fs],
                            x_os[i*fs:(i+1)*fs, i*fs:(i+1)*fs, i*fs:(i+1)*fs]), 3)
        empty.append(ten_)
    return torch.stack(empty)
res = get_grids(x, y, z, torch.tensor(10))
res.shape
torch.Size([100, 10, 10, 10, 3])
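
For comparison, my understanding is that the output of get_grids is tiny, so each torch.stack call inside the loop only ever touches small slices (again, just my own arithmetic):

# Output of get_grids: [100, 10, 10, 10, 3] float32
out_numel = 100 * 10 * 10 * 10 * 3
print(out_numel * 4 / 1024**2)   # ~1.14 MiB in total for the stacked result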

Usually I work with much bigger tensors. What is the explanation for this behavior? And is there a better way to achieve what I want without using JIT?