Linear Layer memory leak

When I run a basic linear layer, I encounter a strange memory leak. See the code below:

from pynvml import nvmlInit, nvmlDeviceGetHandleByIndex, nvmlDeviceGetMemoryInfo
import torch
import gc

def get_free_space(idx=0):
    # Free device memory as reported by the driver (NVML),
    # independent of PyTorch's caching allocator
    nvmlInit()
    h = nvmlDeviceGetHandleByIndex(idx)
    info = nvmlDeviceGetMemoryInfo(h)
    return info.free


linear_layer = torch.nn.Linear(768, 768).to("cuda:0")

with torch.no_grad():
    print(get_free_space(0))
    a_detach = torch.zeros((128, 129, 768)).to("cuda:0")
    d = linear_layer(a_detach)
    del d
    del a_detach

    gc.collect()
    torch.cuda.empty_cache()
    print(get_free_space(0))

Output:
7680950272
7639007232

The code run within the no_grad block should ONLY be creating two tensors, a_detach and d, both of which are promptly deleted. So why was an additional ~40 MB of memory lost (7680950272 - 7639007232 = 41,943,040 bytes, i.e. exactly 40 MiB)?
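For what it's worth, PyTorch's own allocator statistics suggest the memory is not held by any tensor. A quick check right after the empty_cache() call (a sketch, assuming device 0):

print(torch.cuda.memory_allocated(0))  # bytes in live tensors; should be 0 after the dels
print(torch.cuda.memory_reserved(0))   # bytes cached by the allocator; should be 0 after empty_cache()

If both report zero, the missing memory sits below PyTorch's allocator, at the driver/library level.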

Note that this leak persists even after deleting the layer itself (del linear_layer after the other dels)!

Perhaps this is memory claimed by cuBLAS (kernels, context, workspace buffers) on first use? If so, it should be a one-time cost, i.e. lazy library initialization.
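If it really is a one-time lazy init, repeating the forward pass should not cost any further memory. A quick check, reusing linear_layer and get_free_space from the snippet above:

with torch.no_grad():
    x = torch.zeros((128, 129, 768), device="cuda:0")
    print(get_free_space(0))  # before the first matmul
    linear_layer(x)           # first call pays the one-time initialization cost
    print(get_free_space(0))
    linear_layer(x)           # a second call should leave free memory unchanged
    print(get_free_space(0))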

Is there a way to trigger this initialization explicitly in advance? I need memory usage to be reproducible across runs.

Just run some linear algebra operation once to force the initialization:

a = torch.zeros(2, 2, device="cuda")
a.mm(a)

Though I don't see much point: this initialization is usually deterministic, so the one-time cost should be the same from run to run.
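If you want this as a reusable step, something like the sketch below should work (warm_up_cuda is just an illustrative name; get_free_space is the helper from the question):

import torch

def warm_up_cuda(device="cuda:0"):
    # A throwaway matmul forces CUDA context creation and the
    # lazy cuBLAS initialization; then release any cached blocks.
    a = torch.zeros(2, 2, device=device)
    a.mm(a)
    torch.cuda.synchronize()
    torch.cuda.empty_cache()

warm_up_cuda()
baseline = get_free_space(0)  # later measurements start from a stable baseline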