CUDA running out of memory despite nvidia-smi saying the opposite

Hello everyone,
I recently wrote a script that uses a Mask R-CNN network to do instance segmentation over and over again.
I do the setup like this:

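import torch
import torchvision
from torchvision.models.detection import MaskRCNN_ResNet50_FPN_Weights

device = None
model = None

def init_maskRCNN():
    global device
    global model
    # Use the first GPU if CUDA is available, otherwise fall back to the CPU.
    device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
    model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights=MaskRCNN_ResNet50_FPN_Weights.DEFAULT).to(device)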

This works fine; device is set to 'cuda:0' (RTX 3060, 12 GB VRAM). I then repeatedly call a function that runs inference, which (simplified) looks like this:

def inference_maskRCNN(path):
    img =

    trans =  T.Compose([T.ToTensor()])
    img = trans(img)
    img =

    prediction = model([img])

    if (prediction[0]['scores'].numel() == 0) or (prediction[0]['scores'][0] < THRESHOLD):
        del img
        del prediction
        return [], [], []

    prediction_score = list(prediction[0]['scores'].detach().cpu().numpy())
    pred_t = [prediction_score.index(x) for x in prediction_score if x>THRESHOLD][-1]
    if len(prediction[0]['masks']) != 1:
        masks = (prediction[0]['masks']>0.5).squeeze().detach().cpu().numpy()
    else:
        masks = (prediction[0]['masks']>0.5).detach().cpu().numpy()
        masks = masks[0, :, :, :]
    prediction_class=[COCO_INSTANCE_CATEGORY_NAMES[i] for i in list(prediction[0]['labels'].cpu().numpy())]
    pred_boxes = [[(i[0], i[1]), (i[2], i[3])] for i in list(prediction[0]['boxes'].detach().cpu().numpy())]
    masks = masks[:pred_t+1]
    prediction_class = prediction_class[:pred_t+1]
    pred_boxes = pred_boxes[:pred_t+1]
    del img
    del prediction
    return masks, prediction_class, pred_boxes

Using this, I can run several hundred inferences, during which nvidia-smi shows:

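+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.76       Driver Version: 515.76       CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce …  Off    | 00000000:01:00.0  On |                  N/A |
| 30%   48C    P2    42W / 170W |   1949MiB / 12288MiB |      1%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      3093      G   /usr/lib/xorg/Xorg                187MiB |
|    0   N/A  N/A      3298      G   /usr/bin/gnome-shell               46MiB |
|    0   N/A  N/A      4686      G   …3/usr/lib/firefox/firefox        175MiB |
|    0   N/A  N/A     25349      C   python                           1535MiB |
+-----------------------------------------------------------------------------+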

(Sorry about the whitespace, I couldn't get it right.)

After some time, however, I got:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 6.68 GiB (GPU 0; 11.77 GiB total capacity; 7.90 GiB already allocated; 2.03 GiB free; 8.17 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Does anyone have an idea what I am doing wrong? I've read that del alone may not be enough, so I also call empty_cache(), yet that does not help either.

Thank you,

How about lowering the batch size?
Sorry, I am not an expert.

Thanks, but that will not help, because

  1. I feed the inputs in one by one, and
  2. as I stated, it works perfectly fine for several hundred iterations. Hence I think there must be a memory leak or incorrectly released memory.

Check how large the memory increase is in each iteration and compare it to your expected increase (e.g. caused by storing some model outputs or predictions). If this increase is too large, check if you are storing some tensors which are still attached to the computation graph and detach them before storing.
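
A minimal sketch of the check suggested above, not a definitive implementation. The helper names memory_deltas, cuda_allocated_bytes and detach_to_cpu are hypothetical; the only PyTorch calls assumed are torch.cuda.memory_allocated(), Tensor.detach() and Tensor.cpu(). The first helper records how much the allocated memory changes in each iteration, and the last one detaches every tensor in a single prediction dict before it is stored.

```python
from typing import Callable, Dict, List


def memory_deltas(step: Callable[[int], object],
                  n_iters: int,
                  read_bytes: Callable[[], int]) -> List[int]:
    """Run step(i) n_iters times; return the change in read_bytes() per iteration."""
    deltas = []
    previous = read_bytes()
    for i in range(n_iters):
        step(i)
        current = read_bytes()
        deltas.append(current - previous)
        previous = current
    return deltas


def cuda_allocated_bytes() -> int:
    """Bytes currently occupied by live tensors on the current CUDA device."""
    import torch  # imported lazily so memory_deltas itself does not need PyTorch
    return torch.cuda.memory_allocated()


def detach_to_cpu(prediction: Dict[str, object]) -> Dict[str, object]:
    """Copy one prediction dict with every tensor detached from the graph and moved to the CPU.

    Assumes each value offers .detach() and .cpu(), as torch.Tensor does.
    """
    return {key: value.detach().cpu() for key, value in prediction.items()}
```

For example, memory_deltas(lambda i: inference_maskRCNN(paths[i]), 50, cuda_allocated_bytes), with paths being a hypothetical list of image paths, should settle near zero after the first few iterations if nothing is being retained; a steady positive delta points to stored tensors. For pure inference, running the forward pass inside a torch.no_grad() block should also avoid building the computation graph in the first place.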