Trying to understand out of memory error at second epoch

Hey,
Here I try to describe my out-of-memory error in as much detail as possible.
I am currently trying to implement a large scale super-resolution model that incorporates residual blocks and a subpixel upscaling layer.
My model has 42 million parameters and operates on an input image of size (3, 180, 320) with a batch size of 1, since the model is so large that only a single image fits through at once. Evaluation works fine with this model, but training produces an out of memory error during the second epoch, i.e. when the second image is passed through the network. I am trying to understand why exactly this happens; I have tried to check which tensors remain in memory, but to no avail.
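As a quick sanity check on the model size (assuming 4-byte float32 weights, which is an assumption on my part), the parameters alone already account for roughly the ~168 MB I see allocated before the first epoch:

#back-of-the-envelope estimate of the parameter memory, assuming float32
num_params = 42_000_000                  #approximate parameter count of my model
param_mb = num_params * 4 / 1000 / 1000  #4 bytes per float32 weight
print(param_mb)                          #~168 MB, matching the allocation before epoch 1
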
I won’t post the whole network code since it is long and complex, and I don’t think the error lies there (correct me if I’m wrong and I can post it as well).
Here is the training loop:

while True:
    #load_batch loads an image from a video; in this case it returns a list of a list
    #containing a single numpy array of shape (1080, 1920, 3)
    img = video_loader.load_batch(1, 1, 1, rpn_model.num_input_imgs, 0.0, 0.0, seed=19)

    #this is the training function; the name comes from this script being
    #a test of how large the model can be during training
    test_scale(img, rpn_model, downscale_factor, bicubic_scale_factor, net_scale_factor, optimizer)

and here is the content of the training function (I tried to comment it as clearly as possible; if you have questions, please ask):

def test_scale(img, rpn_model, downscale_factor, bicubic_scale_factor, net_scale_factor, optimizer):
    #img is a full hd, rgb color image as a numpy array with shape (1080, 1920, 3)
    #rpn_model is a 43M parameter resnet subpixel super resolution model
    #downscale_factor = 2, bicubic_scale_factor = 1.5, net_scale_factor = 3, optimizer = Adam
    print(torch.cuda.memory_allocated() / 1000 / 1000)
    print(torch.cuda.max_memory_allocated() / 1000 / 1000)
    print(torch.cuda.memory_reserved() / 1000 / 1000)
    print(torch.cuda.max_memory_reserved() / 1000 / 1000)
    print(torch.cuda.memory_summary())

    optimizer.zero_grad()
    #crop the input image by reducing height and width by a factor of 2
    new_width = int(img[0][0].shape[1] / downscale_factor)
    new_height = int(img[0][0].shape[0] / downscale_factor)
    offset_x = int((img[0][0].shape[1] - new_width) / 2)
    offset_y = int((img[0][0].shape[0] - new_height) / 2)
    img_crop = img[0][0][offset_y:offset_y + new_height,
                         offset_x:offset_x + new_width,
                         :]

    #convert numpy array to tensor, this is the target HR image, shape is (1, 3, 540, 960)
    img_crop_tens = tensorfy_img(img_crop,img_crop.shape[1], img_crop.shape[0])

    #scale down image and scale it up again to introduce information loss
    smallest_img = cv2.resize(img_crop,
                              dsize=(int(img_crop.shape[1] / net_scale_factor / bicubic_scale_factor),
                                     int(img_crop.shape[0] / net_scale_factor / bicubic_scale_factor)))

    input_img = cv2.resize(img_crop, dsize=(int(img_crop.shape[1] / net_scale_factor),
                                            int(img_crop.shape[0] / net_scale_factor)))

    #convert input image to tensor, shape is (1, 3, 180, 320)
    input_img = tensorfy_img(input_img, input_img.shape[1], input_img.shape[0])

    #create model output, output is single tensor
    model_output = rpn_model(input_img)

    #Simple mean absolute error, function inputs are swapped compared to torch loss order
    loss = network.SimpleRLoss(img_crop_tens, model_output)
    print("Loss during batch training:", loss.item())
    loss.backward()
    optimizer.step()

    #these deletions and cache empty calls do not fix the error, even when detaching all tensors beforehand
    #del img_crop_tens
    #del input_img
    #del model_output
    #del loss
    #del rpn_model
    #del img_crop, smallest_img, img
    #del optimizer
    #torch.cuda.empty_cache()

    #this call shows that the model weights (or gradients?) are apparently not cleared from memory
    #dump_tensors()

As for the helper functions in this script, here is the numpy-array-to-tensor function:

def tensorfy_img(img, width, height):
    img_tensor = torch.from_numpy(np.transpose(img, (2,0,1))).float().to(device=comp_device)
    img_tensor = img_tensor.reshape(1,3,height, width)
    return img_tensor
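
(Just for reference, I believe the same conversion could also be written with permute and unsqueeze; tensorfy_img_alt below is only a sketch and comp_device is the same global device object used above.)

def tensorfy_img_alt(img):
    #(H, W, 3) numpy array -> (1, 3, H, W) float tensor on the GPU
    t = torch.from_numpy(img).float().permute(2, 0, 1)   #channels first
    return t.unsqueeze(0).to(device=comp_device)         #add batch dimension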

and the loss function

def SimpleRLoss(in_hr, out_hr):
    #recon_loss = nn.MSELoss()
    mae_loss = nn.L1Loss()
    loss = mae_loss(out_hr, in_hr)
    return loss

and the dump tensors function, taken from another forum post (to which I sadly no longer have the link, sorry):

def dump_tensors(gpu_only=True):
    """Prints a list of the Tensors being tracked by the garbage collector."""
    import gc
    total_size = 0
    for obj in gc.get_objects():
        try:
            if torch.is_tensor(obj):
                if not gpu_only or obj.is_cuda:
                    print("%s:%s%s %s" % (type(obj).__name__,
                                          " GPU" if obj.is_cuda else "",
                                          " pinned" if obj.is_pinned() else "",
                                          pretty_size(obj.size())))
                    total_size += obj.numel()
            elif hasattr(obj, "data") and torch.is_tensor(obj.data):
                if not gpu_only or obj.is_cuda:
                    print("%s → %s:%s%s%s%s %s" % (type(obj).__name__,
                                                   type(obj.data).__name__,
                                                   " GPU" if obj.is_cuda else "",
                                                   " pinned" if obj.data.is_pinned() else "",
                                                   " grad" if obj.requires_grad else "",
                                                   " volatile" if obj.volatile else "",
                                                   pretty_size(obj.data.size())))
                    total_size += obj.data.numel()
        except Exception:
            pass
    print("Total size:", total_size)

The individual elements work fine in another script (with a slightly different arrangement) for a smaller network, even over a large number of epochs. This makes me wonder whether there is some base level of memory allocation that cannot be cleared and that would require me to downsize my network.
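For completeness, this is roughly the kind of before/after check I am running between epochs (just a sketch, not the exact script; it assumes img, rpn_model, optimizer and the scale factors are already set up as above):

#rough before/after check around a single training step (sketch)
torch.cuda.reset_peak_memory_stats()
print("allocated after init:", torch.cuda.memory_allocated() / 1000 / 1000, "MB")
test_scale(img, rpn_model, downscale_factor, bicubic_scale_factor, net_scale_factor, optimizer)
torch.cuda.empty_cache()
print("allocated after one step:", torch.cuda.memory_allocated() / 1000 / 1000, "MB")
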
Here are some memory stats before the first epoch:

168.154624 #memory allocated in MB
168.154624 #max memory allocated in MB
190.840832 #memory reserved in MB
190.840832 #max memory reserved in MB
|===========================================================================|
|                  PyTorch CUDA memory summary, device ID 0                 |
|---------------------------------------------------------------------------|
|            CUDA OOMs: 0            |        cudaMalloc retries: 0         |
|===========================================================================|
|        Metric         | Cur Usage  | Peak Usage | Tot Alloc  | Tot Freed  |
|---------------------------------------------------------------------------|
| Allocated memory      |  164213 KB |  164213 KB |  164213 KB |       0 B  |
|       from large pool |  163872 KB |  163872 KB |  163872 KB |       0 B  |
|       from small pool |     341 KB |     341 KB |     341 KB |       0 B  |
|---------------------------------------------------------------------------|

Everything looks fine so far.
Here are the same stats before the second epoch:

673.896448 #memory allocated in MB
6103.84896 #max memory allocated in MB
6316.621824 #memory reserved in MB
6320.8161279999995 #max memory reserved in MB
|===========================================================================|
|                  PyTorch CUDA memory summary, device ID 0                 |
|---------------------------------------------------------------------------|
|            CUDA OOMs: 3            |        cudaMalloc retries: 3         |
|===========================================================================|
|        Metric         | Cur Usage  | Peak Usage | Tot Alloc  | Tot Freed  |
|---------------------------------------------------------------------------|
| Allocated memory      |  658102 KB |    5821 MB |   22377 MB |   21734 MB |
|       from large pool |  656736 KB |    5819 MB |   22373 MB |   21732 MB |
|       from small pool |    1366 KB |       2 MB |       3 MB |       2 MB |
|---------------------------------------------------------------------------|

As you can see, there is now approx. 660 MB of memory still allocated even after deleting all variables/tensors and emptying the cache. And when I dump the tensors I can see that there are still some weights/gradients(?) in memory:

Tensor: GPU pinned 1 × 3 × 540 × 960
Tensor: GPU pinned 1 × 3 × 180 × 320
Tensor: GPU pinned 1 × 3 × 540 × 960
Tensor: GPU pinned
Tensor: GPU pinned 256 × 3 × 9 × 9
Tensor: GPU pinned 256 × 3 × 9 × 9
Tensor: GPU pinned 256
Tensor: GPU pinned 256
Tensor: GPU pinned 256 × 256 × 3 × 3
Tensor: GPU pinned 256 × 256 × 3 × 3
Tensor: ...
Parameter: GPU pinned 256 × 3 × 9 × 9
Parameter: GPU pinned 256
Parameter: GPU pinned 256 × 256 × 3 × 3
Parameter: GPU pinned 256
Parameter: GPU pinned 256 × 256 × 3 × 3
Parameter: GPU pinned 256
Parameter: ...

So it seems that there is some leftover where there shouldn’t be one, and those roughly 500 MB are exactly the memory I am missing to train this network.
I have read a lot of posts of this kind and tried to debug further, but I am at the limit of my knowledge. Is this standard behaviour and I just need to downsize my model, or am I doing something wrong?
Any help is appreciated! Thank you.

So I looked into this a bit more and found some interesting stuff:

  1. With my 40M parameter model, the memory used increases from approx. 160 MB to approx. 640 MB, i.e. a factor of 4.
  2. The factor of 4 is the same for a model with half the parameters; only the absolute amounts are halved, i.e. approx. 85 MB to 340 MB.
  3. Using SGD instead of Adam decreases this factor to 2.
  4. Dumping the tensors (using Adam) before the first epoch shows about 40M elements, all of them parameters. Dumping the tensors after one epoch shows about 120M elements, consisting of parameters and tensors.
  5. Doing the same with SGD shows only about 40M elements (parameters only) during all epochs; however, memory still increases by a factor of 2 as mentioned above.

This leads me to believe that there is something going on with model initialization that I don’t understand. I suspect that the model itself needs 320 MB of memory, and that Adam needs another 320 MB to keep track of its adaptive learning-rate statistics for the weights and biases. But this doesn’t explain why the model needs only 160 MB after initialization but before training, yet 320 MB after training. What exactly is being held in memory when I delete everything else? Maybe it’s the gradients? But aren’t those saved in the optimizer? And wouldn’t the gradients also need 320 MB?
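To narrow this down further, the next thing I plan to check right after optimizer.step() is how much memory the .grad buffers and the optimizer state actually hold (just a sketch; it assumes rpn_model and optimizer are still in scope and float32, i.e. 4 bytes per element):

#count the elements held by parameters, their .grad buffers, and the optimizer state
param_elems = sum(p.numel() for p in rpn_model.parameters())
grad_elems = sum(p.grad.numel() for p in rpn_model.parameters() if p.grad is not None)
state_elems = sum(t.numel() for state in optimizer.state.values()
                  for t in state.values() if torch.is_tensor(t))
print("params   :", param_elems * 4 / 1000 / 1000, "MB")
print("grads    :", grad_elems * 4 / 1000 / 1000, "MB")
print("opt state:", state_elems * 4 / 1000 / 1000, "MB")
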
I do not believe that there is a bug somewhere; I assume all of this is normal behaviour, but I would like to understand what exactly is happening here. Could anyone maybe explain?