Python 3.8 RAM overflow and loading issues

Hi community,

First, I want to mention that this is our first project on a bigger scale, so we don't know everything yet, but we learn fast.

We developed code for image recognition. We first tried it on a Raspberry Pi 4B, but quickly found that it is way too slow overall, so we are currently using an NVIDIA Jetson Nano. The first recognition was OK (around 30 s; it took that long because the model is loaded for the first time) and the second try was even better (around 6-7 s). The image recognition can be triggered via an API, and the metadata from the AI model is the response; we use FastAPI for this.
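
For context, the API side looks roughly like this (a simplified sketch, not our exact code; the endpoint name and parameters are made up):

    # api.py -- simplified sketch of how the recognition is triggered via FastAPI
    import multiprocessing as mp

    from fastapi import FastAPI

    from classify import classify

    app = FastAPI()

    @app.post("/recognize")
    def recognize(imgp: str):
        # run the classification in a separate process, as in classify.py below;
        # mp.set_start_method('spawn') is called once before the first request
        manager = mp.Manager()
        return_dict = manager.dict()
        p = mp.Process(target=classify, args=(imgp, return_dict))
        p.start()
        p.join()
        # the metadata from the AI model is the API response
        return dict(return_dict)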

But there is a problem right now: if I load my CNN as a global variable at the top of my classification file (so it is loaded on import) and use it in a separate process, I need to call mp.set_start_method('spawn'), because otherwise I get the following error:

"RuntimeError: Cannot re-initialize CUDA in forked subprocess.
To use CUDA with multiprocessing, you must use the 'spawn' start method"

Now that is of course an easy fix: just add the call above before starting my process. Indeed this works, but another challenge appears at the same time. After setting the start method to 'spawn' the error disappears, but the Jetson starts to allocate way too much memory.

Because of the overhead and the preloaded CNN model, RAM usage is around 2.5 GB before the process starts. After the start it doesn't stop allocating RAM: it consumes all 4 GB of RAM and also the whole 6 GB of swap. Right after that, the whole API process is killed with the error "cannot allocate memory", which is no surprise at that point.

I managed to fix that as well by loading the CNN model inside the classification function instead of preloading it on the GPU as in the two cases before. However, this has a problem too: loading the model onto the GPU takes around 15-20 s, and it happens every time a recognition starts. That is not acceptable for us, and we are wondering why we cannot preload the model without the whole thing getting killed after two image recognitions. Our goal is to be under 5 s.

#classify.py
    import time

    import multiprocessing as mp
    import torch
    import torchvision.transforms as transforms
    from skimage import io
    from torch.utils.data import Dataset

    from .loader import *
    from .ResNet import *

    # if this block is moved into classify(), no allocation problem occurs
    net = ResNet152(num_classes=25)
    net = net.to('cuda')
    save_file = torch.load("./model.pt", map_location=torch.device('cuda'))
    net.load_state_dict(save_file)

    def classify(imgp, return_dict):
        # do some classification with the preloaded net and
        # write the resulting metadata into return_dict
        pass

    if __name__ == '__main__':
        mp.set_start_method('spawn')  # if this line is commented out, the first error occurs
        manager = mp.Manager()
        return_dict = manager.dict()
        p = mp.Process(target=classify, args=('./bild.jpg', return_dict))
        p.start()
        p.join()
        print(return_dict.values())
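
For reference, the workaround mentioned above (loading the model inside classify() instead of at import time) looks roughly like this; it avoids the allocation problem but costs the 15-20 s model load on every call:

    def classify(imgp, return_dict):
        # loading the model here instead of at import time avoids the
        # memory explosion with 'spawn', but takes ~15-20 s on every call
        net = ResNet152(num_classes=25)
        net = net.to('cuda')
        save_file = torch.load("./model.pt", map_location=torch.device('cuda'))
        net.load_state_dict(save_file)
        # ...do some classification with the net and store the metadata in return_dict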

Any help here will be much appreciated. Thank you.

I’m not familiar with your complete use case, but why do you need to use multiprocessing?
Wouldn't it work if you initialize the model once and e.g. listen on a socket for new data in the main process?
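
Something along these lines, as a rough sketch (assuming classify() is changed to return its metadata directly instead of writing into a Manager dict):

    # app.py -- rough sketch: load the model once at import, no extra process
    import torch
    from fastapi import FastAPI

    from classify import classify  # the module-level model load happens once, here

    app = FastAPI()

    @app.post("/recognize")
    def recognize(imgp: str):
        with torch.no_grad():        # inference only, no gradients needed
            meta = classify(imgp)    # runs on the already-loaded model in this process
        return {"result": meta}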

Thanks for your response, and sorry for answering so late, but this problem is solved. Indeed, I had managed the connection between the two processes incorrectly.

Hi Martin,

Could you please elaborate on how you got rid of the RAM problem? I'm trying to run inference on a lot of image frames using the mobilenet_v2 classification model, but I keep overloading the RAM and swap even with a model as small as mobilenet_v2.

I’m using the Jetson Nano 2GB Developer Kit. Any help would be greatly appreciated.