First, I should mention that this is our first larger-scale project, so we don't know everything yet, but we learn fast.
We developed code for image recognition. We first tried it on a Raspberry Pi 4B but quickly found that it was far too slow overall. We are currently using an NVIDIA Jetson Nano. The first recognition was OK (around 30 s) and the second was even better (around 6-7 s). The first took so long because the model had to be loaded for the first time. The image recognition can be triggered via an API, and the metadata from the AI model is returned as the response. We use FastAPI for this.
But there is a problem right now: if I load my CNN as a global variable at the top of my classification file (so it is loaded on import) and use it within a thread (actually an mp.Process, see the code at the end), I need to call mp.set_start_method('spawn'), because otherwise I get the following error:
"RuntimeError: Cannot re-initialize CUDA in forked subprocess.
To use CUDA with multiprocessing, you must use the 'spawn' start method"
Now that is of course an easy fix: just call the method above before starting my process. This does work, but another challenge appears at the same time. After setting the start method to 'spawn' the error disappears, but the Jetson starts to allocate way too much memory.
Because of the overhead and the preloaded CNN model, RAM usage is around 2.5 GB before the process starts. After the start it doesn't stop allocating RAM: it consumes all 4 GB of RAM and also the whole 6 GB of swap. Right after this, the whole API process is killed with the error "cannot allocate memory", which is to be expected.
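To narrow down at which point the memory is actually consumed, it may help to log available RAM and swap at a few places (after the import, before p.start(), inside classify()). A minimal sketch, assuming a Linux system where /proc/meminfo is readable; `parse_meminfo` and `log_memory` are hypothetical helper names and not part of our code:

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integers (kB for most fields)."""
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines that are not "Key: value" pairs
        fields = rest.split()
        if fields and fields[0].isdigit():
            info[key.strip()] = int(fields[0])
    return info


def log_memory(tag, path="/proc/meminfo"):
    """Print available RAM and free swap in MiB, labelled with `tag`."""
    with open(path) as f:
        info = parse_meminfo(f.read())
    avail = info.get("MemAvailable", 0) // 1024
    swap = info.get("SwapFree", 0) // 1024
    print(f"[{tag}] MemAvailable={avail} MiB, SwapFree={swap} MiB")
    return info
```

Calling `log_memory("after import")` at module level and `log_memory("in classify")` inside the function would show whether the memory goes away during the import in the spawned process or only once classification starts.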
I managed to fix that as well, just by loading the CNN model inside the classification function (not preloading it on the GPU as in the two cases before). However, this causes another problem: loading the model onto the GPU takes around 15-20 s, and it happens every time a recognition starts. This is not suitable for us, and we are wondering why we cannot preload the model without the whole thing being killed after two image recognitions. Our goal is to get this under 5 s.
```python
# classify.py
import time

import torch  # assumed; possibly provided by the wildcard imports in the original
import torch.multiprocessing as mp  # assumed; the original does not show where mp comes from
import torchvision.transforms as transforms
from skimage import io
from torch.utils.data import Dataset

from .loader import *
from .ResNet import *

# If this part is moved into classify(), the allocation problem does not occur
net = ResNet152(num_classes=25)
net = net.to('cuda')
save_file = torch.load("./model.pt", map_location=torch.device('cuda'))
net.load_state_dict(save_file)


def classify(imgp="", return_dict=None):
    # do some classification with the net
    pass


if __name__ == '__main__':
    mp.set_start_method('spawn')  # if commented out, the first error occurs
    manager = mp.Manager()
    return_dict = manager.dict()
    p = mp.Process(target=classify, args=('./bild.jpg', return_dict))
    p.start()
    p.join()
    print(return_dict.values())
```
Any help here will be much appreciated. Thank you.