U-net connected to camera for live segmentation

I recently managed to train (not amazingly) a multiclass U-Net model. I was looking around GitHub but could not find a relatively straightforward example of how to feed a trained model (saved as a ‘.pt’ file) with frames from a live camera through, for instance, OpenCV. I was having a look at [DiscoGAN](https://github.com/ptrblck/DiscoGAN/blob/master/discogan/run_inference.py), but I must admit that not all of the code is entirely clear to me.

What would be ideal, just out of curiosity to start with, is to open OpenCV and feed the live frames into the model as a sort of (very naively):

```python
prediction = model(inputs)
pred = prediction.data.cpu().numpy()
```

where I guess `inputs` should be the frame currently shown in the live video, and as `pred` I should get the probability masks for the different classes.
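
Something along these lines is what I have in mind; just a very rough sketch, assuming the whole model was saved with `torch.save(model, 'model.pt')`, that it expects normalized 3-channel RGB input, and that it accepts 256x256 frames:

```python
import cv2
import torch

# Very rough sketch: assumes the whole model was saved as 'model.pt',
# takes normalized 3-channel RGB input, and accepts 256x256 frames.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = torch.load('model.pt', map_location=device)
model.eval()

cap = cv2.VideoCapture(0)  # default webcam
with torch.no_grad():
    while True:
        ret, frame = cap.read()
        if not ret:
            break
        # OpenCV returns a BGR uint8 HxWx3 array; convert to a normalized float CHW tensor
        rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        rgb = cv2.resize(rgb, (256, 256))
        inputs = torch.from_numpy(rgb).permute(2, 0, 1).float().div(255.).unsqueeze(0).to(device)

        prediction = model(inputs)                                  # [1, num_classes, H, W] logits
        pred = prediction.argmax(dim=1).squeeze(0).cpu().numpy()    # class index per pixel

        # crude visualization: spread the class indices over the grayscale range
        cv2.imshow('segmentation', (pred * 40).astype('uint8'))
        if cv2.waitKey(1) & 0xFF == ord('q'):
            break

cap.release()
cv2.destroyAllWindows()
```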

If somebody could point me toward an example or show a snippet of code, that would be amazing.
Thanks in advance!

My dummy example uses OpenCV to grab the current frame, detect faces in the image, and crop it based on the detected faces. For visualization I played around with PyGame, as it seemed to work OK and I wanted to take a look at it.
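
The frame-grabbing and face-cropping part boils down to something like this (just a rough sketch of the idea, not the exact code from the repo; it assumes OpenCV's bundled Haar cascade for frontal faces):

```python
import cv2

# Rough sketch of the frame-grab + face-crop idea; assumes OpenCV's bundled
# Haar cascade for frontal faces is available via cv2.data.haarcascades.
face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + 'haarcascade_frontalface_default.xml')

cap = cv2.VideoCapture(0)
ret, frame = cap.read()
cap.release()

if ret:
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.3, minNeighbors=5)
    # crop the frame around each detected face (x, y is the top-left corner)
    crops = [frame[y:y + h, x:x + w] for (x, y, w, h) in faces]
```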

Could you point me to the specific lines of code that are unclear?

After looking at your code in more detail it is much clearer now, and I managed to adapt it to my needs. Sorry for the bother. :)

However, I noticed that the U-Net model used from here is extremely big, so I cannot get "live segmentation" because it is too slow to process the current frame. I guess the only remaining issue is to reduce the size of the network. Is there a simple way to shrink this U-Net and check, by retraining, whether it still works decently?

Thanks in advance!

Good to hear you figured it out! :slight_smile:

The simplest way would be to reduce the number of channels by a constant factor and check if the model still trains well and gets a speedup.

Could you show me an example or code snippet, please? I could not find much and it's not completely clear to me how to proceed.

Thanks a lot!

The UNet definition you’ve linked uses a fixed number of input and output channels for each convolution block. You could simply divide all in and out channels by 2 to reduce the number of parameters.
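
As a rough sketch (a simplified block in the style of that UNet, not the exact code from the linked repo), you could introduce a width factor and scale every channel count with it:

```python
import torch.nn as nn

def double_conv(in_ch, out_ch):
    # typical UNet block: two 3x3 convolutions, each followed by a ReLU
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
        nn.ReLU(inplace=True),
    )

class SlimUNetEncoder(nn.Module):
    def __init__(self, in_channels=3, width=0.5):
        super().__init__()
        # original channel plan 64 -> 128 -> 256 -> 512 -> 1024, scaled by `width`;
        # width=0.5 halves every block and roughly quarters the parameter count
        channels = [int(c * width) for c in (64, 128, 256, 512, 1024)]
        self.blocks = nn.ModuleList()
        prev = in_channels
        for c in channels:
            self.blocks.append(double_conv(prev, c))
            prev = c
        self.pool = nn.MaxPool2d(2)

    def forward(self, x):
        features = []
        for i, block in enumerate(self.blocks):
            x = block(x)
            features.append(x)
            if i < len(self.blocks) - 1:
                x = self.pool(x)
        return features  # skip connections for the decoder
```

The decoder would be scaled by the same factor, so the channel counts of the skip connections still match.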