Deep learning model with mixed input (image and 1x3 array) and image output

I have an input image and an input RGB color (a 1x3 array), and the output is an image. I want to isolate only a region of interest (ROI) from the image and train the model to fit the ROI of the output image. How can I implement this model?
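
To make the setup concrete, here is a minimal NumPy sketch of the two pieces I think are involved. It is only an illustration: it assumes the RGB color can be tiled into three constant channels and concatenated with the image, and that the ROI is available as a binary mask of the same height and width as the output. The names `fuse_inputs`, `masked_mse`, and `roi_mask` are hypothetical and not from any library.

```python
import numpy as np


def fuse_inputs(image, rgb):
    """Append the RGB vector to the image as three constant channels.

    image: float array of shape (H, W, C); rgb: array-like of shape (3,).
    Returns an array of shape (H, W, C + 3).
    Assumption: tiling the color is one simple way to mix the two inputs.
    """
    h, w, _ = image.shape
    color = np.asarray(rgb, dtype=image.dtype)
    color_planes = np.broadcast_to(color, (h, w, 3))
    return np.concatenate([image, color_planes], axis=-1)


def masked_mse(pred, target, roi_mask):
    """Mean squared error computed over ROI pixels only.

    pred, target: arrays of shape (H, W, C).
    roi_mask: array of shape (H, W), 1 inside the ROI and 0 outside.
    """
    mask = roi_mask.astype(pred.dtype)[..., None]  # shape (H, W, 1)
    squared_error = (pred - target) ** 2 * mask    # zero outside the ROI
    n_values = mask.sum() * pred.shape[-1]         # ROI pixels times channels
    return squared_error.sum() / max(n_values, 1.0)


# Hypothetical example data: a 4x5 RGB image, one color, and a 2x2 ROI.
image = np.zeros((4, 5, 3), dtype=np.float32)
rgb = [0.1, 0.2, 0.3]
roi_mask = np.zeros((4, 5), dtype=np.float32)
roi_mask[1:3, 1:3] = 1.0

model_input = fuse_inputs(image, rgb)  # shape (4, 5, 6), fed to the network
pred = np.zeros((4, 5, 3), dtype=np.float32)   # stand-in for the network output
target = np.ones((4, 5, 3), dtype=np.float32)  # stand-in for the ground truth
loss = masked_mse(pred, target, roi_mask)      # ignores pixels outside the ROI
```

With a loss like this, pixels outside the ROI contribute nothing to the error, so the model would only be trained to match the target inside the ROI. The same idea should carry over to a deep learning framework by multiplying the per-pixel loss by the mask before reducing.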