Single image and text input CNN with no ground truth

Hello there!
I’m very new to PyTorch and its functionalities, but I’m working on a thesis on 3D facial animation. What I need is an algorithm that takes a single image (showing which part of the face is not yet posed correctly), and possibly also the name of the controller (which is known) that controls that part of the face, and outputs two 3-dimensional vectors: one for the translation and one for the rotation of the controller. I do not have a ground-truth value to compare against, but I do have a very precise percentage of how close to perfect the animation is.
So here was my plan:
Create a CNN that takes an image and a text as input, have it predict values for the two 3-dim vectors, then use the percentage (instead of a loss function) to do the backward propagation.
Does this make any sense? And am I even correct in using a CNN for this kind of task?
I’ve also had a lot of trouble understanding CNNs, so how exactly would a text input be combined with the image? And how would such an architecture be structured?
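To make my idea a bit more concrete, here is a rough sketch of what I was picturing in PyTorch. All the layer sizes, the image resolution, and `NUM_CONTROLLERS` are just placeholders I made up; since the controller name comes from a known set, I figured I could turn it into an index and learn an embedding for it instead of processing it as free text:

```python
import torch
import torch.nn as nn

NUM_CONTROLLERS = 50  # placeholder: number of known controller names

class ControllerPoseNet(nn.Module):
    def __init__(self, num_controllers=NUM_CONTROLLERS):
        super().__init__()
        # Image branch: a small conv stack that turns the face image
        # into a feature vector.
        self.image_branch = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # -> (batch, 32, 1, 1)
            nn.Flatten(),             # -> (batch, 32)
        )
        # "Text" branch: the controller name becomes an index into a
        # learned embedding table.
        self.controller_embedding = nn.Embedding(num_controllers, 16)
        # Head: concatenate both feature vectors and predict 6 numbers
        # (3 for translation, 3 for rotation).
        self.head = nn.Sequential(
            nn.Linear(32 + 16, 64),
            nn.ReLU(),
            nn.Linear(64, 6),
        )

    def forward(self, image, controller_id):
        img_feat = self.image_branch(image)                    # (batch, 32)
        ctrl_feat = self.controller_embedding(controller_id)   # (batch, 16)
        out = self.head(torch.cat([img_feat, ctrl_feat], dim=1))
        translation, rotation = out[:, :3], out[:, 3:]
        return translation, rotation

# Example forward pass with dummy data:
model = ControllerPoseNet()
image = torch.randn(1, 3, 128, 128)   # one RGB render of the face
controller_id = torch.tensor([7])     # index of the known controller
translation, rotation = model(image, controller_id)
```

Is concatenating the image features with an embedding of the controller name a sensible way to combine the two inputs, or is there a better approach?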

Thank you so much for your time and have a great day!!