Supervised joint image and text classification

I have 800 natural images, each with an associated text (a tweet). Each image-text pair also has a label (one of 10 classes). I have the following questions:

  1. Should I use a 4098d feature vector from ResNet for the images and a 300d Word2Vec feature vector for the texts?
  2. How can I feed these two numpy vectors (or tensors) to a network? Is there a minimal working architecture that does this?
  3. Do I need something like CCA (canonical correlation analysis) to model the images and text jointly?
  4. For the tweet associated with each image, do I need to compute a word2vec vector for each word and then average the vectors? Is there a better way, such as tweet2vec? Is there similar code for this in PyTorch, where I pass in a text (not a single word) and it does all the pre-processing (tf-idf, stop-word removal, stemming, etc.) and returns a vector?

Please let me know if you have any further suggestions.
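
To make questions 2 and 4 concrete, this is roughly the kind of setup I have in mind: a minimal late-fusion sketch in PyTorch that averages the word vectors of each tweet, concatenates the result with a precomputed image vector, and feeds both to a small MLP. The names `LateFusionClassifier` and `average_word_vectors`, the hidden size, the dropout rate, and the random tensors standing in for real features are all placeholders made up for illustration; the dimensions are simply the numbers from question 1.

```python
import torch
import torch.nn as nn


class LateFusionClassifier(nn.Module):
    """Hypothetical baseline: concatenate image and text features, then classify."""

    def __init__(self, img_dim=4098, txt_dim=300, hidden_dim=256, num_classes=10):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(img_dim + txt_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(0.5),  # assumed value, not tuned
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, img_feat, txt_feat):
        # Both inputs are (batch, dim); join them along the feature axis.
        fused = torch.cat([img_feat, txt_feat], dim=1)
        return self.net(fused)  # raw logits, shape (batch, num_classes)


def average_word_vectors(word_vectors):
    """Collapse a (num_words, dim) tensor of word vectors into one (dim,) vector."""
    return word_vectors.mean(dim=0)


# Random stand-ins for real features (numpy arrays could be converted
# with torch.from_numpy before this point).
batch_size = 4
img_feat = torch.randn(batch_size, 4098)
tweets = [torch.randn(n_words, 300) for n_words in (5, 12, 7, 9)]
txt_feat = torch.stack([average_word_vectors(t) for t in tweets])

model = LateFusionClassifier()
logits = model(img_feat, txt_feat)
```

Here `logits` has one row per sample and one column per class, so it could be passed directly to a cross-entropy loss. Whether plain concatenation is enough for this task, or whether something like CCA is needed, is exactly what I am unsure about.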

P.S.: My dataset classes are severely imbalanced. Some classes have only 3, 10, or 12 images, while others have about 170, 80, or 50. What can be done in this situation?
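
One option I have considered for the imbalance is weighting the loss by inverse class frequency. Below is a minimal sketch of that idea, assuming integer class labels; the helper name `inverse_frequency_weights` is made up, and the toy label list uses only the six approximate counts mentioned above rather than my full dataset.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Return {class: weight} with weight = n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {c: n_samples / (n_classes * k) for c, k in sorted(counts.items())}


# Illustrative counts only: the six approximate class sizes quoted in the post.
approx_counts = {0: 3, 1: 10, 2: 12, 3: 170, 4: 80, 5: 50}
labels = [c for c, k in approx_counts.items() for _ in range(k)]

weights = inverse_frequency_weights(labels)
for c, w in weights.items():
    print(f"class {c}: count={approx_counts[c]:3d} weight={w:.3f}")
```

As far as I understand, such weights could be converted to a tensor and passed as the `weight` argument of `torch.nn.CrossEntropyLoss`, or used to build a `torch.utils.data.WeightedRandomSampler` for oversampling the rare classes. I am not sure either is sufficient for classes with only 3 images.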