torch.nn.Embedding() for text2image generation

I am trying to generate an image of a license plate number using text input. For example, if an input is 04누0839, the model should return the following image.

I am using imagen-pytorch, but it does not output satisfactory images. When trained as an "unconditional Imagen" (without text embeddings), the output images are reasonable, but the input text is ignored.

When trained using text_embeddings (torch.nn.Embedding()):

word_set = [text for text in self.texts]
vocab = {tkn: i for i, tkn in enumerate(word_set)}
embedding_layer = nn.Embedding(len(vocab), 1)
self.text_embs = []
for num in tqdm(vocab.keys()):
    lookup_tensor = torch.tensor([vocab[num]], dtype=torch.long)
    embedding = embedding_layer(lookup_tensor)
    self.text_embs.append(embedding)
self.text_embeds = torch.stack(self.text_embs)

the outputs are very unsatisfactory.

I think I am using the Embedding layer improperly. Here are my questions:

  1. How do I use it correctly? Should I increase embedding_dim? (I am using 1 now.)
  2. Does manual_seed matter? (Training with and without a manual seed did not improve performance, though.)
  3. Is the embedding layer trained automatically when I train Imagen, or should I train it first on my data and then feed the resulting embeddings to Imagen?
  4. Should I split the text (e.g. 04누0839 into '0', '4', '누', …) and then create embeddings, or is it fine to create embeddings from the whole text?

An embedding layer provides a learnable set of weight vectors that turn tokens into vectors. The nice thing about vectors is that they can store information about how various words relate to one another. For example, after training, "coffee" and "water" might end up closer in the vector space than "coffee" and "hamburger".
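As a minimal sketch of that idea (the vocabulary and embedding_dim here are made up for illustration), nn.Embedding is just a learnable lookup table from integer indices to vectors:

```python
import torch
import torch.nn as nn

# Hypothetical 3-word vocabulary mapped to integer indices
vocab = {"coffee": 0, "water": 1, "hamburger": 2}

# Lookup table: one learnable 8-dimensional vector per token
embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=8)

# Look up the vector for "coffee"
idx = torch.tensor([vocab["coffee"]])
vec = embedding(idx)
print(vec.shape)  # torch.Size([1, 8])
```

Because the rows are ordinary trainable parameters, gradients flow into them during training, and related tokens can drift toward similar vectors.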

Here is a good tutorial on nn.Embedding:

One thing I noticed in your code is that you are re-indexing any repeated words in self.texts, meaning the same word may be indexed twice or more.

Additionally, the context in which you're applying nn.Embedding may be ill-suited. That's because "0" and "4", for example, have no semantic relationship with one another: you couldn't describe either in terms of liquid, solid, or gas, or as a verb or noun, etc. They have no properties beyond their role as a numerical identifier on a license plate.

Thank you, @J_Johnson! Can you please tell me how I should represent numerical identifiers (digits)?

Imagen isn't specifically trained to diffuse into license plates, so your results will likely be hit-and-miss. If you want something like that, perhaps try starting by training a Stable Diffusion model. Something like this:

Anyway, to your question: I'd train a new text encoder. With fewer than 100 characters in your alphabet, you could probably just pass in a one-hot embedding, assigning a given row vector to each letter/number, with a length equal to the number of distinct letters/numbers available.
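A sketch of that one-hot encoding (the character set below is illustrative; a real one would cover all digits and Hangul syllables that appear on plates):

```python
import torch
import torch.nn.functional as F

# Illustrative character set: 10 digits plus a few Hangul syllables.
# In practice, include every character that can appear on a plate.
charset = list("0123456789") + ["누", "가", "나"]
char_to_idx = {c: i for i, c in enumerate(charset)}

def encode_plate(text):
    # One row vector per character; row length = size of the character set
    indices = torch.tensor([char_to_idx[c] for c in text])
    return F.one_hot(indices, num_classes=len(charset)).float()

embeds = encode_plate("04누0839")
print(embeds.shape)  # torch.Size([7, 13]) — 7 characters, 13-way one-hot
```

Splitting the plate into individual characters like this also answers question 4: character-level tokens give the model a small, closed vocabulary instead of one index per unique plate string.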

The encoder should be very small, no more than 1M parameters for quicker training time. Likewise, the UNet should be small. You could likely remove a downsampling/upsampling layer or two since you probably don’t need more than HxW=64x32 pixel images as outputs. I would keep it under 20M parameters. That way you can train it on an Nvidia 3xxx or 4xxx in a day. Otherwise, you’ll need a big budget and lots of time to train a full UNet on denoising diffusion.