Recovering token ids from normalized input?

Trying to figure out conceptually what is wrong here. I have a flow that does the following:

Text → Produce Token Ids → Normalize Ids → AutoEncoder → Calculate CosineEmbeddingLoss.

This process seems to work and ultimately completes the task, but I cannot reproduce any of the inputs: the token ids are normalized, so tokenizer.decode() no longer works on them. Is there a better way to do this?

Relevant code:

class AE(nn.Module):
  def __init__(self):
    super().__init__()
    self.encoder = torch.nn.Sequential(
      torch.nn.Linear(512, 512), # Input is in the format (Batch x 512)
      torch.nn.Linear(512, 256),
    )
    self.decoder = torch.nn.Sequential(
      torch.nn.Linear(256, 512),
      torch.nn.Linear(512, 512),
    )

  def forward(self, x):
    x = self.encoder(x)
    x = self.decoder(x)
    return x

And the training step:

  def training_step(self, batch, batch_idx):
    x = batch
    x_hat = self(x)  # forward pass through the AutoEncoder
    loss_fn = nn.CosineEmbeddingLoss()
    # target of +1 for every sample: the reconstruction should point the same way as the input
    loss = loss_fn(x_hat, x, torch.ones(x.size(0), device=x.device))
    return loss

I was thinking of applying F.normalize in the encoder, but again I am not sure how to undo that transform with the decoder, or how I would emit outputs. Or do I need to swap nn.Sigmoid with nn.ReLU? (Cosine similarity seems scaling sensitive, so I'm not sure if I'd also need to swap my loss.)

Could you please elaborate a bit on what happens in the Normalize Ids step?

Sure. I am using the RobertaTokenizer from HuggingFace. From my text examples I will load the input_ids via dataset['train'] = tokenizer(dataset['train'], padding='max_length', truncation=True).input_ids and then normalize that data.

Ex: dataset['train'] = F.normalize(dataset['train'].float()). I've tried mapping the input_ids to their float equivalents after normalization, but this does not work because the Sigmoid can output any arbitrary float, not just the floats that correspond to normalized input_ids.

Additionally, if I try using ReLU as my final output activation, I find the cosine scores look OK but the outputs are small values (0.6…) rather than values in (0, 32100), which is the range of my input ids.
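
To make the step concrete, it looks roughly like this when simplified down to one example (the 'roberta-base' checkpoint and the toy string are just stand-ins for my real setup):

import torch.nn.functional as F
from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained('roberta-base')  # placeholder checkpoint

texts = ["def add(a, b): return a + b"]  # stand-in for my real dataset
ids = tokenizer(texts, padding='max_length', truncation=True, return_tensors='pt').input_ids  # (1, 512), dtype long

normalized = F.normalize(ids.float())  # row-wise L2 normalization, every value now lives in (0, 1)

# and this is where decoding breaks: decode() wants integer ids, not normalized floats
# tokenizer.decode(normalized[0])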

I am not sure about your use case.
But if you want to reconstruct the original values from the normalized values, you can just multiply the normalized values by the norm.
i.e., store torch.norm(dataset['train'].float()) and later multiply by it.
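
Roughly like this (a sketch, assuming you kept the default dim=1 in F.normalize and that dataset['train'] is an (N, 512) tensor of ids, so there is one norm per row rather than a single scalar):

import torch
import torch.nn.functional as F

ids = dataset['train'].float()                        # (N, 512) token ids as floats
norms = torch.linalg.norm(ids, dim=1, keepdim=True)   # one L2 norm per row, shape (N, 1) -- keep this around
normalized = F.normalize(ids, dim=1)                  # what you feed to the AutoEncoder

recovered = (normalized * norms).round().long()       # undo the scaling and snap back to integer ids
# tokenizer.decode(recovered[0], skip_special_tokens=True) should give the original text back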

That almost did the trick!

I am trying to do token reconstruction with an AutoEncoder. So the task is: given my model's output, minimize the cosine distance between the generated and the real text. The problem when I multiply by the norm is that the predicted values are sometimes greater than my maximum vocab size (ex: 32100). It also seems that if I try to validate this by recovering my offset, it's slightly lossy in that 4-5 tokens might get mangled (I'm guessing due to the long → float conversion).


test_norm = torch.linalg.norm(tokenized_test['code'].float(), dim=1)  # per-row norms, saved before normalizing
tokenized_test = F.normalize(tokenized_test['code'].float())
decode_text = test_example[0] * test_norm[0].to(device) # where test_example[0] is my first input_id sequence

This works well and gets me 99% of my tokenized_test string back. However, if I make a prediction, it's unclear which norm I'd use. Would I use the norm of the row that was used as the example input? Or the mean norm of test/train?

My prediction output has a range of (0, 1) (I now squash the last output with Sigmoid instead of ReLU, because I realized this would otherwise cause numeric instability since I've normalized). When I look at the two tensors I see the following:

# Test Tensor
tensor(0., device='cuda:0')
tensor(0.2775, device='cuda:0')
tensor(0.0109, device='cuda:0')

# Prediction Tensor
tensor(0., device='cuda:0', grad_fn=<UnbindBackward0>)
tensor(0.9348, device='cuda:0', grad_fn=<UnbindBackward0>)
tensor(0.0211, device='cuda:0', grad_fn=<MeanBackward0>)

So if I scale this tensor by test_example[0]'s norm I will get values well above my 32100 vocab size (91774...).

I imagine I am just doing something wrong with my translation of the data, and that perhaps I should use an encoding that is in the range (0, 1) rather than (0, 32100)?

When dealing with words (and their tokens) and similar discrete-value scenarios, the architectures I have come across mostly use a softmax final layer, i.e., the prediction ends up as a K-way classifier, where K = the number of words/possible discrete values, trained with cross-entropy loss.
Is there any reason you are not following such an approach?
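
A rough, untested sketch of what I mean, using your numbers (512-token sequences, a vocab of 32100); the embedding size, hidden widths, and layer layout are just placeholders:

import torch
import torch.nn as nn

VOCAB = 32100    # size of the tokenizer vocabulary
SEQ_LEN = 512    # padded sequence length
EMB = 128        # placeholder embedding size
HID = 64         # placeholder per-position decoder width

class ClassifierAE(nn.Module):
  def __init__(self):
    super().__init__()
    self.embed = nn.Embedding(VOCAB, EMB)            # learn a vector per token id instead of normalizing the ids
    self.encoder = nn.Sequential(
      nn.Linear(SEQ_LEN * EMB, 512),
      nn.ReLU(),
      nn.Linear(512, 256),                           # whole-sequence bottleneck, like your current model
    )
    self.decoder = nn.Sequential(
      nn.Linear(256, SEQ_LEN * HID),
      nn.ReLU(),
    )
    self.to_vocab = nn.Linear(HID, VOCAB)            # shared per-position projection to vocab logits

  def forward(self, ids):                            # ids: (batch, 512), dtype long
    x = self.embed(ids).flatten(1)                   # (batch, 512 * EMB)
    h = self.decoder(self.encoder(x))                # (batch, 512 * HID)
    return self.to_vocab(h.view(-1, SEQ_LEN, HID))   # (batch, 512, 32100) raw logits

model = ClassifierAE()
ids = torch.randint(0, VOCAB, (4, SEQ_LEN))          # stand-in batch; in your case tokenizer(...).input_ids

logits = model(ids)
# CrossEntropyLoss wants logits of shape (batch, classes, positions) and integer targets of shape (batch, positions)
loss = nn.CrossEntropyLoss()(logits.permute(0, 2, 1), ids)

recovered = logits.argmax(dim=-1)                    # (batch, 512) integer ids, ready for tokenizer.decode()

The point is the output format: the model emits raw logits over the vocabulary at every position, the targets stay as plain integer token ids (no normalization, no float conversion), and recovery is just an argmax followed by tokenizer.decode().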

I see. I figured that since my manual validation metric would be cosine similarity (1.0 = both tensors are the same), it should also be my training metric. Because my data is normalized, I'd need BCELoss + Sigmoid, right? This method seems to work but hits the token recovery issue again.

Re: Softmax + Cross-Entropy. It seems that if I do not use normalized data (allowing inputs to range over (0, 32100)), training does not converge beyond a loss of 4e+6. It also seems like my outputs typically range over (0, 1) rather than the full (0, 32100) spectrum. Or am I misunderstanding, and do I still need to normalize the input token ids?

    self.encoder = torch.nn.Sequential(
      torch.nn.Linear(512, 512), # Input is in the format (64x512) (Batch Size, Shape)
      torch.nn.Linear(512, 500), # I see many people do not apply activation on the output of the encoder, is this ok?
    )
    self.decoder = torch.nn.Sequential(
      torch.nn.Linear(500, 512),
      torch.nn.Linear(512, 512), # Ditto since I am using softmax.
    )


  def training_step(self, batch, batch_idx):
    x = batch.float()
    x_hat = self(x)  # forward pass through the AutoEncoder
    loss_fn = nn.CrossEntropyLoss()
    loss = loss_fn(x_hat, x)
    return loss

I think my output is incorrect. I see negatives and max values < 1.0. If I use MSE, the loss explodes (I assume because of the input values) and never settles below 1e+7.

With normalizing, it's also interesting that this works:

test = example * test_norm[0]
test = test.long()
print(tokenizer.decode(test, skip_special_tokens=True))

But I cannot recover example with a discrete mapping from (0, 32100) to (0, 1), much less the prediction outputs. Ex:

reconstructed_tokens = []
lookup = list(M.keys()) # where M holds a mapping from normalized values in (0, 1) -> original ids in (0, 32100)

for p in tx: # tx is one predicted (normalized) sequence
  # take the insertion point of the prediction among the known normalized values as the recovered id
  reconstructed_tokens.append(bisect.bisect(lookup, p.detach().item()))

res = torch.clamp(torch.tensor(reconstructed_tokens), 0, vocab_size-1)
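
And this is roughly how I'm spot-checking the round trip (example_ids here stands for the original, untouched input_ids row, as a tensor, that produced the prediction):

mismatches = (res != example_ids).sum().item()
print(mismatches, "of", len(example_ids), "tokens got mangled")
print(tokenizer.decode(res, skip_special_tokens=True))
print(tokenizer.decode(example_ids, skip_special_tokens=True))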