Recovering token ids from normalized input?

Trying to figure out conceptually what is wrong here. I have a flow that does the following:

Text → Produce Token Ids → Normalize Ids → AutoEncoder → Calculate CosineEmbeddingLoss.

This process seems to work and ultimately completes the task, but I cannot reproduce any of the inputs because the token ids are normalized, so tokenizer.decode() does not work. Is there a better way to do this?

Relevant code:

import torch
from torch import nn

class AE(nn.Module):
  def __init__(self):
    super().__init__()  # required before registering submodules
    self.encoder = torch.nn.Sequential(
      torch.nn.Linear(512, 512), # Input is in the format (Batch x 512)
      torch.nn.Linear(512, 256),
    )
    self.decoder = torch.nn.Sequential(
      torch.nn.Linear(256, 512),
      torch.nn.Linear(512, 512),
    )

  def forward(self, x):
    x = self.encoder(x)
    x = self.decoder(x)
    return x

And training

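  def training_step(self, batch, batch_idx):
    x = batch
    x_hat = self(x)  # assumption: the right-hand side is missing in the original post
    loss_fn = nn.CosineEmbeddingLoss()
    # A target of 1 marks each pair as "should be similar"; keep it on x's device.
    loss = loss_fn(x_hat, x, torch.ones(x.size(0), device=x.device))
    return loss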

I was thinking of applying F.normalize in the encoder, but again I am not sure how to undo that transform with the decoder or how I would emit outputs. Or do I need to swap nn.Sigmoid with nn.ReLU? (It seems CosineSim is scaling sensitive, so I am not sure if I’d need to swap my loss.)

Could you please elaborate a bit on what happens in the Normalize Ids step?

Sure. I am using the RobertaTokenizer from HuggingFace. From my text examples I will load the input_ids via dataset['train'] = tokenizer(dataset['train'], padding='max_length', truncation=True).input_ids and then normalize that data.

Ex: dataset['train'] = F.normalize(dataset['train'].float()). I’ve tried mapping the input_ids to their float equivalent from after normalization but this does not work b/c the Sigmoid can output any arbitrary float not just those of normalized input_ids.

Additionally, if I try using ReLU as my final output activation I find the cosine scores look ok but seem to produce small values (0.6…) rather than (0, 32100) which is my range of input ids.
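One thing that may explain the small values: cosine similarity only compares directions, so a loss built on it gives the model no signal about magnitude. The sketch below uses random ids as a hypothetical stand-in for real tokenizer output and shows that a normalized row and a rescaled row both score about 1.0 against the original.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical stand-in for a batch of token ids (2 rows of 8 ids each).
ids = torch.randint(1, 32100, (2, 8)).float()

# The same rows after L2 normalization: much smaller numbers, same direction.
unit = F.normalize(ids, dim=1)

# Cosine similarity ignores scale, so both comparisons come out near 1.0.
cos_small = F.cosine_similarity(ids, unit, dim=1)
cos_big = F.cosine_similarity(ids, 3.0 * ids, dim=1)

# CosineEmbeddingLoss with target 1 is 1 - cos, so it is near 0 here.
loss_fn = torch.nn.CosineEmbeddingLoss()
loss = loss_fn(unit, ids, torch.ones(ids.size(0)))

print(cos_small, cos_big, loss)
```

So a prediction can reach a near-perfect cosine score while sitting at any scale, which is why the raw outputs need not land in the (0, 32100) range.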

I am not sure about your use case.
But if you want to reconstruct the original values from normalized values, you can just multiply the normalized values by the norm.
i.e., store torch.norm(dataset['train'].float(), dim=1, keepdim=True) (F.normalize works per row by default, so keep one norm per row) and later multiply by it.
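A minimal sketch of that suggestion, assuming one stored norm per row and random ids as a hypothetical stand-in for the tokenizer output. It also rounds before casting back to integers, since .long() truncates and small float32 errors could otherwise shift an id down by one.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical stand-in for tokenizer output: 4 sequences of 512 ids each.
input_ids = torch.randint(0, 32100, (4, 512))

x = input_ids.float()
# F.normalize divides each row by its own L2 norm (dim=1 is the default),
# so keep one norm per row rather than a single norm for the whole tensor.
row_norms = torch.linalg.norm(x, dim=1, keepdim=True)
normalized = F.normalize(x, dim=1)

restored = normalized * row_norms
# .long() truncates toward zero, so 1233.9998 would become 1233.
# Rounding first absorbs the small float32 error.
recovered = torch.round(restored).long()

print(torch.equal(recovered, input_ids))
```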

That almost did the trick!

I am trying to do token reconstruction with an AutoEncoder. So the task is: given my model’s output, minimize the cosine distance between the generated and real text. The problem when I multiply by the norm is that the predicted values are sometimes greater than my maximum vocab size (ex: 32100). It also seems that if I try to validate this by recovering my offset, it's slightly lossy in that 4-5 tokens might get mangled (I'm guessing due to the long -> float conversion).


test_norm = torch.linalg.norm(tokenized_test['code'].float(), dim=1)
tokenized_test = F.normalize(tokenized_test['code'].float())
decode_text = test_example[0] * test_norm[0].to(device) # where test_example[0] is my first input_id sequence

This works well and gets me 99% of my tokenized_test string back. However, if I make a prediction it's unclear which norm I'd use. Would I use the norm of the row which was used for the example input? Or the mean of test/train?

My prediction output has a range of (0, 1) (I now clamp the last output with Sigmoid instead of ReLU because I realized ReLU would cause numeric instability since I’ve normalized). When I look at the two tensors I see the following:

# Test Tensor
tensor(0., device='cuda:0')
tensor(0.2775, device='cuda:0')
tensor(0.0109, device='cuda:0')

# Prediction Tensor
tensor(0., device='cuda:0', grad_fn=<UnbindBackward0>)
tensor(0.9348, device='cuda:0', grad_fn=<UnbindBackward0>)
tensor(0.0211, device='cuda:0', grad_fn=<MeanBackward0>)

So if I scale this tensor by test_example[0]'s norm I will get values well above my 32100 vocab size (91774...).

I imagine I am just doing something wrong with my translation of data and that perhaps I should use an encoding that is in the range of (0,1) vs (0, 32100)?
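If a fixed (0,1) encoding is what is wanted, one option is to divide by a constant rather than by a per-row norm, so the inverse does not depend on the input. The sketch below assumes a vocab size of 32100 (the range mentioned above); ids_to_unit and unit_to_ids are hypothetical helper names.

```python
import torch

VOCAB_SIZE = 32100  # taken from the id range mentioned in the thread

def ids_to_unit(ids: torch.Tensor) -> torch.Tensor:
    """Map integer ids in [0, VOCAB_SIZE - 1] to floats in [0, 1]."""
    return ids.float() / (VOCAB_SIZE - 1)

def unit_to_ids(values: torch.Tensor) -> torch.Tensor:
    """Invert ids_to_unit; clamp so model outputs never leave the vocab."""
    ids = torch.round(values * (VOCAB_SIZE - 1)).long()
    return ids.clamp(0, VOCAB_SIZE - 1)

torch.manual_seed(0)
ids = torch.randint(0, VOCAB_SIZE, (2, 16))
round_trip = unit_to_ids(ids_to_unit(ids))
out_of_range = unit_to_ids(torch.tensor([0.0, 1.2]))  # 1.2 is clamped to the last id

print(torch.equal(round_trip, ids), out_of_range)
```

This makes decoding well defined, but neighbouring ids are not related tokens, so even a small regression error in the output decodes to an unrelated token.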

When dealing with words (and their tokens) and similar discrete-value scenarios, I have come across architectures that mostly use a softmax final layer, i.e., the prediction ends up as a K-way classifier, where K = number of words/possible discrete values, trained with cross-entropy loss.
Is there any reason for you not following such an approach?
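A rough sketch of what such a K-way setup could look like for an autoencoder over token ids. The TokenAE class, its layer sizes, and the toy vocab are all assumptions for illustration, not something from this thread: ids go through an embedding, get compressed, and each position is then classified over the vocabulary.

```python
import torch
from torch import nn

class TokenAE(nn.Module):
    """Hypothetical sketch: embed ids, compress, then classify each position."""

    def __init__(self, vocab_size: int, seq_len: int, emb_dim: int = 16, latent_dim: int = 32):
        super().__init__()
        self.seq_len, self.emb_dim = seq_len, emb_dim
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.Sequential(nn.Linear(seq_len * emb_dim, latent_dim), nn.ReLU())
        self.decoder = nn.Linear(latent_dim, seq_len * emb_dim)
        self.to_logits = nn.Linear(emb_dim, vocab_size)  # K-way classifier per position

    def forward(self, ids: torch.Tensor) -> torch.Tensor:
        z = self.encoder(self.embed(ids).flatten(1))  # (batch, latent_dim)
        h = self.decoder(z).view(-1, self.seq_len, self.emb_dim)
        return self.to_logits(h)  # (batch, seq_len, vocab_size)

torch.manual_seed(0)
vocab_size, seq_len = 100, 8  # toy sizes; the thread uses 32100 and 512
model = TokenAE(vocab_size, seq_len)
ids = torch.randint(0, vocab_size, (4, seq_len))

logits = model(ids)
# CrossEntropyLoss takes raw logits (N, K) and integer class targets (N,).
loss = nn.CrossEntropyLoss()(logits.reshape(-1, vocab_size), ids.reshape(-1))
loss.backward()

pred_ids = logits.argmax(dim=-1)  # always valid ids, ready for tokenizer.decode
print(loss.item(), pred_ids.shape)
```

Because the prediction is an argmax over the vocabulary, every output is a valid id and no norm or rescaling has to be undone.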

I see. I figured since my manual validation would be cosine similarity (1.0 = both tensors are the same), that should be my training metric. Because my data is normalized I’d need BCELoss + Sigmoid right? This method seems to work but hits the token recovery issue again.

Re: Softmax + Cross-Entropy. It seems like if I do not use normalized data (allowing inputs to range from (0, 32100)) training does not converge beyond a loss of 4e+6. It also seems like my outputs typically range from (0,1) rather than the full (0,32100) spectrum. Or am I misunderstanding and still need to normalize the input token ids?

    self.encoder = torch.nn.Sequential(
      torch.nn.Linear(512, 512), # Input is in the format (64x512) (Batch Size, Shape)
      torch.nn.Linear(512, 500), # I see many people do not apply activation on the output of encoder, is this ok?
    )
    self.decoder = torch.nn.Sequential(
      torch.nn.Linear(500, 512),
      torch.nn.Linear(512, 512), # Ditto since I am using softmax.
    )


  def training_step(self, batch, batch_idx):
    x = batch.float()
    x_hat = self(x)  # assumption: the right-hand side is missing in the original post
    loss_fn = nn.CrossEntropyLoss()
    loss = loss_fn(x_hat, x)
    return loss

I think my output is incorrect. I see negatives and max values < 1.0. If I use MSE, the loss explodes (I assume because of the input values) and never settles below 1e+7.

With normalization, it’s also interesting that this works:
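One possible explanation for the very large cross-entropy loss, assuming a recent PyTorch (1.10 or later): when the target is a float tensor with the same shape as the input, nn.CrossEntropyLoss reads it as per-class probabilities, so each raw id ends up weighting a log-probability. The sketch uses random tensors as stand-ins for the model output and the batch.

```python
import torch
import torch.nn.functional as F
from torch import nn

torch.manual_seed(0)
logits = torch.randn(4, 512)  # stand-in for the decoder output
ids = torch.randint(0, 32100, (4, 512)).float()  # raw token ids cast to float

# Float targets with the same shape as the logits are read as per-class
# probabilities, so each id acts as a weight on a log-probability.
loss = nn.CrossEntropyLoss()(logits, ids)
manual = -(ids * F.log_softmax(logits, dim=1)).sum(dim=1).mean()

print(loss.item(), manual.item())
```

The negatives and sub-1.0 values in the output are also expected in this setup, since the final Linear layer emits unbounded logits rather than token ids.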

test = example * test_norm[0]
test = test.long()
print(tokenizer.decode(test, skip_special_tokens=True))

But I cannot recover example with a discrete mapping from (0, 32100) to (0,1), much less the prediction outputs. Ex:

reconstructed_tokens = []
lookup = list(M.keys()) # Where M holds a mapping from (0,1) -> (0,32100)

for p in tx:
  reconstructed_tokens.append(bisect.bisect(lookup, p.detach()))

res = torch.clamp(torch.tensor(reconstructed_tokens), 0, vocab_size-1)

Do you have any idea as to why many of these multi class CrossEntropy models include sigmoid? It seems like guidance typically says to pass raw linear values into the loss function, but I have also run into this sigmoid layer and have yet to figure out why. If anything, it seems, to me at least, it would more quickly introduce rounding error, but maybe there is some logic I’m missing?

This is because we want to truncate the result. If we have a linear output we could have exploding weights within the network whereas we want to say given N classes what is the probability of activation.

In this case (Cross Entropy) we have 2 options: Cross Entropy or Cross Entropy with logits. Passing your raw values into CE is fine because sigmoid will be applied, but CE w/ logits will not be needed if you have already applied sigmoid at the end. I’d imagine there are some performance implications of unbounded activations pre-CE calculation, so if you know you will be using CE, sigmoid makes sense.
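As a point of reference on the PyTorch side, and as far as the built-in losses go: nn.CrossEntropyLoss expects raw logits and applies log-softmax internally, while the sigmoid pairing belongs to nn.BCELoss (expects probabilities) and nn.BCEWithLogitsLoss (applies the sigmoid itself). The sketch below checks both equivalences on random data.

```python
import torch
import torch.nn.functional as F
from torch import nn

torch.manual_seed(0)
logits = torch.randn(5, 10)  # raw, unbounded scores for 10 classes
classes = torch.randint(0, 10, (5,))  # integer class targets
binary = torch.randint(0, 2, (5, 10)).float()  # independent 0/1 targets

# Multi-class: CrossEntropyLoss == log-softmax followed by NLLLoss.
ce = nn.CrossEntropyLoss()(logits, classes)
ce_manual = F.nll_loss(F.log_softmax(logits, dim=1), classes)

# Binary / multi-label: BCEWithLogitsLoss == sigmoid followed by BCELoss.
bce_logits = nn.BCEWithLogitsLoss()(logits, binary)
bce_manual = nn.BCELoss()(torch.sigmoid(logits), binary)

print(ce.item(), ce_manual.item(), bce_logits.item(), bce_manual.item())
```

So a sigmoid before nn.CrossEntropyLoss is not required; it only squashes the logits into (0, 1) before the loss applies its own softmax.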
