Using transformers for arbitrary sequences (events) and [CLS] embedding

This is going to be a somewhat lengthy question, but I believe it might be useful to many people trying to do something similar, as there are very few examples out there outside of NLP and CV.

I’m trying to solve the problem of general sequence modeling. Let’s say you have an app and users who are using this app. Users can log food, can read content, can talk to their coach, can measure their weight and much more, but let’s limit it to that.

Since I have timestamps, there is an order to the events. Each event has its own detailed data, for example:

  • coach message obviously contains text
  • food log can contain multiple foods from the database
  • weigh in contains a numerical value
  • content has some text too, but can also be treated as just a content ID; let’s ignore content for now

My idea is to build event transformer networks:

  • the main network, which will process events as embeddings: each event type will have its own embedding, and there will be a positional embedding, a day-of-the-program embedding, an embedding of the dt between events…
  • the user coach messages network, which will process the conversations: each message gets an embedding from an already pretrained DistilBERT network (average of the output embeddings), plus a positional embedding (or message number) and dts
  • the food logging network, which takes the sum of all the food embeddings for a given meal, again with dt and positional embeddings.

The idea is that the main network learns high-level behavior patterns, while the other two process the details of the information. This split makes batching easier and also lets us use each network separately for other tasks, for example something related to coaching or foods.

The main task is to predict engagement in the next week (actually I want to predict more things, but let’s limit it just to binary, will the user be engaged or not). The idea is to use outputs from all 3 networks for doing that prediction. Like in BERT, I was thinking about having a special classification embedding ([CLS]). So either all networks will use the same classification embedding or I will concatenate them and do the prediction.

In addition, I’d like each network to create good general representations. So for the messages network, I add a subtask: I randomly select a point in a conversation, say message number 30, drop all the messages after it, and either replace them with messages from some other conversation or keep the original ones. The task is then to predict whether the conversation is consistent or contains a random switch. I believe this task could force the network to learn a generally good representation of a conversation, and the network could then be used for other tasks related to user coach messages.
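To make the subtask concrete, here is a minimal sketch of how one training example could be built. The function name `make_splice_example` and the `p_swap` parameter are made up for illustration, the messages are plain strings rather than precomputed embeddings, and taking the replacement tail from the start of the other conversation is an assumption. The label convention (0 = random continuation) follows the one used later in this post.

```python
import random


def make_splice_example(conversation, other, rng, p_swap=0.5):
    """Return (messages, split_index, label) for the consistency task.

    label 1: the original conversation was kept.
    label 0: everything from split_index on was replaced (random continuation).
    """
    n = len(conversation)
    split = rng.randrange(1, n)  # keep at least one original message (needs n >= 2)
    if rng.random() < p_swap:
        # Assumption: the replacement tail is taken from the start of `other`.
        tail = other[:n - split]
        return conversation[:split] + tail, split, 0
    return list(conversation), split, 1


rng = random.Random(0)
conv = [f"user-coach message {i}" for i in range(6)]
other = [f"unrelated message {i}" for i in range(6)]
messages, split, label = make_splice_example(conv, other, rng)
```

The returned `split` index is what a “split” embedding could be keyed on.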

For food we can mask one of the foods and try to predict it.
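A rough sketch of the masking step, assuming foods are integer ids and that one id is reserved for a mask token. `MASK_ID`, `mask_one_food`, and the example ids are all hypothetical.

```python
import random

MASK_ID = 1  # hypothetical id reserved for the mask token (0 is assumed to be padding)


def mask_one_food(food_ids, rng):
    """Replace one randomly chosen food id with MASK_ID.

    Returns (masked_ids, position, target), where target is the id to predict.
    """
    position = rng.randrange(len(food_ids))
    masked = list(food_ids)  # copy so the original meal is left untouched
    target = masked[position]
    masked[position] = MASK_ID
    return masked, position, target


meal = [17, 42, 256]  # made-up food ids
masked_meal, position, target = mask_one_food(meal, random.Random(0))
```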

The conversation network is the first network I’m building, and I will use it as an example of where things go wrong. But the first set of general questions is:

  • Is there any reason that the transformer model is not well suited for processing the event feed?
  • Any reason why the classification embedding idea is not valid? It seems better to me than taking the average of the outputs, as attention will have an opportunity to learn that, for example, the last few events are much more important than some older events.

OK, now the implementation of the conversation network:

  • each message is represented by (dt_embedding, position_embedding, msg_embedding). The dt embedding is just discretized time: for example, idx=1 could represent the interval from 0 to 10 seconds and idx=200 the interval from 2 days to infinity, and each index maps to a 768-dimensional vector. The same goes for the positional embeddings; I don’t use any sin/cos representation, as I assumed the network can learn this. Unlike the dt and position embeddings, the msg_embedding comes from DistilBERT, and I’ve precomputed these vectors in advance for each message. So each timestep is represented by a 768-dimensional vector that is the sum dt_embedding + position_embedding + msg_embedding; the first two are trained, the last one is fixed.
  • in addition, I randomly initialize the CLS tensor and prepend it to the sequence, so this CLS tensor also goes through the transformer network; I then use the output at that position as the input to the classification network.
  • I also have a “split” embedding that marks where I randomly changed (or did not change) the sequence.
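As an illustration of the dt discretization, here is a small sketch that maps a time gap in seconds to an index in 1..200, leaving index 0 for padding. Only the first bucket (0 to 10 seconds) and the last one (2 days to infinity) come from the description above; the log-spaced boundaries in between and the name `dt_to_index` are assumptions.

```python
import bisect

NUM_BUCKETS = 200            # indices 1..200; index 0 is reserved for padding
FIRST_EDGE = 10.0            # seconds: bucket 1 covers [0, 10)
LAST_EDGE = 2 * 24 * 3600.0  # seconds: bucket 200 covers [2 days, inf)

# 199 boundaries, log-spaced between the first and last edge (an assumption).
N_EDGES = NUM_BUCKETS - 1
EDGES = [FIRST_EDGE * (LAST_EDGE / FIRST_EDGE) ** (i / (N_EDGES - 1))
         for i in range(N_EDGES)]


def dt_to_index(dt_seconds):
    """Map a non-negative time gap in seconds to a bucket index in 1..NUM_BUCKETS."""
    return bisect.bisect_right(EDGES, dt_seconds) + 1
```

The resulting index can be fed directly to an `nn.Embedding(201, 768, padding_idx=0)` layer like the one in the code that follows.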

Finally some code:

    dataset = HDF5NegativeSampling(settings.HDF5_TRANSFORMED_PATH,
                                   columns=['msg_no', 'dt', 'message'],
                                   # remaining arguments were cut off in the original post
                                   )
    cls_embedding = nn.Embedding(2, 768, padding_idx=0)
    msg_no_embedding = nn.Embedding(768, 768, padding_idx=0, scale_grad_by_freq=True)
    dt_embedding = nn.Embedding(201, 768, padding_idx=0, scale_grad_by_freq=True)
    conversation_embeddings = {
                               'dt': dt_embedding,
                               'msg_no': msg_no_embedding}  # this is a position embedding
    conversation_model = ConversationModel(
        conversation_embeddings, 768, cls_embedding,
        num_heads=4, num_layers=3, concat=False, hidden_size=2048, dropout=0.1).cuda()
    classifier = Classifier(input_size=768, output_size=1).cuda()

So here are the networks:

import torch
from torch import nn
from torch.nn import TransformerEncoder, TransformerEncoderLayer


class ConversationModel(nn.Module):
    """Model that takes sentence embeddings and runs them through transformer block."""

    def __init__(self, sequence_embeddings, embedding_dim, cls_embedding,
                 num_heads, num_layers, concat=False, hidden_size=2048, dropout=0.1):
        """
        :sequence_embeddings: dict, {feature_name: embedding layer}
        :embedding_dim: embedding dimension
        :cls_embedding: classification embedding
        :num_heads: transformer model parameter
        :num_layers: transformer model parameter
        :concat: should all the embeddings be concatenated, if False, they are summed
        :hidden_size: transformer model parameter
        :dropout: transformer model parameter
        """
        super(ConversationModel, self).__init__()
        self.sequence_embedding = sequence_embeddings
        self.cls_embedding = cls_embedding
        self.concat = concat
        transformer_encoder = TransformerEncoderLayer(embedding_dim, num_heads, dropout=dropout)
        self.transformer_stack = TransformerEncoder(transformer_encoder, num_layers)
        if concat:
            # + 1 due to cls_embedding
            self.linear1 = nn.Linear((len(sequence_embeddings) + 1) * embedding_dim, hidden_size)
            self.linear2 = nn.Linear(hidden_size, embedding_dim)

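        # Register the embedding layers so their parameters are part of the model.
        self.embedding_list = nn.ModuleList()
        for embedding in self.sequence_embedding.values():
            self.embedding_list.append(embedding)
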
    def forward(self, input_dict):
        """
        :input_dict: dict of tensors. Feature names for shared sequence embeddings
        - dt, msg_no...
        """
        if self.concat is False:
            shared_embeddings = [self.sequence_embedding[name](indices)
                                 for name, indices in input_dict.items()
                                 if name in {'msg_no'}]  # self.sequence_embedding]
            # shared_embeddings = []
            embeddings = torch.stack(shared_embeddings, dim=0).sum(dim=0)
        else:
            # TODO could be a bit tricky, requires a loop to be applied over each timestep
            raise NotImplementedError('concat=True is not implemented yet')
        cls_embedding = self.cls_embedding(
            torch.tensor([1], dtype=torch.int64).cuda()).repeat(embeddings.shape[0], 1, 1)
        # so when cls goes through transformer I'm going to use it with classification network
        output = torch.cat([cls_embedding, embeddings], dim=1)
        return self.transformer_stack(output)

The classification network:

class Classifier(nn.Module):
    def __init__(self, input_size, hidden_size=None, output_size=1, dropout=0.1):
        """
        :input_size: size of an input
        :hidden_size: if we want additional hidden layer, if None classification directly
            goes from the input
        :output_size: output size
        :dropout: dropout probability

        Both can be set to None if no numerical or categorical features.
        """
        super(Classifier, self).__init__()
        self.hidden_size = hidden_size
        if hidden_size:
            self.fc_hidden = nn.Linear(input_size, hidden_size)
            input_size = hidden_size
        self.fc_output = nn.Linear(input_size, output_size)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        if self.hidden_size:
            x = nn.functional.elu(self.fc_hidden(x))
        x = self.dropout(x)
        return self.fc_output(x)

Relevant part from the training loop:

self.loss = nn.BCEWithLogitsLoss()

def _forward(self, data):
        output = self.conversation_model(data)[:, 0, :]  # takes output of the CLS embedding to predict
        output = self.classifier(output)
        probabilities = torch.sigmoid(output)
        return output, probabilities

def calculate_loss(self, data, output, running_loss):
        loss = self.loss(output[:, 0], data['label'].type(torch.float32))
        running_loss += loss.item()
        return running_loss, loss

outputs, _ = self._forward(data)
running_loss, loss = self.calculate_loss(data, outputs, running_loss)

In other words, I push all the messages and the prepended CLS embedding through the transformer, take the 0th output, which should represent the transformed CLS embedding, and use it as the input to the classification network.

This all seems logical to me, but the unfortunate thing is that the loss stays at ln(2) and does not go down. While the task is not trivial, I believe that a human could do this with over 90% accuracy.
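For context on why ln(2) is a telling number: it is exactly the binary cross-entropy of a classifier that outputs a logit of 0 (probability 0.5) for every example, that is, one that ignores its input. The sketch below checks this with plain Python; `bce_with_logits` is a hand-written stand-in for the formula behind `nn.BCEWithLogitsLoss`, and the labels are made up.

```python
import math


def bce_with_logits(logit, label):
    """Binary cross-entropy for a single example, computed from a raw logit."""
    p = 1.0 / (1.0 + math.exp(-logit))  # sigmoid
    return -(label * math.log(p) + (1 - label) * math.log(1 - p))


labels = [0, 1, 0, 1]  # made-up labels
# A model that always outputs logit 0 predicts probability 0.5 for everything.
loss = sum(bce_with_logits(0.0, y) for y in labels) / len(labels)
print(loss, math.log(2))
```

So a loss pinned at ln(2) ≈ 0.693 suggests that no information about the input is reaching the classifier.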

For debugging purposes, I tried adding a 768-dimensional tensor of all 0.01s to the network input after the split point every time the label is 0 (random continuation). So I have a leaked label indicator, but the network still stays at the ln(2) loss value. When I check whether the CLS embedding has changed at all as a consequence of training, it has, but it’s just not informative enough.

The alternative I have tried is to average the outputs instead of using a CLS embedding, and the network still doesn’t learn in general. BUT when I leak the label as explained above, it starts learning after a while.

Any advice or obvious errors?

Well… this is embarrassing, but unfortunately the TransformerEncoderLayer documentation doesn’t seem to mention (although the Transformer class documentation does) that the input is expected as (sequence length, batch, embedding dim); instead, I passed (batch, sequence length, embedding dim).
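For anyone hitting the same issue, here is a minimal sketch of two ways to feed a (batch, sequence length, embedding dim) tensor in the expected layout. The sizes are small made-up values for illustration, and the `batch_first=True` argument is only available in more recent PyTorch releases, so treat the second option as an assumption about your version.

```python
import torch
from torch import nn

batch, seq_len, dim = 4, 10, 32       # small made-up sizes; the post uses dim=768
x = torch.randn(batch, seq_len, dim)  # (batch, sequence length, embedding dim)

# Option 1: default layout, so transpose to (sequence length, batch, embedding dim).
layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, dim_feedforward=64)
encoder = nn.TransformerEncoder(layer, num_layers=2)
out = encoder(x.transpose(0, 1))      # (seq_len, batch, dim)
cls_out = out[0]                      # output at position 0 for every sample: (batch, dim)

# Option 2: tell the layer that the batch dimension comes first.
layer_bf = nn.TransformerEncoderLayer(d_model=dim, nhead=4, dim_feedforward=64,
                                      batch_first=True)
encoder_bf = nn.TransformerEncoder(layer_bf, num_layers=2)
out_bf = encoder_bf(x)                # (batch, seq_len, dim)
cls_out_bf = out_bf[:, 0, :]          # (batch, dim)
```

With either option, position 0 of each sample (the prepended CLS embedding) attends over the messages of its own conversation.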

The reason why it trained at all was that I would either select the CLS embedding, which was 768-dimensional, or average across a dimension so that I got the 768 dimensions required by the classification network. And the miracle that the averaged version would actually start training when I leaked the label… well, neural networks work in mysterious ways.
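One way to see why the CLS variant could never learn with the swapped layout: the layer treats the batch axis as the sequence, so the CLS position only attends to the CLS positions of the other conversations in the batch, which are all the same vector. The sketch below, with small made-up sizes and dropout disabled, checks that the CLS output is then identical for every sample, while with the correct layout it differs.

```python
import torch
from torch import nn

torch.manual_seed(0)
batch, seq_len, dim = 3, 5, 16  # small made-up sizes
layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, dim_feedforward=32, dropout=0.0)
layer.eval()

cls = torch.randn(1, 1, dim).repeat(batch, 1, 1)  # same CLS vector for every sample
msgs = torch.randn(batch, seq_len, dim)           # different messages per sample
x = torch.cat([cls, msgs], dim=1)                 # (batch, 1 + seq_len, dim)

with torch.no_grad():
    # Swapped layout: the layer reads x as (sequence length, batch, dim).
    wrong = layer(x)[:, 0, :]
    # Expected layout: transpose in, transpose back out.
    right = layer(x.transpose(0, 1)).transpose(0, 1)[:, 0, :]

# True when the CLS outputs of sample 0 and sample 1 are (numerically) the same.
cls_constant_wrong = torch.allclose(wrong[0], wrong[1], atol=1e-5)
cls_constant_right = torch.allclose(right[0], right[1], atol=1e-5)
print(cls_constant_wrong, cls_constant_right)
```

A CLS output that is the same for every sample gives a constant prediction, which is consistent with the loss sitting at ln(2). Averaging over the outputs still includes the message positions, which may be why the leaked label was eventually picked up in that variant.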

In any case, any additional advice on general sequence modeling is welcome.