Training a sentence-level classifier for document-level classification using majority voting within the same model

Hi everyone,

I have read many helpful posts on this forum over the past years, and this is my first attempt at asking a question here myself. I think it’s great that this community exists!

I am struggling with the following problem: I have a very simple bi-LSTM model that classifies sentences (an nn.LSTM() followed by a single linear layer that maps to 5 classes), while ‘real-world’ data is likely to consist of longer sequences (i.e., a couple of sentences or a paragraph). Such data is not available in my situation, but the data that I do have (individual, shuffled sentences) is representative of the problem.
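
For concreteness, here is roughly what the current model looks like (the hidden size and the way I pool the LSTM states are placeholders, not my exact values):

```python
import torch
import torch.nn as nn

class SentenceClassifier(nn.Module):
    """Rough sketch of the current setup: a bi-LSTM over per-timestep
    feature vectors, followed by a single linear layer mapping to 5 classes.
    input_dim/hidden_dim are placeholders for my actual values."""

    def __init__(self, input_dim=50, hidden_dim=128, num_classes=5):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim,
                            batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, x):
        # x: (batch, seq_len, input_dim)
        _, (h_n, _) = self.lstm(x)
        # concatenate the final forward and backward hidden states
        h = torch.cat([h_n[-2], h_n[-1]], dim=-1)  # (batch, 2 * hidden_dim)
        return self.fc(h)                          # unnormalised logits
```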

My idea is to implement a sort of ‘majority voting’ mechanism. During training, the loss for every individual sample in a batch would be calculated with CrossEntropy as usual. Additionally, all samples within the batch that share the same label would be grouped to perform a ‘majority vote’, for example by averaging or summing the softmaxes of the linear-layer outputs. During inference, the majority vote becomes the model’s ‘final prediction’, from which I would compute the model’s accuracy; during training, a sample’s divergence from the majority vote would contribute an additional loss term (e.g., a KL divergence).
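
To make the idea concrete, this is a minimal sketch of the combined loss and the inference-time vote I have in mind (the kl_weight hyperparameter and function names are made up, and I haven’t settled on any of the details):

```python
import torch
import torch.nn.functional as F

def batch_loss(logits, labels, kl_weight=0.1):
    """Per-sample CrossEntropy as usual, plus a KL term that pulls each
    sample's softmax towards the averaged softmax ('majority vote') of
    all samples in the batch that share its label."""
    ce = F.cross_entropy(logits, labels)             # mean over the batch

    log_probs = F.log_softmax(logits, dim=-1)
    probs = log_probs.exp()

    kl = logits.new_zeros(())
    for c in labels.unique():
        mask = labels == c
        if mask.sum() < 2:
            continue                                 # nothing to vote with
        vote = probs[mask].mean(dim=0)               # group-average softmax
        # KL(vote || p_i) for every sample i in the group; the vote is
        # detached so each sample is pulled towards the group average,
        # not the other way around
        kl = kl + F.kl_div(log_probs[mask],
                           vote.detach().expand_as(log_probs[mask]),
                           reduction='batchmean')
    return ce + kl_weight * kl

@torch.no_grad()
def group_predict(logits):
    """Inference-time 'majority vote': average the softmaxes of all
    sentences belonging to one paragraph and take the argmax."""
    return F.softmax(logits, dim=-1).mean(dim=0).argmax()
```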

How should I implement this?

Specifically, I have the following questions:

  1. Is this a good idea?
  2. Is incorporating an extra loss term for divergence from the majority vote even helpful? Intuitively, I can imagine that it might encourage “agreement”, or that the model could learn to “put together different pieces of evidence”: every sentence may contain different hints that only reveal the class when taken together, while individually these hints are ambiguous and characteristic of multiple classes (which is actually the case for my specific data). Would it not do too much harm when the majority vote is actually wrong? And is KL divergence a good choice?
  3. Should I stick with random batches, which contain a varying number of samples per class label? I.e., in one batch the majority vote might be based on 3 samples, in another on 1 or 7. Since I don’t know what ‘real data’ would look like (this is a theoretical problem), I would like to report the model’s performance for different majority-voting group sizes, and random batches would let me do that ‘on the fly’. Moreover, I would think it could make the model more robust to different input lengths during inference. However, I am not too sure… The other option would be to make the dataloader always serve batches that contain a fixed number of samples per label (see the sketch below this list), but doesn’t that defeat the purpose of randomised batches in the first place?
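
For the fixed-group alternative in question 3, I imagine something along these lines (names and the per_label value are made up, and this is only a rough sketch of what I mean):

```python
import random
from collections import defaultdict
from torch.utils.data import Sampler

class FixedPerLabelBatchSampler(Sampler):
    """Every batch contains exactly `per_label` randomly chosen samples
    for each label, so the majority vote is always based on the same
    number of samples. `labels` is a list of ints, one per dataset index."""

    def __init__(self, labels, per_label=4):
        self.per_label = per_label
        self.by_label = defaultdict(list)
        for idx, y in enumerate(labels):
            self.by_label[y].append(idx)

    def __iter__(self):
        # shuffle each label's index pool, then draw fixed-size groups
        pools = {y: random.sample(idxs, len(idxs))
                 for y, idxs in self.by_label.items()}
        while all(len(p) >= self.per_label for p in pools.values()):
            batch = []
            for y in pools:
                batch.extend(pools[y][:self.per_label])
                pools[y] = pools[y][self.per_label:]
            random.shuffle(batch)
            yield batch

    def __len__(self):
        return min(len(idxs) for idxs in self.by_label.values()) // self.per_label
```

I would then pass it as `DataLoader(dataset, batch_sampler=FixedPerLabelBatchSampler(labels, per_label=4))`, so every batch has per_label × num_classes samples, but I am unsure whether this is preferable to plain random batches.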

To provide some extra context (just in case): I already trained the simple sentence-level model, but I observed that it was ‘cheating’ a bit on sentence length. That’s why I now plan to train all of the above not on full sentences but on fixed-length n-grams of sentences; since I don’t have much data, this also gives me more samples to train on. The inputs are not word embeddings but 50-dimensional one-hot vectors of part-of-speech tags (one at every timestep of the LSTM). These sequences of vectors are definitely not always unique to a single class, but they may be characteristic of one.
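
In case it clarifies the input format: this is roughly how I plan to build the fixed-length n-gram samples from a sentence’s part-of-speech tags (the window length n and the helper name are made up for illustration):

```python
import torch
import torch.nn.functional as F

def pos_ngrams_to_tensors(pos_ids, n=10, num_tags=50):
    """Slide a window of fixed length `n` over a sentence's POS-tag ids
    and one-hot encode each window.
    Returns a tensor of shape (num_windows, n, num_tags)."""
    windows = [pos_ids[i:i + n] for i in range(len(pos_ids) - n + 1)]
    if not windows:
        return torch.empty(0, n, num_tags)
    idx = torch.tensor(windows)                      # (num_windows, n)
    return F.one_hot(idx, num_tags).float()
```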

Thank you very much for reading this; I am really eager to learn how I can best achieve this!

~ Damiaan