Dynamic number of output labels during training

Hi

I am training a model continuously, and the user may add new output labels during training. For the sake of clarity, let's say that the base number of labels is 3 and that we use a simple linear model with an input embedding dimension of 5. Hence the weight matrix of the linear layer has shape (3, 5).

I wonder what the best strategy is to account for the fact that the current number of labels may grow in the future.

  1. When initializing the model, use a bigger output dimension, e.g. 10 instead of 3, and hope that the user will never add more than 7 labels. We then have a matrix of shape (10, 5), but only the first 3 rows are actually used at the beginning of training. I see several issues: a) wasted compute, since the last 7 rows do useless work for labels that are not yet present in the dataset, and b) the model may learn to drive those 7 rows towards zero, since those labels are never the correct target. Then, when I add a 4th label and the 4th row starts being trained, I worry that it won't backpropagate anything useful because of those zeros.
  2. When the user adds a new label, simply extend the previous (3, 5) weight matrix into a (4, 5) matrix: keep the first 3 rows from the old matrix and properly initialize the 4th row at random (with Kaiming init, for instance). See the sketch after this list.
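
Concretely, for option 2 I have something like this in mind (a rough PyTorch sketch; `expand_output_layer` is just a name I made up):

```python
import torch
import torch.nn as nn

def expand_output_layer(old: nn.Linear, n_new_labels: int) -> nn.Linear:
    """Grow a linear output head by n_new_labels rows, keeping the old weights."""
    new = nn.Linear(old.in_features,
                    old.out_features + n_new_labels,
                    bias=old.bias is not None)
    # nn.Linear's default init is already Kaiming-uniform, so the extra
    # rows are properly initialized; we only copy the trained rows back in.
    with torch.no_grad():
        new.weight[: old.out_features].copy_(old.weight)
        if old.bias is not None:
            new.bias[: old.out_features].copy_(old.bias)
    return new

# Example: 3 labels, embedding dimension 5, then the user adds a 4th label.
head = nn.Linear(5, 3)
head = expand_output_layer(head, 1)
print(head.weight.shape)  # torch.Size([4, 5])
```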

To my knowledge, option 2 seems the better one, but I've never really seen anyone do it, so maybe I have missed some important issues.

I also thought this might break with an adaptive optimizer like Adam, since Adam keeps per-parameter state (running estimates of the gradient's first and second moments). If I add new parameters, that state no longer matches the weight shapes. But maybe I can do some surgery on the optimizer as well, similar to what I did with the model, as in the sketch below.
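
Here is the kind of surgery I was imagining (a sketch that pokes at optimizer internals, assuming plain `torch.optim.Adam` without amsgrad; `expand_adam_state` is my own name):

```python
import torch

def expand_adam_state(optimizer: torch.optim.Adam,
                      old_param: torch.nn.Parameter,
                      new_param: torch.nn.Parameter) -> None:
    """Carry Adam's per-parameter state over to the enlarged parameter.

    The moment estimates of the old rows are kept; the new rows get zero
    moments, exactly as a freshly created parameter would.
    """
    state = optimizer.state.pop(old_param, None)
    if state is not None:
        for key in ("exp_avg", "exp_avg_sq"):
            padded = torch.zeros_like(new_param)
            padded[: state[key].shape[0]] = state[key]
            state[key] = padded
        optimizer.state[new_param] = state
    # Make the param group reference the new parameter object.
    for group in optimizer.param_groups:
        group["params"] = [new_param if p is old_param else p
                           for p in group["params"]]
```

The lazy alternative would be to just recreate the optimizer after every label addition, at the cost of resetting the moment estimates (and the step counter used for bias correction) for the rows that were already trained.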

Any advice?